Flume is used to collect log data; this post records the following usage scenarios:
Scenario 1: use an avro source, memory channel, and logger sink to print the collected logs to standard output; suitable for testing.
Scenario 2: use an avro source, kafka channel, and hdfs sink to save the logs on HDFS with the "Flume Event" Avro Event Serializer. In the .avro files this produces, every record has two fields, headers and body: headers holds the Flume event header map, and body is a bytes field containing the data you actually want. The schema extracted from such a .avro file looks like this:
{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}
Scenario 3: use an avro source, kafka channel, and hdfs sink to save the logs with the Avro Event Serializer, so that the saved .avro files can be read directly for statistics and computation. Depending on the serializer class chosen, there are two configurations:
The first uses the serializer from Cloudera's CDK, org.apache.flume.serialization.AvroEventSerializer$Builder, to write messages to HDFS. This serializer requires the Flume event headers to carry "flume.avro.schema.literal" or "flume.avro.schema.url" naming the avro schema of the data; otherwise nothing is saved and Flume reports an error. Note for this configuration:
(1) The sink-side serializer comes from the CDK. The source is on GitHub and can be built yourself (my build failed), or the jar can be downloaded directly: https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/
The second uses org.apache.flume.sink.hdfs.AvroEventSerializer$Builder, shipped with Apache Flume. This serializer does not require the avro schema in the event headers, but the schema URL must be specified on the sink instead.
For either configuration to work, note the following:
(1) When an avro client sends non-String data, e.g. ordinary Java objects, the data must be converted to a byte array before sending. Do not serialize the Java object directly with Java serialization; use the Avro API to encode it against the given schema. Otherwise the avro files read back from HDFS are unusable, because Avro cannot decode them.
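To make note (1) concrete: bytes produced by plain Java serialization carry Java's own stream format, not Avro binary encoding, so an Avro decoder cannot interpret them. A minimal, self-contained sketch (class and method names here are illustrative, not from the project):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializationHeaderDemo {
    // Serialize with plain Java serialization, as one might naively do
    // before handing the bytes to Flume.
    static byte[] javaSerialize(Object obj) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oOut = new ObjectOutputStream(out)) {
            oOut.writeObject(obj);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = javaSerialize("hello");
        // Java serialization prepends its own stream magic (0xACED); an Avro
        // decoder expects schema-conforming binary data, not this header.
        System.out.printf("first two bytes: %02x %02x%n", bytes[0], bytes[1]);
    }
}
```

Whatever the payload, the output always begins with Java's stream magic, which is why such bytes must instead be produced with Avro's encoder (shown in scenario 3).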
Scenario 4: use a Spooling Directory Source, memory channel, and hdfs sink to watch for generated avro files; this effectively uploads avro files from a source server to a target server. Note the following:
(1) Assemble the avro files in a directory other than the spooling directory, and move them in only once they are complete; otherwise the agent will error out.
(2) Once a file in the spooling directory has been read by Flume, any further changes to it are ignored.
(3) Every file placed in the spooling directory must have a unique name, otherwise errors will occur.
The configuration and related Java code for each scenario are recorded in detail below:
Scenario 1 in detail:
This is the configuration and demo from the official documentation; it lets a new user see what Flume does quickly.
The configuration:
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
The client that sends avro events:
package com.learn.flume;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize the client with the remote Flume agent's host and port.
        client.init("127.0.0.1", 41414);
        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        String sampleData = "china";
        for (int i = 0; i < 10; i++) {
            client.sendStringDataToFlume(sampleData);
        }
        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;
    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Set up the RPC connection.
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
        // this.client = RpcClientFactory.getInstance(hostname, port);
        // Use the following method to create a thrift client (instead of the above line):
        // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event whose body is the UTF-8 encoded string.
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // event.getHeaders().put("kkkk", "aaaaa");
        // Send the event.
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // Clean up and recreate the client.
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void cleanUp() {
        // Close the RPC connection.
        client.close();
    }
}
This sends Flume events over the Netty-based RPC client. The payload here is a String, since the message body ultimately travels through Flume as bytes.
Scenario 2 in detail: scenario 2 uses the "Flume Event" Avro Event Serializer, so the data saved on HDFS contains the Flume event headers in addition to the message body itself.
Its Flume configuration:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k1
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#add the timestamp interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************k1 uses "Flume Event" Avro Event Serializer start*************************#
a1.sinks.k1.channel = kafka_channel
a1.sinks.k1.type = hdfs
#use the default partition granularity
##a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data
#partition by day; this requires a timestamp in the event headers, hence the timestamp interceptor on the source
a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/month=%m/day=%d
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.fileSuffix=.avro
#use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k1.hdfs.inUsePrefix=_
a1.sinks.k1.serializer=avro_event
##************k1 uses "Flume Event" Avro Event Serializer end*************************#
The avro client code:
package com.learn.flume;

import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize the client with the remote Flume agent's host and port.
        client.init("127.0.0.1", 41414);
        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }
        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;
    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Set up the RPC connection.
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event whose body is the UTF-8 encoded string.
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // Send the event.
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // Clean up and recreate the client.
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {
        // The event body is the Java-serialized form of the object.
        Event event = EventBuilder.withBody(ByteArrayUtils.objectToBytes(data).get());
        // Send the event.
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // Clean up and recreate the client.
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void cleanUp() {
        // Close the RPC connection.
        client.close();
    }
}
This client builds on scenario 1; the difference is the sendObjectDataToFlume(Object data) method, which sends a Java object rather than a plain String. The object has to be converted to a byte array before the client's send method is called, so I wrote the following utility class for converting between Java objects and byte arrays:
package com.learn.utils;

import java.io.*;
import java.util.Optional;

public class ByteArrayUtils {

    /**
     * Serialize a Java object to a byte array.
     */
    public static <T> Optional<byte[]> objectToBytes(T obj) {
        byte[] bytes = null;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ObjectOutputStream sOut;
        try {
            sOut = new ObjectOutputStream(out);
            sOut.writeObject(obj);
            sOut.flush();
            bytes = out.toByteArray();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(bytes);
    }

    /**
     * Deserialize a byte array back into a Java object.
     */
    public static <T> Optional<T> bytesToObject(byte[] bytes) {
        T t = null;
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        ObjectInputStream sIn;
        try {
            sIn = new ObjectInputStream(in);
            t = (T) sIn.readObject();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(t);
    }
}
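As a quick sanity check, the two helpers above amount to an ObjectOutputStream/ObjectInputStream round trip. A self-contained sketch of that round trip (the Point class is a made-up stand-in for UserModel):

```java
import java.io.*;

public class ByteArrayRoundTripDemo {
    // A minimal Serializable stand-in for UserModel.
    static class Point implements Serializable {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Same logic as ByteArrayUtils.objectToBytes, inlined so the example
    // is self-contained.
    static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream sOut = new ObjectOutputStream(out)) {
            sOut.writeObject(obj);
        }
        return out.toByteArray();
    }

    // Same logic as ByteArrayUtils.bytesToObject.
    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream sIn = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) sIn.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = fromBytes(toBytes(new Point(3, 4)));
        System.out.println(p.x + "," + p.y);
    }
}
```

Note that this round trip only works because the reader also uses Java deserialization; as scenario 3 explains, Avro tooling cannot decode these bytes.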
With this, messages are delivered and stored on HDFS in avro format. The file contents can then be read with the Avro API and deserialized back into objects:
public static void read2() throws IOException {
    // The schema is stored in the file header, so no schema needs to be
    // supplied for deserialization.
    Configuration configuration = new Configuration();
    String hdfsURI = "hdfs://localhost:9000/";
    String hdfsFileURL = "flume_data/year=2018/month=02/day=07/FlumeData.1517970870974.avro";
    FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
    //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
    //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI + hdfsFileURL));
    AvroFSInput avroFSInput = new AvroFSInput(fileContext, new Path(hdfsURI + hdfsFileURL));
    DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(avroFSInput, genericRecordDatumReader);
    // The schema can be read back from the file itself.
    Schema schema = fileReader.getSchema();
    System.out.println("Get Schema Info:" + schema);
    GenericRecord genericUser = null;
    while (fileReader.hasNext()) {
        // Reuse the record instance across iterations to reduce object
        // allocation and garbage collection.
        genericUser = fileReader.next(genericUser);
        byte[] o = ((ByteBuffer) genericUser.get("body")).array();
        UserModel userModel = ByteArrayUtils.<UserModel>bytesToObject(o).get();
        System.out.println(userModel);
        System.out.println(userModel.age);
    }
    fileReader.close();
}
That recovers the stored data, but it feels cumbersome. The model class used:
package com.learn.model;

import java.io.Serializable;

public class UserModel implements Serializable {
    public String name;
    public Integer age;
    private String address;
    private String job;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getJob() {
        return job;
    }

    public void setJob(String job) {
        this.job = job;
    }

    @Override
    public String toString() {
        return "UserModel{" +
                "name='" + name + '\'' +
                ", age=" + age +
                ", address='" + address + '\'' +
                ", job='" + job + '\'' +
                '}';
    }
}
Scenario 3 in detail (the important one): in my view scenario 3 is the best fit for production. Servers send the records they want logged as objects over avro RPC to Flume, which stores them on HDFS. Unlike scenario 2, the stored files contain only the message content itself and can be read directly by a compute engine.
The Flume configuration is as follows:
Option 1: use the serializer that is not shipped with Apache Flume, i.e. org.apache.flume.serialization.AvroEventSerializer$Builder:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#add interceptors
a1.sources.r1.interceptors=i1 i2
a1.sources.r1.interceptors.i1.type=timestamp
#the avro schema url must be specified
a1.sources.r1.interceptors.i2.type=static
a1.sources.r1.interceptors.i2.key=flume.avro.schema.url
a1.sources.r1.interceptors.i2.value=hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************k2 uses Avro Event Serializer start*************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
#partition by day; this requires a timestamp in the event headers, hence the timestamp interceptor on the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/month=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
#use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************k2 uses Avro Event Serializer end*************************#
Note the following about this configuration:
(1) Two interceptors are added to the avro source: a timestamp interceptor, which drives the date-based directory layout on HDFS, and a static interceptor, which records the avro schema URL of the message; the sink serializer requires this parameter. Both interceptors only add entries to the Flume event headers and do not affect the message body.
(2) Mind the serializer class specified on the hdfs sink:
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
This is not the class shipped with Flume itself; a separate jar must be downloaded and placed in the lib directory of the Flume installation. There are two ways to get the jar:
either build it from source: https://github.com/cloudera/cdk
or download the prebuilt jar: https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/
Option 2: use the serializer shipped with Apache Flume, i.e. org.apache.flume.sink.hdfs.AvroEventSerializer$Builder:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#add the timestamp interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************k2 uses Avro Event Serializer start*************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
#partition by day; this requires a timestamp in the event headers, hence the timestamp interceptor on the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/month=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
#use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
a1.sinks.k2.serializer.schemaURL = hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************k2 uses Avro Event Serializer end*************************#
This serializer ships with Apache Flume, so no extra jar is needed.
The avro client code:
package com.learn.flume;

import com.learn.model.UserModel;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;

public class MyApp {
    public static void main(String[] args) throws IOException {
        MyRpcClientFacade client = new MyRpcClientFacade();
        // Initialize the client with the remote Flume agent's host and port.
        client.init("127.0.0.1", 41414);
        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }
        client.cleanUp();
    }
}

class MyRpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;
    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Set up the RPC connection.
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event whose body is the UTF-8 encoded string.
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        event.getHeaders().put("kkkk", "aaaaa");
        // Send the event.
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // Clean up and recreate the client.
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
        Schema schema = new Schema.Parser().parse(inputStream);
        Event event = EventBuilder.withBody(serializeAvro(data, schema));
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // Clean up and recreate the client.
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void cleanUp() {
        // Close the RPC connection.
        client.close();
    }

    /**
     * Encode the object with Avro's binary encoder; bytes produced any other
     * way cannot be read back from HDFS.
     */
    public byte[] serializeAvro(Object datum, Schema schema) throws IOException {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ReflectDatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(outputStream, null);
        outputStream.reset();
        writer.write(datum, binaryEncoder);
        binaryEncoder.flush();
        return outputStream.toByteArray();
    }
}
The key difference from scenario 2 is how a Java object is turned into a byte array. Scenario 2 used plain Java serialization, whereas here the byte array must be produced with Avro's encoder and the schema before being sent over RPC. Otherwise, even though the data lands on HDFS, it cannot be parsed as Avro.
The schema for the model class:
{"namespace": "com.learn.avro",
"type": "record",
"name": "UserModel",
"fields": [
{"name": "name", "type": "string"},
{"name": "address", "type": ["string", "null"]},
{"name": "age", "type": "int"},
{"name": "job", "type": "string"}
]
}
This schema should be uploaded to HDFS.
Once the data has been saved to HDFS, it can be read back with:
public static void read1() throws IOException {
    // The schema is stored in the file header, so no schema needs to be
    // supplied for deserialization.
    Configuration configuration = new Configuration();
    String hdfsURI = "hdfs://localhost:9000/";
    String hdfsFileURL = "flume_data/year=2018/month=02/day=09/FlumeData.1518169360741.avro";
    FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
    //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
    //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI + hdfsFileURL));
    AvroFSInput avroFSInput = new AvroFSInput(fileContext, new Path(hdfsURI + hdfsFileURL));
    DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(avroFSInput, genericRecordDatumReader);
    // The schema can be read back from the file itself.
    Schema schema = fileReader.getSchema();
    System.out.println("Get Schema Info:" + schema);
    GenericRecord genericUser = null;
    while (fileReader.hasNext()) {
        // Reuse the record instance across iterations to reduce object
        // allocation and garbage collection.
        genericUser = fileReader.next(genericUser);
        System.out.println(genericUser);
    }
    fileReader.close();
}
Scenario 4 in detail: this scenario uses Flume to upload avro-format files to HDFS for later querying and other work.
The Flume configuration:
# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir=/home/shenjin/data/avro/
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL
# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume_data
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.hdfs.filePrefix = event
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1
Note the following about this configuration:
(1) The agent's source type is spooldir, which watches a directory for newly added files, using AVRO as its deserializer.
(2) The serializer specified on the sink is not the one shipped with Apache Flume; see scenario 3 for the jar download.
(3) hdfs.rollSize = 0 and hdfs.rollCount = 0 ensure a file is not split into multiple files when uploaded to HDFS.
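Notes (1) and (3) from the scenario overview can be handled together by staging each file outside the spooling directory and then moving it in under a unique name. Files.move with ATOMIC_MOVE is atomic when source and target are on the same filesystem, so the agent never observes a partially written file. A sketch under that assumption (directory paths and the naming scheme are illustrative, not from the original):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class SpoolDirHandoff {
    // Move a fully written file from a staging directory into the spooling
    // directory under a unique name; the atomic move means the Flume agent
    // never sees a half-written file.
    static Path handOff(Path staged, Path spoolDir) throws IOException {
        String uniqueName = System.currentTimeMillis() + "-" + UUID.randomUUID() + ".avro";
        return Files.move(staged, spoolDir.resolve(uniqueName),
                StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in directories; in practice spoolDir is the directory
        // configured as spooldir-source1.spoolDir.
        Path staging = Files.createTempDirectory("staging");
        Path spool = Files.createTempDirectory("spool");
        Path staged = Files.write(staging.resolve("events.avro"), new byte[]{1, 2, 3});
        Path delivered = handOff(staged, spool);
        System.out.println("delivered as: " + delivered.getFileName());
    }
}
```

The timestamp-plus-UUID name also satisfies the uniqueness requirement in note (3).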