Notes on Flume Usage Scenarios

Flume is used to collect log data. This post records the following usage scenarios:


Scenario 1: use an avro source, a memory channel, and a logger sink to print the collected logs to standard output. This is suitable for testing.


Scenario 2: use an avro source, a kafka channel, and an hdfs sink to store logs on HDFS with the "Flume Event" Avro Event Serializer. Every record in the .avro files produced this way contains two fields, headers and body: headers holds the Flume event header information, and body is a bytes field carrying the actual payload. The schema extracted from such an .avro file is:

{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}


Scenario 3: use an avro source, a kafka channel, and an hdfs sink to store logs with the Avro Event Serializer, so that the saved .avro files can be read directly for statistics and computation. Depending on which serializer class is chosen, there are two configuration approaches:

    The first: use the org.apache.flume.serialization.AvroEventSerializer$Builder serializer class provided by Cloudera's CDK to serialize messages to HDFS. This serializer requires the Flume event headers to contain either "flume.avro.schema.literal" or "flume.avro.schema.url" to specify the Avro schema of the data; otherwise the data cannot be written and Flume reports an error. Note the following when using this configuration:

  (1) The sink-side serializer comes from the CDK project. The source code is on GitHub and can be built yourself (my build failed), or the jar can be downloaded directly: https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/


The second: use the org.apache.flume.sink.hdfs.AvroEventSerializer$Builder serializer class provided by Apache Flume to write the data to HDFS. This serializer does not require the Avro schema in the event headers, but the Avro schema URL must be specified on the sink.

For either of these configurations to work, note the following point:

    (1) When sending data other than Strings with the avro client, e.g. ordinary Java objects, the data must first be converted into a byte array before it can be sent. Do not serialize the Java object to bytes directly; it has to be encoded with the Avro API against the given schema. Otherwise the avro files read back from HDFS are unusable, because Avro cannot decode them (see the sketch below and the serializeAvro method in the Scenario 3 walkthrough).
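A minimal sketch of that Avro-based conversion, assuming the UserModel schema introduced in the Scenario 3 walkthrough below (the helper name toAvroBytes is only for illustration; it mirrors the serializeAvro method shown there):

    // Sketch: encode a POJO with Avro's ReflectDatumWriter against a known schema,
    // instead of plain Java serialization, so the bytes stored on HDFS stay decodable.
    // Requires org.apache.avro.Schema, org.apache.avro.io.BinaryEncoder,
    // org.apache.avro.io.EncoderFactory and org.apache.avro.reflect.ReflectDatumWriter.
    public static byte[] toAvroBytes(Object datum, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ReflectDatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        return out.toByteArray();   // use this byte[] as the Flume event body
    }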


Scenario 4: use a Spooling Directory Source, a memory channel, and an hdfs sink to watch for generated avro files, effectively uploading the avro files from the source server to the target server. Note the following points:

    (1) Prepare the avro files to be uploaded in a directory other than the spooling directory and move them into the spooling directory only once they are complete; otherwise the agent will fail.

    (2) Once a file in the spooling directory has been read by Flume, any further changes to it have no effect.

    (3) The names of all files placed in the spooling directory must be unique; otherwise errors occur.



The configuration and related Java code for each scenario are documented in detail below:

Scenario 1 in detail:

   This is the configuration and demo provided in the official documentation; it lets you get a quick feel for what Flume does.

    The configuration is as follows:

a1.channels = c1
a1.sources = r1
a1.sinks = k1

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sources.r1.type = avro

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

 The client that sends avro events is as follows:

package com.learn.flume;

import com.learn.model.UserModel;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);

        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        String sampleData = "china";
        for (int i = 0; i < 10; i++) {

            client.sendStringDataToFlume(sampleData);
        }
        
        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;
    private static Properties p= new Properties();

    static {
        p.put("client.type","default");
        p.put("hosts","h1");
        p.put("hosts.h1","127.0.0.1:41414");
        p.put("batch-size",100);
        p.put("connect-timeout",20000);
        p.put("request-timeout",20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
       // this.client = RpcClientFactory.getInstance(hostname, port);
        // Use the following method to create a thrift client (instead of the above line):
        // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event with the string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        //event.getHeaders().put("kkkk","aaaaa");
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client =  RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}

Here the default Netty-based Avro RPC client is used to send events to Flume. In this example the body is a String, which is converted to bytes on sending, since event bodies are ultimately carried through Flume as byte arrays.


Scenario 2 in detail: Scenario 2 uses the "Flume Event" Avro Event Serializer, so the data saved on HDFS contains the Flume event headers in addition to the payload itself.

Its Flume configuration is as follows:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k1

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add an interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************ k1 uses the "Flume Event" Avro Event Serializer: start *************************#
a1.sinks.k1.channel = kafka_channel
a1.sinks.k1.type = hdfs
# use the default partition granularity (no date-based path)
##a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data
# partition by day; this requires the timestamp header, i.e. the timestamp interceptor added to the source
a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.fileSuffix=.avro
# set the in-use file prefix to "_" because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k1.hdfs.inUsePrefix=_
a1.sinks.k1.serializer=avro_event
##************ k1 uses the "Flume Event" Avro Event Serializer: end *************************#

The avro client code:

package com.learn.flume;

import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);

        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }

        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;

        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event with the string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {


        Event event = EventBuilder.withBody(ByteArrayUtils.objectToBytes(data).get());
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }
     public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}

This client is a modification of the Scenario 1 client. The difference is the sendObjectDataToFlume(Object data) method, which sends a Java object rather than a plain String message. The object has to be converted to a byte array before the client's send method is called, so the following utility class converts between Java objects and byte arrays:

package com.learn.utils;

import java.io.*;
import java.util.Optional;

public class ByteArrayUtils {

    /**
     * Convert a Java object to a byte array
     * @param obj
     * @param <T>
     * @return
     */
    public static<T> Optional<byte[]> objectToBytes(T obj){
        byte[] bytes = null;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ObjectOutputStream sOut;
        try {
            sOut = new ObjectOutputStream(out);
            sOut.writeObject(obj);
            sOut.flush();
            bytes= out.toByteArray();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(bytes);
    }

    /**
     * Convert a byte array back to a Java object
     * @param bytes
     * @param <T>
     * @return
     */
    public static<T> Optional<T> bytesToObject(byte[] bytes) {
        T t = null;
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        ObjectInputStream sIn;
        try {
            sIn = new ObjectInputStream(in);
            t = (T)sIn.readObject();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(t);

    }
}
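For completeness, a quick usage sketch of this utility class (the field value is illustrative only):

    UserModel userModel = new UserModel();
    userModel.setName("shenjin");
    // serialize to bytes, then deserialize back again
    byte[] raw = ByteArrayUtils.objectToBytes(userModel).get();
    UserModel restored = ByteArrayUtils.<UserModel>bytesToObject(raw).get();
    System.out.println(restored);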

With this in place the messages are sent and stored on HDFS in avro format. The file contents can then be read with the Avro API and deserialized back into objects:

public static void read2() throws IOException {
        // the schema is stored in the file header, so no schema needs to be supplied for deserialization
        Configuration configuration = new Configuration();
        String hdfsURI = "hdfs://localhost:9000/";
        String hdfsFileURL = "flume_data/year=2018/moth=02/day=07/FlumeData.1517970870974.avro";
        FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
        //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
        //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
        AvroFSInput avroFSInput  = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
        DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
        DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(avroFSInput,genericRecordDatumReader);
        // the schema can be obtained from the file being read
        Schema schema = fileReader.getSchema();
        System.out.println("Get Schema Info:"+schema);
        GenericRecord genericUser = null;
        while (fileReader.hasNext()) {
            // pass the record back into next() to reduce object allocation and garbage collection
            genericUser = fileReader.next(genericUser);
            byte[] o = ((ByteBuffer)genericUser.get("body")).array();
            UserModel userModel= (UserModel)(((Optional)ByteArrayUtils.bytesToObject(o)).get());
            System.out.println(userModel);
            System.out.println(userModel.age);
        }
        fileReader.close();
    }

This recovers the stored information, but it feels rather cumbersome. The model class used is:

package com.learn.model;

import java.io.Serializable;

public class UserModel implements Serializable{

    public String name ;

    public Integer age;

    private String address;

    private String job;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getJob() {
        return job;
    }

    public void setJob(String job) {
        this.job = job;
    }

    @Override
    public String toString() {
        return "UserModel{" +
                "name='" + name + '\'' +
                ", age=" + age +
                ", address='" + address + '\'' +
                ", job='" + job + '\'' +
                '}';
    }
}


Scenario 3 in detail (important): in my opinion Scenario 3 is the best fit for production. The application sends the information to be recorded, as objects, to Flume over Avro RPC, and Flume stores it on HDFS. Unlike Scenario 2, the stored records contain only the message content itself, so a computation engine can read them directly.

The Flume configuration is as follows:

Option 1: use the serializer class that does not ship with Apache Flume, i.e. org.apache.flume.serialization.AvroEventSerializer$Builder. The configuration is:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add interceptors
a1.sources.r1.interceptors=i1 i2
a1.sources.r1.interceptors.i1.type=timestamp
# the avro schema url must be specified
a1.sources.r1.interceptors.i2.type=static
a1.sources.r1.interceptors.i2.key=flume.avro.schema.url
a1.sources.r1.interceptors.i2.value=hdfs://localhost:9000/flume_schema/UserModel.avsc

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************ k2 uses the Avro Event Serializer: start *************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
# partition by day; this requires the timestamp header, i.e. the timestamp interceptor added to the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
# set the in-use file prefix to "_" because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************ k2 uses the Avro Event Serializer: end *************************#

Note the following points about this configuration:

   (1) Two interceptors are added on the avro source. One is the timestamp interceptor, which determines the directory layout of the files on HDFS; the other is a static interceptor that sets the Avro schema URL for the messages, a parameter the sink serializer requires. Both interceptors only add entries to the Flume event headers and do not affect the event body.

(2) Note the serializer class specified on the hdfs sink:

a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder

This is not the class shipped with Flume itself; a separate jar must be downloaded and placed in the lib directory of the Flume installation. There are two ways to get the jar:

The first is to build it from source: https://github.com/cloudera/cdk

The second is to download the prebuilt jar (Maven coordinates com.cloudera.cdk:cdk-flume-avro-event-serializer:0.9.2): https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/


Option 2: use the serializer class provided by Apache Flume, i.e. org.apache.flume.sink.hdfs.AvroEventSerializer$Builder. The configuration is:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add an interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************ k2 uses the Avro Event Serializer: start *************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
# partition by day; this requires the timestamp header, i.e. the timestamp interceptor added to the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
# set the in-use file prefix to "_" because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
a1.sinks.k2.serializer.schemaURL = hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************ k2 uses the Avro Event Serializer: end *************************#

This serializer class is provided by Apache Flume, so no additional jar is needed.


The Avro client code is as follows:

package com.learn.flume;

import com.google.gson.JsonObject;
import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.NettyAvroRpcClient;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.event.JSONEvent;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;

public class MyApp {
    public static void main(String[] args) throws IOException {
        MyRpcClientFacade client = new MyRpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);


        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }

        client.cleanUp();
    }
}

class MyRpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;

        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Build a Flume event with the string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        event.getHeaders().put("kkkk", "aaaaa");
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
        Schema schema = new Schema.Parser().parse(inputStream);

        Event event = EventBuilder.withBody(serializeAvro(data, schema));
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }


    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }

    /**
     * Use the Avro encoder to produce the byte array; otherwise the data stored on HDFS cannot be read back.
     * @param datum
     * @param schema
     * @return
     * @throws IOException
     */
    public byte[] serializeAvro(Object datum, Schema schema) throws IOException {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ReflectDatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(outputStream, null);
        outputStream.reset();
        writer.write(datum, binaryEncoder);
        binaryEncoder.flush();
        return outputStream.toByteArray();
    }

}

The biggest difference from Scenario 2 is how the Java object is turned into a byte array. Scenario 2 uses ordinary Java serialization, whereas here the Avro encoder must be combined with the schema to produce the object's byte array before it is sent over RPC. Otherwise, even though the data ends up on HDFS, it cannot be parsed as Avro.

The schema for the class is as follows:

{"namespace": "com.learn.avro",
 "type": "record",
 "name": "UserModel",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "address",  "type": ["string", "null"]},
     {"name": "age", "type": "int"},
     {"name": "job", "type": "string"}
 ]
}

It should be uploaded to HDFS (the configurations above reference it at hdfs://localhost:9000/flume_schema/UserModel.avsc).
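A minimal sketch of that upload using the Hadoop FileSystem API (the local path schema/UserModel.avsc and the helper name uploadSchema are assumptions; the NameNode address matches the one used elsewhere in this post):

    // Sketch: copy the local UserModel.avsc to the HDFS location referenced by
    // flume.avro.schema.url / serializer.schemaURL in the configs above.
    // Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path and java.net.URI.
    public static void uploadSchema() throws IOException {
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000/"), configuration);
        fs.copyFromLocalFile(new Path("schema/UserModel.avsc"),
                new Path("/flume_schema/UserModel.avsc"));
        fs.close();
    }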

After the data has been written to HDFS, the following code reads it back:

public static void read1() throws IOException {
        // the schema is stored in the file header, so no schema needs to be supplied for deserialization
        Configuration configuration = new Configuration();
        String hdfsURI = "hdfs://localhost:9000/";
        String hdfsFileURL = "flume_data/year=2018/moth=02/day=09/FlumeData.1518169360741.avro";
        FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
        //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
        //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
        AvroFSInput avroFSInput  = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
        DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
        DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(avroFSInput,genericRecordDatumReader);
        // the schema can be obtained from the file being read
        Schema schema = fileReader.getSchema();
        System.out.println("Get Schema Info:"+schema);
        GenericRecord genericUser = null;
        while (fileReader.hasNext()) {
            // pass the record back into next() to reduce object allocation and garbage collection
            genericUser = fileReader.next(genericUser);
            System.out.println(genericUser);
        }
        fileReader.close();
    }


Scenario 4 in detail: this scenario addresses uploading existing avro-format files to HDFS with Flume, so they can be queried and processed later.

The Flume configuration is as follows:

# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir=/home/shenjin/data/avro/
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume_data
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer =  org.apache.flume.serialization.AvroEventSerializer$Builder

agent1.sinks.hdfs-sink1.hdfs.filePrefix = event
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1

Note the following about this configuration:

  (1) The source type is spooldir, which watches a directory for newly added files and uses the AVRO deserializer to read them (a sketch of producing a suitable .avro file follows these notes).

    (2) The serializer class specified on the sink is not the one bundled with Apache Flume; see Scenario 3 for where to download the jar.

    (3) Setting hdfs.rollSize = 0 and hdfs.rollCount = 0 ensures that a file is not split into multiple files when it is uploaded to HDFS.
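For reference, the .avro files dropped into the spooling directory are ordinary Avro object container files with the schema embedded in the file header; the AVRO deserializer reads that schema and, with schemaType = LITERAL, forwards it in the flume.avro.schema.literal event header. A minimal sketch of producing such a file, assuming the UserModel schema from Scenario 3 (the method name and field values are illustrative):

    // Sketch: write an Avro container file that can be moved into the spooling directory
    // once it is complete. Requires java.io.File, org.apache.avro.Schema,
    // org.apache.avro.file.DataFileWriter and org.apache.avro.generic.GenericData,
    // GenericDatumWriter, GenericRecord.
    public static void writeAvroFile(Schema schema, File target) throws IOException {
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "shenjin");
        user.put("address", "hangzhou");
        user.put("age", 26);
        user.put("job", "it");

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, target);   // embeds the schema in the file header
        writer.append(user);
        writer.close();
    }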


