Using Flume's Taildir Source to collect data from a directory into HDFS

I. Notes

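1. This approach is suitable for production environments.

2. Taildir Source was introduced in Apache Flume 1.7, but CDH's Flume 1.6 already integrates it.

3. Taildir Source is a reliable source: it periodically writes each file's read offset to a JSON position file on disk. When Flume restarts, it reads that JSON file to recover each offset and resumes reading from the previous position, so the source does not lose data across restarts. (Events still buffered in a memory channel can be lost if the agent crashes.)

4. Taildir Source can monitor multiple directories and files at the same time, even while data is still being written to those files.

5. Taildir Source cannot collect data from nested subdirectories recursively; supporting that requires modifying the source code.

6. To monitor every file in a directory with Taildir Source, the file-name part of the path must be the regular expression .*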
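
Points 5 and 6 can be modelled with a short Python sketch. This is not Flume's code: `match_filegroup` is a hypothetical helper, and it assumes that the last path component is a regex that must match the whole file name, and that only regular files directly inside the parent directory are considered. The directory layout mirrors the one used later in this article.

```python
import re
import tempfile
from pathlib import Path


def match_filegroup(filegroup):
    """Return the names of regular files directly under the parent directory
    whose whole file name matches the regex in the last path component."""
    parent, _, name_regex = filegroup.rpartition("/")
    pattern = re.compile(name_regex)
    return sorted(
        entry.name
        for entry in Path(parent).iterdir()
        if entry.is_file() and pattern.fullmatch(entry.name)
    )


# Recreate the article's layout inside a temporary directory.
with tempfile.TemporaryDirectory() as root:
    data = Path(root) / "flume_data"
    (data / "test").mkdir(parents=True)
    (data / "student").write_text("zhangsan\n")
    (data / "teacher").write_text("chenlaoshi\n")
    (data / "test" / "woker").write_text("laowang\n")

    matched = match_filegroup(f"{data}/.*")
    print(matched)  # ['student', 'teacher']
```

The `test` subdirectory is skipped because it is not a regular file, and nothing inside it is ever listed, which is the non-recursive behaviour described in point 5.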

II. Create a configuration file under conf

1. Create the configuration file hdfs-taildir-logger.conf under conf

# Name the components on this agent
taildir-hdfs-agent.sources = taildir-source
taildir-hdfs-agent.sinks = hdfs-sink
taildir-hdfs-agent.channels = memory-channel

# Describe/configure the source
taildir-hdfs-agent.sources.taildir-source.type = TAILDIR
taildir-hdfs-agent.sources.taildir-source.filegroups = f1
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data/.*
taildir-hdfs-agent.sources.taildir-source.positionFile = /home/taildir/taildir_position.json

# Describe the sink
taildir-hdfs-agent.sinks.hdfs-sink.type = hdfs
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/taildir/%Y%m%d%H%M
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir-hdfs-agent.channels.memory-channel.type = memory
taildir-hdfs-agent.channels.memory-channel.capacity = 1000
taildir-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
taildir-hdfs-agent.sources.taildir-source.channels = memory-channel
taildir-hdfs-agent.sinks.hdfs-sink.channel = memory-channel
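
To see how the escape sequences in `hdfs.path` turn into a directory name, here is a minimal Python sketch. It is only an approximation: `bucket_path` is a hypothetical helper, it relies on `%Y`, `%m`, `%d`, `%H` and `%M` meaning the same thing in Python's `strftime` as in the sink's path escapes (other escapes may differ), and both the epoch-millisecond value and the UTC+8 time zone are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def bucket_path(path_template, epoch_millis, tz):
    """Expand the %Y %m %d %H %M escapes in the template for one event time."""
    event_time = datetime.fromtimestamp(epoch_millis / 1000, tz=tz)
    return event_time.strftime(path_template)


template = "hdfs://master:9000/flume/taildir/%Y%m%d%H%M"
utc_plus_8 = timezone(timedelta(hours=8))  # assumed local time zone of the agent host

print(bucket_path(template, 1587437594134, utc_plus_8))
# hdfs://master:9000/flume/taildir/202004211053
```

Because the template goes down to the minute, a new directory is created for every minute in which events arrive.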

2. Start the agent from the command line

[root@master bin]# ./flume-ng agent --conf ../conf --conf-file ../conf/hdfs-taildir-logger.conf --name taildir-hdfs-agent -Dflume.root.logger=INFO,console

3. Send data to the monitored directory

[root@master flumeData]# cp student /home/flume_data/
[root@master flumeData]# cp teacher /home/flume_data/
[root@master flumeData]# scp -r test /home/flume_data/
[root@master flumeData]# cat teacher 
chenlaoshi
malaoshi
haolaoshi
weilaoshi
hualaoshi
[root@master flumeData]# cat student 
zhangsan
lisi
wangwu
xiedajiao
xieguangkun

[root@master test]# cat woker 
laowang
laoli
laohao
laoxu
laochen

4. Console output

20/04/21 10:53:04 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: taildir-source started
20/04/21 10:53:14 INFO taildir.ReliableTaildirEventReader: Opening file: /home/flume_data/student, inode: 34409843, pos: 0
20/04/21 10:53:14 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:53:14 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp
20/04/21 10:53:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
20/04/21 10:53:16 INFO compress.CodecPool: Got brand-new compressor [.gz]
20/04/21 10:53:46 INFO hdfs.HDFSEventSink: Writer callback called.
20/04/21 10:53:46 INFO hdfs.BucketWriter: Closing hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp
20/04/21 10:53:47 INFO hdfs.BucketWriter: Renaming hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp to hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz
20/04/21 10:54:09 INFO taildir.ReliableTaildirEventReader: Opening file: /home/flume_data/teacher, inode: 34409846, pos: 0
20/04/21 10:54:09 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:54:09 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp
20/04/21 10:54:39 INFO hdfs.HDFSEventSink: Writer callback called.
20/04/21 10:54:39 INFO hdfs.BucketWriter: Closing hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp
20/04/21 10:54:39 INFO hdfs.BucketWriter: Renaming hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp to hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz
20/04/21 10:55:19 INFO taildir.TaildirSource: Closed file: /home/flume_data/student, inode: 34409843, pos: 43
20/04/21 10:56:14 INFO taildir.TaildirSource: Closed file: /home/flume_data/teacher, inode: 34409846, pos: 50
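
The `pos` values in the last two log lines (43 and 50) are byte offsets, and they can be checked against the sample files with a small Python sketch. It assumes each file is plain ASCII and ends with a newline. The JSON printed at the end shows the layout I believe the position file uses (an array of objects with `inode`, `pos` and `file`); treat that layout as an assumption and compare it with your own taildir_position.json.

```python
import json

# Contents of the two sample files as shown by `cat` above.
student = "zhangsan\nlisi\nwangwu\nxiedajiao\nxieguangkun\n"
teacher = "chenlaoshi\nmalaoshi\nhaolaoshi\nweilaoshi\nhualaoshi\n"

# Byte length of each file, counting the newline after every line.
offsets = {
    "/home/flume_data/student": len(student.encode("utf-8")),
    "/home/flume_data/teacher": len(teacher.encode("utf-8")),
}
print(offsets)
# {'/home/flume_data/student': 43, '/home/flume_data/teacher': 50}

# Inode numbers copied from the console output above.
inodes = {
    "/home/flume_data/student": 34409843,
    "/home/flume_data/teacher": 34409846,
}

# Assumed position-file layout: one object per tracked file.
position_json = json.dumps(
    [{"inode": inodes[path], "pos": pos, "file": path} for path, pos in offsets.items()]
)
print(position_json)
```

Both offsets equal the full file size, which means each file had been read to the end when it was closed.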

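5. HDFS storage

After downloading and opening the files from HDFS, the data from the file under the test folder (woker in the transcript above) is not there, because Taildir Source cannot collect data from nested subdirectories recursively; supporting that requires modifying the source code.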

III. Monitoring multiple directories

1. This configuration monitors two directories as an example

# Name the components on this agent
taildir-hdfs-agent.sources = taildir-source
taildir-hdfs-agent.sinks = hdfs-sink
taildir-hdfs-agent.channels = memory-channel

# Describe/configure the source
taildir-hdfs-agent.sources.taildir-source.type = TAILDIR
taildir-hdfs-agent.sources.taildir-source.filegroups = f1 f2
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data1/.*
taildir-hdfs-agent.sources.taildir-source.filegroups.f2 = /home/flume_data/.*
taildir-hdfs-agent.sources.taildir-source.positionFile = /home/taildir/taildir_position.json

# Describe the sink
taildir-hdfs-agent.sinks.hdfs-sink.type = hdfs
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/taildir1/%Y%m%d%H%M
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir-hdfs-agent.channels.memory-channel.type = memory
taildir-hdfs-agent.channels.memory-channel.capacity = 1000
taildir-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
taildir-hdfs-agent.sources.taildir-source.channels = memory-channel
taildir-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

2. Other

To monitor only files with a specific suffix, write the file groups like this:

# Monitor only files in this directory whose names end in log
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data1/.*log
# Monitor only files in this directory whose names end in txt
taildir-hdfs-agent.sources.taildir-source.filegroups.f2 = /home/flume_data/.*txt
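
It is worth checking what these suffix patterns actually accept. The Python sketch below assumes that the regex must match the whole file name and that Python's `re` behaves like Java's regex engine for patterns this simple. The file names are made up for illustration, and `.*[.]log` is a stricter variant that is not part of the original configuration.

```python
import re


def matches(name_regex, file_name):
    """True when the regex matches the whole file name."""
    return re.fullmatch(name_regex, file_name) is not None


# Hypothetical file names, chosen to show the edge cases.
names = ["app.log", "catalog", "app.log.1", "notes.txt", "notes.txt.bak"]

results = {
    regex: [name for name in names if matches(regex, name)]
    for regex in (".*log", ".*[.]log", ".*txt")
}
for regex, accepted in results.items():
    print(regex, accepted)
# .*log ['app.log', 'catalog']
# .*[.]log ['app.log']
# .*txt ['notes.txt']
```

`.*log` also accepts a name like catalog because it only requires the name to end in the letters log. Writing the dot as `[.]` demands a literal dot without putting a backslash into the properties file.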

 
