Using Flume's Spooling Directory Source to Collect Directory Data into HDFS

I. Requirements

Flume monitors a directory on Linux (/home/flume_data) for incoming files and writes them to the corresponding HDFS directory (hdfs://master:9000/flume/spool/%Y%m%d%H%M).

II. Create the Configuration File

1. Create a configuration file named hdfs-logger.conf under conf

# Name the components on this agent
spool-hdfs-agent.sources = spool-source
spool-hdfs-agent.sinks = hdfs-sink
spool-hdfs-agent.channels = memory-channel

# Describe/configure the source
spool-hdfs-agent.sources.spool-source.type = spooldir
spool-hdfs-agent.sources.spool-source.spoolDir = /home/flume_data

# Describe the sink
spool-hdfs-agent.sinks.hdfs-sink.type = hdfs
spool-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/spool/%Y%m%d%H%M
spool-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
spool-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
spool-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
spool-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
spool-hdfs-agent.channels.memory-channel.type = memory
spool-hdfs-agent.channels.memory-channel.capacity = 1000
spool-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
spool-hdfs-agent.sources.spool-source.channels = memory-channel
spool-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

2. Notes

(1) spool-hdfs-agent is the name of the agent; it must match the --name argument in the Flume start command;

(2) /home/flume_data is the directory Flume monitors and collects from;

(3) hdfs://master:9000/flume/spool/%Y%m%d%H%M is the HDFS output path; %Y%m%d%H%M is the time format used for the output folders;
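The %Y%m%d%H%M tokens follow strftime-style escape sequences, so events are bucketed into per-minute folders. A quick illustration of the resulting bucket path (Python is used here only to mimic the formatting; Flume derives the path from each event's timestamp header, or from local time when useLocalTimeStamp = true):

```python
from datetime import datetime

def bucket_path(base, ts):
    """Render the HDFS bucket folder for an event timestamp,
    mirroring Flume's %Y%m%d%H%M escape sequences."""
    return base + "/" + ts.strftime("%Y%m%d%H%M")

# An event at 2020-04-21 10:09:07 lands in the 202004211009 folder,
# which matches the path seen later in the console log.
print(bucket_path("hdfs://master:9000/flume/spool",
                  datetime(2020, 4, 21, 10, 9, 7)))
```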

(4) Flume has three roll strategies:
By time: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
By size: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
By event count: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

Rolling means that when any one of the configured roll conditions is met, Flume commits the current file to HDFS (i.e., a finished file appears in HDFS).

rollInterval

Default: 30
How long (in seconds) the HDFS sink waits before rolling the temporary file into the final target file;
Set to 0 to disable time-based rolling;
Note: rolling (roll) means the HDFS sink renames the temporary file to its final target name and opens a new temporary file for writing;

rollSize

Default: 1024
When the temporary file reaches this size (in bytes), it is rolled into the target file;
Set to 0 to disable size-based rolling;

rollCount

Default: 10
When the number of events reaches this count, the temporary file is rolled into the target file;
Set to 0 to disable count-based rolling;

(5) rollSize measures the size before compression, so if the HDFS files are compressed, increase rollSize accordingly;
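To make the interaction of the three thresholds concrete, here is a small sketch (not Flume source code) of the decision the HDFS sink effectively makes; as noted above, a threshold of 0 disables that trigger:

```python
def should_roll(open_seconds, bytes_written, event_count,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Return True if any enabled roll condition is met.
    A threshold of 0 disables that condition, as in Flume."""
    if roll_interval > 0 and open_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and event_count >= roll_count:
        return True
    return False

# With rollCount = 0 (as in the config above), only time and size matter:
print(should_roll(10, 500, 99999, roll_count=0))  # no threshold reached yet
print(should_roll(31, 500, 0, roll_count=0))      # rollInterval exceeded
```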

(6) When a file in the directory has been collected into HDFS, it is renamed with a .COMPLETED suffix as a marker;

(7) When using the Spooling Directory Source, modifying a file after it has already been collected causes an error and stops the source; likewise, placing a new file with the same name as an already-collected file into the directory also causes an error and stops the source;
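To reduce the chance of tripping over half-written or leftover files, the source exposes a few knobs; the options below are standard Spooling Directory Source settings, but verify them against your Flume version's user guide:

```
# Suffix appended to fully-ingested files (default .COMPLETED)
spool-hdfs-agent.sources.spool-source.fileSuffix = .COMPLETED
# Skip files matching this regex, e.g. temp files still being written
spool-hdfs-agent.sources.spool-source.ignorePattern = ^.*\.tmp$
# Alternatively, delete files after ingestion instead of renaming them
# spool-hdfs-agent.sources.spool-source.deletePolicy = immediate
```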

(8) HDFS output can be partitioned by time; note that if no data arrives within a given time bucket, that time folder is not created;

(9) The generated file name defaults to prefix + timestamp; this can be changed.
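For example, the prefix and suffixes of the file name can be set on the sink; the prefix wsk is already configured above, and hdfs.inUseSuffix is a standard HDFS sink option (with a compressed stream, the codec extension such as .gz is appended automatically):

```
# Final name: <filePrefix>.<epoch-millis>, plus the codec extension (.gz here)
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
# Files still being written carry this suffix until rolled (default .tmp)
spool-hdfs-agent.sinks.hdfs-sink.hdfs.inUseSuffix = .tmp
```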

III. Start Flume

1. Command

[root@master bin]# ./flume-ng agent --conf conf --conf-file ../conf/hdfs-logger.conf --name spool-hdfs-agent -Dflume.root.logger=INFO,console

2. Send files to the monitored directory

[root@master flumeData]# cp teacher /home/flume_data/
[root@master flumeData]# cp student /home/flume_data/
[root@master flumeData]# cat teacher 
chenlaoshi
malaoshi
haolaoshi
weilaoshi
hualaoshi
[root@master flumeData]# cat student 
zhangsan
lisi
wangwu
xiedajiao
xieguangkun

3. Console log output

20/04/21 10:08:56 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: spool-source started
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/teacher to /home/flume_data/teacher.COMPLETED
20/04/21 10:09:07 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:09:07 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/spool/202004211009/wsk.1587434947074.gz.tmp
20/04/21 10:09:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
20/04/21 10:09:08 INFO compress.CodecPool: Got brand-new compressor [.gz]
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/student to /home/flume_data/student.COMPLETED

4. The monitored directory

[root@master flume_data]# ls
student.COMPLETED  teacher.COMPLETED

5. Result stored in HDFS

Download, decompress, and open the files to check the contents.

IV. Drawbacks of This Approach

1. Although it can monitor a directory, it cannot monitor data in nested subdirectories recursively;

2. If Flume crashes during collection, there is no guarantee that on restart it will resume from the position in the file where it left off;
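Both limitations are the usual motivation for switching to the Taildir Source (available since Flume 1.7), which tracks per-file read offsets in a JSON position file and resumes after a restart. A minimal sketch of the source side, assuming the same agent and channel as above (path names here are illustrative):

```
# Taildir records offsets in positionFile, so it can resume after a crash
spool-hdfs-agent.sources.taildir-source.type = TAILDIR
spool-hdfs-agent.sources.taildir-source.positionFile = /home/flume/taildir_position.json
# Each filegroup is a pattern of files to tail
spool-hdfs-agent.sources.taildir-source.filegroups = fg1
spool-hdfs-agent.sources.taildir-source.filegroups.fg1 = /home/flume_data/.*
spool-hdfs-agent.sources.taildir-source.channels = memory-channel
```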
