Big Data: Using the Flume Component

Original

溪水流长

2020-02-23 13:28

flume

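What is Flume?

Overview

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting large volumes of log data. There are two generations of Flume. The 0.9.x releases are collectively called Flume OG (original generation), and the 1.x releases are called Flume NG (next generation). Flume NG reworked the core components, the core configuration, and the code architecture, so it differs substantially from Flume OG; take care to tell the two apart. The rework also coincided with Flume moving under the Apache umbrella, when Cloudera Flume was renamed Apache Flume.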

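Flume framework basics

Conceptual overview:
1. Flume's role in a cluster
Flume and Kafka collect data in real time, Spark and Storm process data in real time, and Impala serves real-time queries.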
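2. Introduction to the Flume framework
2.1 Flume provides a distributed, reliable service for efficiently collecting, aggregating, and moving large volumes of log data. It is usually deployed on Unix-like systems.
2.2 Flume is built on a streaming architecture. It is fault tolerant, flexible, and simple, and is mainly used for online, real-time analysis.
2.3 Roles
** Source
Collects data. The Source is where the data stream originates, and it passes that stream on to the Channel, which is loosely similar to a Channel in Java NIO.
** Channel
Bridges Sources and Sinks; it behaves like a queue.
** Sink
Takes data from the Channel and writes it to the destination, which can be the Source of the next agent, HDFS, or HBase.
2.4 Unit of transfer
** Event
The basic unit of data transfer in Flume. Data travels from its origin to its destination as events.
2.5 Transfer process
The source monitors a file. When the file receives new data, the source takes that data,
wraps it in an Event, puts the Event into the channel, and commits the transaction.
The channel is a first-in, first-out queue. The sink pulls events from the channel and writes them to HDFS or HBase.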
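The first-in, first-out behaviour of the channel can be pictured with a small shell sketch. This is only a toy model under stated assumptions: a plain file stands in for the channel, and channel_put and channel_take are hypothetical helper names, not Flume commands. Real channels also wrap each put and take in a transaction, which this sketch leaves out.

```shell
# Toy model of a channel: a plain file used as a first-in, first-out queue.
# channel_put and channel_take are hypothetical helpers, not part of Flume.

# Source side: append one event (a line of text) to the channel file.
channel_put() {
  printf '%s\n' "$2" >> "$1"
}

# Sink side: print the oldest event and remove it from the channel file.
# Returns non-zero when the channel is empty.
channel_take() {
  [ -s "$1" ] || return 1
  head -n 1 "$1"
  tail -n +2 "$1" > "$1.rest" && mv "$1.rest" "$1"
}
```

For example, after channel_put /tmp/ch "first" and channel_put /tmp/ch "second", the first call to channel_take /tmp/ch hands back the event that was put first.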
3. Installing and configuring Flume
3.1 flume-env.sh
Set the Java environment variable (JAVA_HOME) in this file.
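A minimal flume-env.sh might look like the sketch below. The JDK path is an assumption for illustration only; substitute the location of the JDK on your own machine. The Flume distribution ships a conf/flume-env.sh.template that can be copied as a starting point.

```shell
# conf/flume-env.sh
# Starting point: cp conf/flume-env.sh.template conf/flume-env.sh
# Assumption: the JDK lives under /usr/java/default; replace with your own path.
export JAVA_HOME=/usr/java/default
```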
4. Flume help command
$ bin/flume-ng
5. Examples:
5.1 Example 1: Flume listens on a port and prints the data it receives.
5.1.1 Create the Flume agent configuration file flume-telnet.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

  			# Describe/configure the source
  			a1.sources.r1.type = netcat
  			a1.sources.r1.bind = localhost
  			a1.sources.r1.port = 44444

  			# Describe the sink
  			a1.sinks.k1.type = logger

  			# Use a channel which buffers events in memory
  			a1.channels.c1.type = memory
  			a1.channels.c1.capacity = 1000
  			a1.channels.c1.transactionCapacity = 100

  			# Bind the source and sink to the channel
  			a1.sources.r1.channels = c1
  			a1.sinks.k1.channel = c1
  		5.1.2 Install the telnet tool
  			$ sudo rpm -ivh telnet-server-0.17-59.el7.x86_64.rpm 
  			$ sudo rpm -ivh telnet-0.17-59.el7.x86_64.rpm
  		5.1.3 First check whether port 44444 is already in use
  			$ netstat -an | grep 44444
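A plain grep for 44444 also matches established connections and any other line that happens to contain those digits. As a slightly stricter sketch, the helper below (has_listener is a hypothetical name) only succeeds when the netstat output shows a socket in the LISTEN state on exactly that port. It assumes the usual netstat -an layout, where the local address ends in :PORT or .PORT.

```shell
# Succeeds if the netstat output on stdin shows a listening socket on port $1.
# has_listener is a hypothetical helper name, not a standard tool.
has_listener() {
  grep -E "[.:]$1[[:space:]].*LISTEN" > /dev/null
}
```

Usage: netstat -an | has_listener 44444 && echo "port 44444 is already in use"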
  		5.1.4 Start the Flume agent so that it listens on the port
  			$ bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-telnet.conf -Dflume.root.logger=INFO,console
  		5.1.5 Use telnet to send content to port 44444 on the local machine.
  			$ telnet localhost 44444

  	5.2 Example 2: Tail the Hive log file and upload it to HDFS
  		5.2.1 Copy the required Hadoop jars into Flume's lib directory
  			share/hadoop/common/lib/hadoop-auth-2.5.0-cdh5.3.6.jar
  			share/hadoop/common/lib/commons-configuration-1.6.jar
  			share/hadoop/mapreduce1/lib/hadoop-hdfs-2.5.0-cdh5.3.6.jar
  			share/hadoop/common/hadoop-common-2.5.0-cdh5.3.6.jar
  		5.2.2 Create the file flume-hdfs.conf
  				# Name the components on this agent
  				a2.sources = r2
  				a2.sinks = k2
  				a2.channels = c2

  				# Describe/configure the source
  				a2.sources.r2.type = exec
  				a2.sources.r2.command = tail -f /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log
  				a2.sources.r2.shell = /bin/bash -c

  				# Describe the sink
  				a2.sinks.k2.type = hdfs
  				a2.sinks.k2.hdfs.path = hdfs://192.168.122.20:8020/flume/%Y%m%d/%H
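  				# Prefix for the uploaded files
  				a2.sinks.k2.hdfs.filePrefix = events-hive-
  				# Round the timestamp down, so directories roll over by time
  				a2.sinks.k2.hdfs.round = true
  				# Number of time units per directory
  				a2.sinks.k2.hdfs.roundValue = 1
  				# Time unit used for rounding
  				a2.sinks.k2.hdfs.roundUnit = hour
  				# Use the local time instead of a timestamp from the event header
  				a2.sinks.k2.hdfs.useLocalTimeStamp = true
  				# Number of events written before flushing to HDFS (kept within the channel's transactionCapacity of 100)
  				a2.sinks.k2.hdfs.batchSize = 100
  				# File type; DataStream writes plain output, compressed types are also supported
  				a2.sinks.k2.hdfs.fileType = DataStream
  				# Seconds to wait before rolling to a new file
  				a2.sinks.k2.hdfs.rollInterval = 600
  				# File size in bytes that triggers a roll
  				a2.sinks.k2.hdfs.rollSize = 134217700
  				# 0 means rolling does not depend on the number of events
  				a2.sinks.k2.hdfs.rollCount = 0
  				# Minimum number of block replicas
  				a2.sinks.k2.hdfs.minBlockReplicas = 1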


  				# Use a channel which buffers events in memory
  				a2.channels.c2.type = memory
  				a2.channels.c2.capacity = 1000
  				a2.channels.c2.transactionCapacity = 100

  				# Bind the source and sink to the channel
  				a2.sources.r2.channels = c2
  				a2.sinks.k2.channel = c2
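The memory channel's transactionCapacity limits how many events a sink can take in one transaction, so the sink's hdfs.batchSize should not exceed it. A quick way to check a configuration file before starting the agent is sketched below. prop and batch_fits are hypothetical helper names, and the parsing is deliberately simple: it assumes one key = value pair per line with numeric values.

```shell
# Print the value of key $2 from the properties file $1 (first match only).
# prop is a hypothetical helper; dots in the key act as regex wildcards.
prop() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# Succeeds when the value of key $2 (sink batch size) is no larger than
# the value of key $3 (channel transaction capacity) in file $1.
batch_fits() {
  [ "$(prop "$1" "$2")" -le "$(prop "$1" "$3")" ]
}
```

Usage: batch_fits conf/flume-hdfs.conf a2.sinks.k2.hdfs.batchSize a2.channels.c2.transactionCapacity || echo "lower batchSize or raise transactionCapacity"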
  			5.2.3 Run the agent with this configuration
  				$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-hdfs.conf 
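Once the agent has been running for a while, the output directory can be inspected. Because the sink path ends in %Y%m%d/%H, each hour gets its own directory. The sketch below prints the directory for a given Unix timestamp; bucket_dir is a hypothetical helper name, and the -d @SECONDS form assumes GNU date. Since the config sets useLocalTimeStamp = true, run it with the same time zone as the agent host.

```shell
# Print the directory that the pattern /flume/%Y%m%d/%H resolves to
# for a Unix timestamp given in seconds.
# Assumes GNU date (-d @SECONDS); bucket_dir is a hypothetical helper name.
bucket_dir() {
  date -d "@$1" +"/flume/%Y%m%d/%H"
}
```

Usage: hdfs dfs -ls "$(bucket_dir "$(date +%s)")"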

  	5.3 Example 3: Flume watches an entire directory
  		5.3.1 Create the configuration file flume-dir.conf
  			$ cp -a flume-hdfs.conf flume-dir.conf
  				a3.sources = r3
  				a3.sinks = k3
  				a3.channels = c3

  				# Describe/configure the source
  				a3.sources.r3.type = spooldir
  				a3.sources.r3.spoolDir = /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/upload
  				a3.sources.r3.fileHeader = true
  				# Ignore all files ending in .tmp; they are not uploaded
  				a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

  				# Describe the sink
  				a3.sinks.k3.type = hdfs
  				a3.sinks.k3.hdfs.path = hdfs://192.168.122.20:8020/flume/upload/%Y%m%d/%H
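  				# Prefix for the uploaded files
  				a3.sinks.k3.hdfs.filePrefix = upload-
  				# Round the timestamp down, so directories roll over by time
  				a3.sinks.k3.hdfs.round = true
  				# Number of time units per directory
  				a3.sinks.k3.hdfs.roundValue = 1
  				# Time unit used for rounding
  				a3.sinks.k3.hdfs.roundUnit = hour
  				# Use the local time instead of a timestamp from the event header
  				a3.sinks.k3.hdfs.useLocalTimeStamp = true
  				# Number of events written before flushing to HDFS (kept within the channel's transactionCapacity of 100)
  				a3.sinks.k3.hdfs.batchSize = 100
  				# File type; DataStream writes plain output, compressed types are also supported
  				a3.sinks.k3.hdfs.fileType = DataStream
  				# Seconds to wait before rolling to a new file
  				a3.sinks.k3.hdfs.rollInterval = 600
  				# File size in bytes that triggers a roll
  				a3.sinks.k3.hdfs.rollSize = 134217700
  				# 0 means rolling does not depend on the number of events
  				a3.sinks.k3.hdfs.rollCount = 0
  				# Minimum number of block replicas
  				a3.sinks.k3.hdfs.minBlockReplicas = 1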


  				# Use a channel which buffers events in memory
  				a3.channels.c3.type = memory
  				a3.channels.c3.capacity = 1000
  				a3.channels.c3.transactionCapacity = 100

  				# Bind the source and sink to the channel
  				a3.sources.r3.channels = c3
  				a3.sinks.k3.channel = c3
  		5.3.2 Run the test
  			$ bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-dir.conf &
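  		Summary:
  			Notes on using the Spooling Directory Source:
  				1. Do not create a file in the monitored directory and keep modifying it.
  				2. Files that have been fully uploaded are renamed with the .COMPLETED suffix.
  				3. The monitored directory is scanned for changes every 500 milliseconds.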
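One way to respect the first note is to copy a file into the directory under a .tmp name, which the ignorePattern in flume-dir.conf skips, and rename it only once the copy is complete. The sketch below assumes that ignorePattern is in effect; spool_put is a hypothetical helper name.

```shell
# Copy file $1 into spool directory $2 under a temporary .tmp name,
# then rename it, so the source never picks up a partially written file.
# spool_put is a hypothetical helper name.
spool_put() {
  name=$(basename "$1")
  cp "$1" "$2/$name.tmp" && mv "$2/$name.tmp" "$2/$name"
}
```

Usage: spool_put /var/log/app/app-20200223.log /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/upload (the log file path here is made up for illustration).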
