Apache Flume
1. Overview
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume is a distributed, reliable, and efficient tool for collecting, aggregating, and transporting data. It is fault-tolerant, supports failure recovery, can collect data in real time, and can be used to build online-analytics applications.
In plain terms: Flume is a tool; it does not produce data itself, it only moves data around.
Architecture
Terminology
- Agent: a Flume service instance; multiple Agents form a Flume cluster
- Source: a core component of an Agent that receives or collects the data produced by a data source
- Channel: a core component of an Agent that temporarily buffers the data collected by the Source; once events have been sunk to the storage system, they are automatically removed from the Channel
- Sink: a core component of an Agent that connects the Channel to an external storage system and writes the Channel's data out to it
- Event: the smallest unit of data transferred inside a Flume Agent, consisting of an Event Header and an Event Body
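To make the Event structure concrete, here is a toy sketch in plain Java (an illustration only, not the real org.apache.flume.Event API): an event is just a map of header key/value pairs plus a byte-array body.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of a Flume Event: string headers plus a byte[] body.
// Illustration only -- NOT the actual org.apache.flume.Event interface.
public class ToyEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    public ToyEvent(String body) {
        this.body = body.getBytes(StandardCharsets.UTF_8);
    }

    public void setHeader(String key, String value) { headers.put(key, value); }
    public Map<String, String> getHeaders() { return headers; }
    public byte[] getBody() { return body; }
    public String bodyAsString() { return new String(body, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        ToyEvent e = new ToyEvent("Hello Flume");
        e.setHeader("timestamp", "1576768704952");
        System.out.println(e.getHeaders() + " body=" + e.bodyAsString());
    }
}
```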
2. Environment Setup
Prerequisites
- JDK 1.8
- Sufficient memory and disk space
- Read/write permission on the target directories
Installation
[root@HadoopNode00 ~]# tar -zxf apache-flume-1.7.0-bin.tar.gz -C /usr
3. Usage
Configuration File Syntax (key topic)
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# Write events to the agent's console (log output)
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Quick Start
Reuse the agent configuration above.
Purpose: collect TCP request data and print the collected events to the agent's console.
Start the Flume agent
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/quickstart.properties --name a1 -Dflume.root.logger=INFO,console
Send request data to the server
On Linux
[root@HadoopNode00 ~]# yum install -y telnet
[root@HadoopNode00 ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Hello World
OK
Hello Flume
OK
Hello Hadoop
OK
On Windows
# 1. Open the Control Panel
# 2. Programs -> Turn Windows features on or off
# 3. Check "Telnet Client"
# 4. Reopen a cmd window and test
telnet 192.168.197.10 44444
4. Flume Components in Depth
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#exec-source
Source
Main role: the component that receives or collects the data produced by external data sources.
Netcat (testing only, not important)
A network service: starts a TCP server and collects data from TCP client requests.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Exec (good to know)
Executes a given Linux command and uses the command's output as the source's data.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/Hello.log
Spooling Directory (key topic)
Collects the contents of all data files in a given directory; once a file has been fully ingested, its suffix is automatically changed to .COMPLETED
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
Note:
- spooldir source: each data file is ingested exactly once.
- exec source with `tail -f xxx.log`: data may be ingested repeatedly (for example after an agent restart), producing duplicate records.
Kafka Source (good to know)
Pulls data from Kafka as the source's input.
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
# Max number of events per batch
tier1.sources.source1.batchSize = 5000
# Max batch interval: 2 s
tier1.sources.source1.batchDurationMillis = 2000
# Kafka broker address; for a cluster, separate multiple nodes with ","
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092,localhost:9093
# Kafka topics to subscribe to
tier1.sources.source1.kafka.topics = test1, test2
# Consumer group this Kafka consumer belongs to
tier1.sources.source1.kafka.consumer.group.id = custom.g.idz
Avro (key topic)
Receives real-time data sent by an external Avro client; the Avro source is typically used when building a Flume cluster (agent-to-agent pipelines).
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 8888
Send test data
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng avro-client --host 192.168.197.10 --port 8888 --filename /root/log.txt
Sequence Generator (can skip)
For testing only: generates sequence data from 0 to Long.MAX_VALUE.
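A minimal configuration sketch for this source (property names per the Flume user guide; the totalEvents cap is optional):

```properties
a1.sources = r1
a1.sources.r1.type = seq
# stop after 100 generated events (defaults to Long.MAX_VALUE)
a1.sources.r1.totalEvents = 100
a1.sources.r1.channels = c1
```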
Channel
The channel temporarily buffers the events collected by the Source in an event queue; it acts much like a write cache or write buffer.
Memory
Events are held temporarily in an in-memory queue.
Pros: extremely fast.
Cons: data may be lost (e.g. if the agent crashes).
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
JDBC
Persists events in the embedded Derby database; other database systems (MySQL, Oracle) are not supported.
a1.channels = c1
a1.channels.c1.type = jdbc
Kafka (key topic)
Stores events in Kafka, a distributed streaming platform.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092,localhost....
File (key topic)
Stores events on the local filesystem.
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Spillable Memory
A channel that spills from memory to disk: events are held in memory and overflow to disk storage.
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Note: to delete Flume's historical channel data:
[root@HadoopNode00 apache-flume-1.7.0-bin]# rm -rf ~/.flume/
Sink
Writes the collected data out to an external storage system.
Logger
Outputs events as INFO-level log messages; typically used for testing and debugging.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
HBase Sink
Writes the collected data to HBase (Hadoop's BigTable-style database).
Note:
- Be sure to start ZooKeeper and HDFS first.
- If you are using HBase 2.0+, use type=hbase2
# Describe the sink
a1.sinks.k1.type = hbase
a1.sinks.k1.table = ns:t_flume
a1.sinks.k1.columnFamily = cf1
File Roll
Stores the collected data on the local filesystem.
# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/data2
# Disable file rolling
a1.sinks.k1.sink.rollInterval = 0
Null
Discards all collected data.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = null
a1.sinks.k1.channel = c1
Avro Sink
Sends the collected data to an Avro server (typically another agent's Avro source).
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
Building a Flume cluster
Agent 1
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/s1.properties
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 9999
# Describe the sink
# Forward events to the downstream agent's Avro source
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.197.10
a1.sinks.k1.port = 7777
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent 2
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/s2.properties
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 7777
# Describe the sink
# Write events out to local files
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/data3
a1.sinks.k1.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agents
- Start the second agent first:
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/s2.properties --name a1 -Dflume.root.logger=INFO,console
- Then start the first agent:
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/s1.properties --name a1 -Dflume.root.logger=INFO,console
Test the cluster
Send data to the first agent's netcat port (e.g. telnet 192.168.197.10 9999) and verify that the events appear as files under /root/data3.
HDFS Sink (key topic)
Saves the collected data to HDFS.
# Describe the sink
# Write events to HDFS, partitioned by date and hour
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = gaozhy
a1.sinks.k1.hdfs.fileSuffix = .txt
a1.sinks.k1.hdfs.rollInterval = 15
# Roll files based on the number of events
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
Kafka Sink (key topic)
Writes the collected data to a topic in a Kafka cluster.
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
Optional Components
Interceptors
Timestamp Interceptor
Main role: adds a key/value pair to the Event Header, with key timestamp and value the timestamp at which the data was collected.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
---------------------------------------------------------------------------
19/12/19 23:18:24 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768704952} body: 31 31 0D 11. }
19/12/19 23:18:25 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768705631} body: 32 32 0D 22. }
19/12/19 23:18:26 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768706571} body: 33 33 0D 33. }
Host Interceptor
Main role: adds a key/value pair to the Event Header, with key host and value the agent's IP address or hostname, identifying where the data came from.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
---------------------------------------------------------------------------
19/12/19 23:23:43 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769023573} body: 31 31 0D 11. }
19/12/19 23:23:44 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769024244} body: 32 32 0D 22. }
19/12/19 23:23:44 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769024854} body: 33 33 0D 33. }
Static Interceptor
Main role: adds a static, user-defined key/value pair to the Event Header, e.g.:
datacenter=bj
datacenter=sh
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
---------------------------------------------------------------------------
19/12/19 23:30:34 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769434926} body: 31 31 0D 11. }
19/12/19 23:30:35 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769435649} body: 32 32 0D 22. }
19/12/19 23:30:36 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769436316} body: 33 33 0D 33. }
Remove Header
(Did not work correctly in testing; skipped.)
UUID Interceptor
Main role: adds a unique UUID identifier to the Event Header.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3 i4
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
---------------------------------------------------------------------------
19/12/19 23:42:32 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, id=e7c8855c-6712-4877-ac00-33a2a05d6982, timestamp=1576770152337} body: 31 31 0D 11. }
19/12/19 23:42:33 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, id=119029af-606f-47d8-9468-3f5ef78c4319, timestamp=1576770153078} body: 32 32 0D 22. }
Search and Replace
Main role: an interceptor that searches the event body with a regex and replaces the matches.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = search_replace
a1.sources.r1.interceptors.i3.searchPattern = ^ERROR.*$
a1.sources.r1.interceptors.i3.replaceString = abc
---------------------------------------------------------------------------
19/12/19 23:53:29 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770809449} body: 0D . }
19/12/19 23:53:38 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770817952} body: 31 31 0D 11. }
19/12/19 23:53:38 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770818567} body: 32 32 0D 22. }
19/12/19 23:53:53 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770831967} body: 61 62 63 0D abc. }
19/12/19 23:54:15 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770851674} body: 61 62 63 0D abc. }
Regex Filtering (key topic)
A regex-filtering interceptor: events matching the regular expression pass through; non-matching events are discarded.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^ERROR.*$
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
Regex Extractor
Extracts the substrings matched by regex groups and adds them to the Event Header. For example, from
192.168.0.3 GET /login.jsp 404
it extracts the IP address and the status code.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d{1,3}.\\d{1,3}.\\d{1,3}.\\d{1,3})\\s\\w*\\s\\/.*\\s(\\d{3})$
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = ip
a1.sources.r1.interceptors.i1.serializers.s2.name = code
---------------------------------------------------------------------------
19/12/20 00:56:31 INFO sink.LoggerSink: Event: { headers:{code=404, ip=192.168.0.3} body: 31 39 32 2E 31 36 38 2E 30 2E 33 20 47 45 54 20 192.168.0.3 GET }
19/12/20 00:56:53 INFO sink.LoggerSink: Event: { headers:{code=200, ip=192.168.254.254} body: 31 39 32 2E 31 36 38 2E 32 35 34 2E 32 35 34 20 192.168.254.254 }
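The extractor's pattern can be sanity-checked outside Flume. In the .properties file each backslash is written as "\\", and a Java string literal doubles backslashes the same way, so the pattern transfers verbatim; a small sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sanity check for the regex_extractor pattern above: both the properties
// file and a Java string literal collapse "\\" to a single regex backslash,
// so the pattern below is copied verbatim from the config.
public class RegexExtractorCheck {
    static final Pattern P = Pattern.compile(
            "(\\d{1,3}.\\d{1,3}.\\d{1,3}.\\d{1,3})\\s\\w*\\s\\/.*\\s(\\d{3})$");

    // Returns {ip, code} as the extractor would add them to the header,
    // or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] kv = extract("192.168.0.3 GET /login.jsp 404");
        System.out.println("ip=" + kv[0] + " code=" + kv[1]); // ip=192.168.0.3 code=404
    }
}
```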
Channel Selectors
Replicating Channel Selector
The data collected by the source is replicated to all of its channels.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = replicating
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
a1.sinks.k2.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Multiplexing Channel Selector (routing)
The data collected by the source is routed to different channels according to the configured routing rules.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(\\w*).*$
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = level
a1.sources.r1.selector.type = multiplexing
# Key in the event header to route on
a1.sources.r1.selector.header = level
a1.sources.r1.selector.mapping.ERROR = c1
a1.sources.r1.selector.mapping.INFO = c2
a1.sources.r1.selector.mapping.DEBUG = c2
a1.sources.r1.selector.default = c1
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
a1.sinks.k2.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Sink Processors
Also called a sink group: organizes multiple sinks into one group; mainly used for load balancing and failover when writing out.
Failover Sink Processor
A fault-tolerant sink group: events are written to the live sink with the highest configured priority.
Limitation: it detects sink failures, but cannot detect failures of the external storage system itself.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# k1 logs events to the console; k2 writes them to HDFS
a1.sinks.k1.type = logger
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k2.hdfs.filePrefix = gaozhy
a1.sinks.k2.hdfs.fileSuffix = .txt
a1.sinks.k2.hdfs.rollInterval = 15
# Roll files based on the number of events
a1.sinks.k2.hdfs.rollCount = 5
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 50
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Load Balancing Sink Processor
Load-balances the sink writes across the group of sinks.
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
# Disable file rolling
a1.sinks.k2.sink.rollInterval = 0
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Log4j Appender
Add the dependencies
<!-- https://mvnrepository.com/artifact/org.apache.flume/flume-ng-sdk -->
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.7.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.7.0</version>
</dependency>
Develop the application
package org.example;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LogTest {

    // logger that produces the log records Flume will collect
    private static Log log = LogFactory.getLog(LogTest.class);

    public static void main(String[] args) {
        log.debug("this is debug"); // record a debug-level log message
        log.info("this is info");
        log.warn("this is warn");
        log.error("this is error");
        log.fatal("this is fatal");
    }
}
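The class above only produces log records; to ship them to a Flume agent, log4j must also be pointed at the Flume appender. A sketch of log4j.properties, assuming an agent with an Avro source listening on 192.168.197.10:8888 (host and port are placeholders; match them to your avro source configuration):

```properties
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = 192.168.197.10
log4j.appender.flume.Port = 8888
# Don't let logging failures break the application if the agent is down
log4j.appender.flume.UnsafeMode = true
```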
Exercise
A data-analysis system based on web-server access logs
Install the Nginx server
# 1. Install nginx's build dependencies
[root@HadoopNode00 conf]# yum install gcc-c++ perl-devel pcre-devel openssl-devel zlib-devel wget
# 2. Upload the Nginx tarball
# 3. Build and install
[root@HadoopNode00 ~]# tar -zxf nginx-1.11.1.tar.gz
[root@HadoopNode00 ~]# cd nginx-1.11.1
[root@HadoopNode00 nginx-1.11.1]# ./configure --prefix=/usr/local/nginx
[root@HadoopNode00 nginx-1.11.1]# make && make install
# 4. Enter the nginx installation directory
[root@HadoopNode00 nginx-1.11.1]# cd /usr/local/nginx/
[root@HadoopNode00 nginx]# ll
total 16
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 conf
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 html
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 logs
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 sbin
# 5. Start the Nginx server
[root@HadoopNode00 nginx]# sbin/nginx -c conf/nginx.conf
[root@HadoopNode00 nginx]# ps -ef | grep nginx
root 62524 1 0 22:11 ? 00:00:00 nginx: master process sbin/nginx -c conf/nginx.conf
nobody 62525 62524 0 22:11 ? 00:00:00 nginx: worker process
root 62575 62403 0 22:11 pts/2 00:00:00 grep nginx
# 6. Visit: http://hadoopnode00/
Log rotation
[root@HadoopNode00 nginx]# vi logsplit.sh
#!/bin/bash
# nginx log-rotation script
# Directory where the log files live
logs_path="/usr/local/nginx/logs/"
# Path to the pid file
pid_path="/usr/local/nginx/logs/nginx.pid"
# Rename the current log file
mv ${logs_path}access.log /usr/local/nginx/data/access_$(date -d "now" +"%Y-%m-%d-%I:%M:%S").log
# Signal the nginx master process to reopen its logs:
# USR1 tells nginx to start a fresh access.log after the old one has been moved away
kill -USR1 `cat ${pid_path}`
[root@HadoopNode00 nginx]# chmod u+x logsplit.sh
[root@HadoopNode00 nginx]# ll
total 44
drwx------. 2 nobody root 4096 Dec 20 22:11 client_body_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 conf
drwxr-xr-x. 2 root root 4096 Dec 20 22:19 data
drwx------. 2 nobody root 4096 Dec 20 22:11 fastcgi_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 html
drwxr-xr-x. 2 root root 4096 Dec 20 22:11 logs
-rwxr--r--. 1 root root 493 Dec 20 22:27 logsplit.sh
drwx------. 2 nobody root 4096 Dec 20 22:11 proxy_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 sbin
drwx------. 2 nobody root 4096 Dec 20 22:11 scgi_temp
drwx------. 2 nobody root 4096 Dec 20 22:11 uwsgi_temp
Scheduled task
Trigger the log rotation periodically, for example every day at 0:00. Linux provides a command for defining scheduled jobs: crontab.
# 1. Enter crontab's edit mode (works like the vi editor)
[root@HadoopNode00 nginx]# crontab -e
# 2. Add a job that periodically triggers the log-rotation script
# Schedule fields (five time units): minute hour day-of-month month day-of-week
# 0 0 * * *   rotate logs every day at 0:00
# */1 * * * * rotate logs every minute
*/1 * * * * /usr/local/nginx/logsplit.sh
Configure the Flume agent
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/nginx.properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Collect the rotated log files from the nginx data directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/nginx/data
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = nginx
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollInterval = 15
# Round down the event timestamp in the HDFS path to 10-minute buckets
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Collect the data
[root@HadoopNode00 apache-flume-1.7.0-bin]# start-dfs.sh
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/nginx.properties --name a1 -Dflume.root.logger=INFO,console
Data cleaning
# 1. Raw data
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET /favicon.ico HTTP/1.1" 404 571 "http://hadoopnode00/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
# 2. Five fields to extract: client IP address, request time, request method, URI, response status code
Regular expression
- See the regular-expression study notes
- Highly recommended: https://regex101.com/
The regular expression:
^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(.*)\]\s"(\w+)\s(.*)\sHTTP\/1.1"\s(\d{3})\s.*$
192.168.197.1 - - [20/Dec/2019:22:14:00 +0800] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
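Before wiring this pattern into a regex_extractor interceptor, it can be checked against the sample line above. A small Java sketch (backslashes doubled for the Java string literal):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sanity check for the access-log regex: extracts client IP, request time,
// method, URI, and status code from one nginx access-log line.
public class NginxLogParser {
    static final Pattern LOG = Pattern.compile(
            "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*\\[(.*)\\]\\s\"(\\w+)\\s(.*)\\sHTTP/1.1\"\\s(\\d{3})\\s.*$");

    // Returns {ip, time, method, uri, status}, or null if the line doesn't match.
    static String[] parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String line = "192.168.197.1 - - [20/Dec/2019:22:14:00 +0800] \"GET / HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0\"";
        String[] f = parse(line);
        System.out.println(f[0] + " | " + f[1] + " | " + f[2] + " | " + f[3] + " | " + f[4]);
    }
}
```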
Computation (omitted)
Data visualization
Present the computed results visually with charts and graphs.