Apache Flume
1. Overview
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume is a distributed, reliable, and efficient tool for collecting, aggregating, and transporting data. It is fault-tolerant, supports failure recovery, can collect data in real time, and can be used to build online-analytics applications.
In plain terms: Flume is a tool; it does not produce data itself, it only moves data around.
Architecture
Terminology
- Agent: a Flume service instance; multiple Agents form a Flume cluster
- Source: a core component of an Agent that receives or collects the data produced by a data source
- Channel: a core component of an Agent that temporarily buffers the data collected by the Source; once events have been sunk to the storage system, they are automatically removed from the Channel
- Sink: a core component of an Agent that connects the Channel to an external storage system and writes the Channel's data out to it
- Event: the smallest unit of data transferred inside a Flume Agent, consisting of an Event Header and an Event Body
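To make the Event structure concrete, here is a toy sketch in plain Java (an illustration only, not the real org.apache.flume.Event API): an event is just a map of header key/value pairs plus a byte-array body.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of a Flume Event: string headers plus a byte[] body.
// Illustration only -- NOT the actual org.apache.flume.Event interface.
public class ToyEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    public ToyEvent(String body) {
        this.body = body.getBytes(StandardCharsets.UTF_8);
    }

    public void setHeader(String key, String value) { headers.put(key, value); }
    public Map<String, String> getHeaders() { return headers; }
    public byte[] getBody() { return body; }
    public String bodyAsString() { return new String(body, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        ToyEvent e = new ToyEvent("Hello Flume");
        e.setHeader("timestamp", "1576768704952");
        System.out.println(e.getHeaders() + " body=" + e.bodyAsString());
    }
}
```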
2. Environment Setup
Prerequisites
- JDK 1.8
- Sufficient memory and disk space
- Read/write permission on the target directories
Installation
[root@HadoopNode00 ~]# tar -zxf apache-flume-1.7.0-bin.tar.gz -C /usr
3. Usage
Configuration File Syntax (key topic)
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# Write events to the agent's console (log output)
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Quick Start
Reuse the agent configuration above.
Purpose: collect TCP request data and print the collected events to the agent's console.
Start the Flume agent
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/quickstart.properties --name a1 -Dflume.root.logger=INFO,console
Send request data to the server
On Linux
[root@HadoopNode00 ~]# yum install -y telnet
[root@HadoopNode00 ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Hello World
OK
Hello Flume
OK
Hello Hadoop
OK
On Windows
# 1. Open the Control Panel
# 2. Programs -> Turn Windows features on or off
# 3. Check "Telnet Client"
# 4. Reopen a cmd window and test
telnet 192.168.197.10 44444
4. Flume Components in Depth
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#exec-source
Source
Main role: the component that receives or collects the data produced by external data sources.
Netcat (testing only, not important)
A network service: starts a TCP server and collects data from TCP client requests.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Exec (good to know)
Executes a given Linux command and uses the command's output as the source's data.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/Hello.log
Spooling Directory (key topic)
Collects the contents of all data files in a given directory; once a file has been fully ingested, its suffix is automatically changed to .COMPLETED
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
Note:
- spooldir source: each data file is ingested exactly once.
- exec source with `tail -f xxx.log`: data may be ingested repeatedly (for example after an agent restart), producing duplicate records.
Kafka Source (good to know)
Pulls data from Kafka as the source's input.
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
# Max number of events per batch
tier1.sources.source1.batchSize = 5000
# Max batch interval: 2 s
tier1.sources.source1.batchDurationMillis = 2000
# Kafka broker address; for a cluster, separate multiple nodes with ","
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092,localhost:9093
# Kafka topics to subscribe to
tier1.sources.source1.kafka.topics = test1, test2
# Consumer group this Kafka consumer belongs to
tier1.sources.source1.kafka.consumer.group.id = custom.g.idz
Avro (key topic)
Receives real-time data sent by an external Avro client; the Avro source is typically used when building a Flume cluster (agent-to-agent pipelines).
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 8888
Send test data
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng avro-client --host 192.168.197.10 --port 8888 --filename /root/log.txt
Sequence Generator (can skip)
For testing only: generates sequence data from 0 to Long.MAX_VALUE.
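A minimal configuration sketch for this source (property names per the Flume user guide; the totalEvents cap is optional):

```properties
a1.sources = r1
a1.sources.r1.type = seq
# stop after 100 generated events (defaults to Long.MAX_VALUE)
a1.sources.r1.totalEvents = 100
a1.sources.r1.channels = c1
```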
Channel
The channel temporarily buffers the events collected by the Source in an event queue; it acts much like a write cache or write buffer.
Memory
Events are held temporarily in an in-memory queue.
Pros: extremely fast.
Cons: data may be lost (e.g. if the agent crashes).
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
JDBC
Persists events in the embedded Derby database; other database systems (MySQL, Oracle) are not supported.
a1.channels = c1
a1.channels.c1.type = jdbc
Kafka (key topic)
Stores events in Kafka, a distributed streaming platform.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092,localhost....
File (key topic)
Stores events on the local filesystem.
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Spillable Memory
A channel that spills from memory to disk: events are held in memory and overflow to disk storage.
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
Note: to delete Flume's historical channel data:
[root@HadoopNode00 apache-flume-1.7.0-bin]# rm -rf ~/.flume/
Sink
Writes the collected data out to an external storage system.
Logger
Outputs events as INFO-level log messages; typically used for testing and debugging.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
HBase Sink
Writes the collected data to HBase (Hadoop's BigTable-style database).
Note:
- Be sure to start ZooKeeper and HDFS first.
- If you are using HBase 2.0+, use type=hbase2
# Describe the sink
a1.sinks.k1.type = hbase
a1.sinks.k1.table = ns:t_flume
a1.sinks.k1.columnFamily = cf1
File Roll
Stores the collected data on the local filesystem.
# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/data2
# Disable file rolling
a1.sinks.k1.sink.rollInterval = 0
Null
Discards all collected data.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = null
a1.sinks.k1.channel = c1
Avro Sink
Sends the collected data to an Avro server (typically another agent's Avro source).
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
Building a Flume cluster
Agent 1
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/s1.properties
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 9999
# Describe the sink
# Forward events to the downstream agent's Avro source
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.197.10
a1.sinks.k1.port = 7777
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Agent 2
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/s2.properties
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 7777
# Describe the sink
# Write events out to local files
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /root/data3
a1.sinks.k1.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agents
- Start the second agent first:
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/s2.properties --name a1 -Dflume.root.logger=INFO,console
- Then start the first agent:
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/s1.properties --name a1 -Dflume.root.logger=INFO,console
Test the cluster
Send data to the first agent's netcat port (e.g. telnet 192.168.197.10 9999) and verify that the events appear as files under /root/data3.
HDFS Sink (key topic)
Saves the collected data to HDFS.
# Describe the sink
# Write events to HDFS, partitioned by date and hour
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = gaozhy
a1.sinks.k1.hdfs.fileSuffix = .txt
a1.sinks.k1.hdfs.rollInterval = 15
# Roll files based on the number of events
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
Kafka Sink (key topic)
Writes the collected data to a topic in a Kafka cluster.
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
Optional Components
Interceptors
Timestamp Interceptor
Main role: adds a key/value pair to the Event Header, with key timestamp and value the timestamp at which the data was collected.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
---------------------------------------------------------------------------
19/12/19 23:18:24 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768704952} body: 31 31 0D 11. }
19/12/19 23:18:25 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768705631} body: 32 32 0D 22. }
19/12/19 23:18:26 INFO sink.LoggerSink: Event: { headers:{timestamp=1576768706571} body: 33 33 0D 33. }
Host Interceptor
Main role: adds a key/value pair to the Event Header, with key host and value the agent's IP address or hostname, identifying where the data came from.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
---------------------------------------------------------------------------
19/12/19 23:23:43 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769023573} body: 31 31 0D 11. }
19/12/19 23:23:44 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769024244} body: 32 32 0D 22. }
19/12/19 23:23:44 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576769024854} body: 33 33 0D 33. }
Static Interceptor
Main role: adds a static, user-defined key/value pair to the Event Header, e.g.:
datacenter=bj
datacenter=sh
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
---------------------------------------------------------------------------
19/12/19 23:30:34 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769434926} body: 31 31 0D 11. }
19/12/19 23:30:35 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769435649} body: 32 32 0D 22. }
19/12/19 23:30:36 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, timestamp=1576769436316} body: 33 33 0D 33. }
Remove Header
(Did not work correctly in testing; skipped.)
UUID Interceptor
Main role: adds a unique UUID identifier to the Event Header.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3 i4
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
---------------------------------------------------------------------------
19/12/19 23:42:32 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, id=e7c8855c-6712-4877-ac00-33a2a05d6982, timestamp=1576770152337} body: 31 31 0D 11. }
19/12/19 23:42:33 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, datacenter=bj, id=119029af-606f-47d8-9468-3f5ef78c4319, timestamp=1576770153078} body: 32 32 0D 22. }
Search and Replace
Main role: an interceptor that searches the event body with a regex and replaces the matches.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = search_replace
a1.sources.r1.interceptors.i3.searchPattern = ^ERROR.*$
a1.sources.r1.interceptors.i3.replaceString = abc
---------------------------------------------------------------------------
19/12/19 23:53:29 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770809449} body: 0D . }
19/12/19 23:53:38 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770817952} body: 31 31 0D 11. }
19/12/19 23:53:38 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770818567} body: 32 32 0D 22. }
19/12/19 23:53:53 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770831967} body: 61 62 63 0D abc. }
19/12/19 23:54:15 INFO sink.LoggerSink: Event: { headers:{host=192.168.197.10, timestamp=1576770851674} body: 61 62 63 0D abc. }
Regex Filtering (key topic)
A regex-filtering interceptor: events matching the regular expression pass through; non-matching events are discarded.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^ERROR.*$
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj
Regex Extractor
Extracts the substrings matched by regex groups and adds them to the Event Header. For example, from
192.168.0.3 GET /login.jsp 404
it extracts the IP address and the status code.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d{1,3}.\\d{1,3}.\\d{1,3}.\\d{1,3})\\s\\w*\\s\\/.*\\s(\\d{3})$
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = ip
a1.sources.r1.interceptors.i1.serializers.s2.name = code
---------------------------------------------------------------------------
19/12/20 00:56:31 INFO sink.LoggerSink: Event: { headers:{code=404, ip=192.168.0.3} body: 31 39 32 2E 31 36 38 2E 30 2E 33 20 47 45 54 20 192.168.0.3 GET }
19/12/20 00:56:53 INFO sink.LoggerSink: Event: { headers:{code=200, ip=192.168.254.254} body: 31 39 32 2E 31 36 38 2E 32 35 34 2E 32 35 34 20 192.168.254.254 }
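The extractor's pattern can be sanity-checked outside Flume. In the .properties file each backslash is written as "\\", and a Java string literal doubles backslashes the same way, so the pattern transfers verbatim; a small sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sanity check for the regex_extractor pattern above: both the properties
// file and a Java string literal collapse "\\" to a single regex backslash,
// so the pattern below is copied verbatim from the config.
public class RegexExtractorCheck {
    static final Pattern P = Pattern.compile(
            "(\\d{1,3}.\\d{1,3}.\\d{1,3}.\\d{1,3})\\s\\w*\\s\\/.*\\s(\\d{3})$");

    // Returns {ip, code} as the extractor would add them to the header,
    // or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] kv = extract("192.168.0.3 GET /login.jsp 404");
        System.out.println("ip=" + kv[0] + " code=" + kv[1]); // ip=192.168.0.3 code=404
    }
}
```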
Channel Selectors
Replicating Channel Selector
The data collected by the source is replicated to all of its channels.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = replicating
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
a1.sinks.k2.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Multiplexing Channel Selector (routing)
The data collected by the source is routed to different channels according to the configured routing rules.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(\\w*).*$
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = level
a1.sources.r1.selector.type = multiplexing
# Key in the event header to route on
a1.sources.r1.selector.header = level
a1.sources.r1.selector.mapping.ERROR = c1
a1.sources.r1.selector.mapping.INFO = c2
a1.sources.r1.selector.mapping.DEBUG = c2
a1.sources.r1.selector.default = c1
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
a1.sinks.k2.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
Sink Processors
Also called a sink group: organizes multiple sinks into one group; mainly used for load balancing and failover when writing out.
Failover Sink Processor
A fault-tolerant sink group: events are written to the live sink with the highest configured priority.
Limitation: it detects sink failures, but cannot detect failures of the external storage system itself.
# example.conf: A single-node Flume configuration
# Name the components on this agent
# a1: the agent name (user-defined)
# agent: source[r1] sink[k1] channel[c1]
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# k1 logs events to the console; k2 writes them to HDFS
a1.sinks.k1.type = logger
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k2.hdfs.filePrefix = gaozhy
a1.sinks.k2.hdfs.fileSuffix = .txt
a1.sinks.k2.hdfs.rollInterval = 15
# Roll files based on the number of events
a1.sinks.k2.hdfs.rollCount = 5
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 50
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Load Balancing Sink Processor
Load-balances the sink writes across the group of sinks.
# Describe/configure the source
# Collect data from TCP requests (acts as a TCP server)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.197.10
a1.sources.r1.port = 44444
# Describe the sink
# k1 logs events to the console; k2 writes them to local files
a1.sinks.k1.type = logger
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/data2
# Disable file rolling
a1.sinks.k2.sink.rollInterval = 0
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Log4j Appender
Add the dependencies
<!-- https://mvnrepository.com/artifact/org.apache.flume/flume-ng-sdk -->
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-sdk</artifactId>
<version>1.7.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.7.0</version>
</dependency>
Develop the application
package org.example;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LogTest {

    // logger that produces the log records Flume will collect
    private static Log log = LogFactory.getLog(LogTest.class);

    public static void main(String[] args) {
        log.debug("this is debug"); // record a debug-level log message
        log.info("this is info");
        log.warn("this is warn");
        log.error("this is error");
        log.fatal("this is fatal");
    }
}
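The class above only produces log records; to ship them to a Flume agent, log4j must also be pointed at the Flume appender. A sketch of log4j.properties, assuming an agent with an Avro source listening on 192.168.197.10:8888 (host and port are placeholders; match them to your avro source configuration):

```properties
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = 192.168.197.10
log4j.appender.flume.Port = 8888
# Don't let logging failures break the application if the agent is down
log4j.appender.flume.UnsafeMode = true
```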
Exercise
A data-analysis system based on web-server access logs
Install the Nginx server
# 1. Install nginx's build dependencies
[root@HadoopNode00 conf]# yum install gcc-c++ perl-devel pcre-devel openssl-devel zlib-devel wget
# 2. Upload the Nginx tarball
# 3. Build and install
[root@HadoopNode00 ~]# tar -zxf nginx-1.11.1.tar.gz
[root@HadoopNode00 ~]# cd nginx-1.11.1
[root@HadoopNode00 nginx-1.11.1]# ./configure --prefix=/usr/local/nginx
[root@HadoopNode00 nginx-1.11.1]# make && make install
# 4. Enter the nginx installation directory
[root@HadoopNode00 nginx-1.11.1]# cd /usr/local/nginx/
[root@HadoopNode00 nginx]# ll
total 16
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 conf
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 html
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 logs
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 sbin
# 5. Start the Nginx server
[root@HadoopNode00 nginx]# sbin/nginx -c conf/nginx.conf
[root@HadoopNode00 nginx]# ps -ef | grep nginx
root 62524 1 0 22:11 ? 00:00:00 nginx: master process sbin/nginx -c conf/nginx.conf
nobody 62525 62524 0 22:11 ? 00:00:00 nginx: worker process
root 62575 62403 0 22:11 pts/2 00:00:00 grep nginx
# 6. Visit: http://hadoopnode00/
Log rotation
[root@HadoopNode00 nginx]# vi logsplit.sh
#!/bin/bash
# nginx log-rotation script
# Directory where the log files live
logs_path="/usr/local/nginx/logs/"
# Path to the pid file
pid_path="/usr/local/nginx/logs/nginx.pid"
# Rename the current log file
mv ${logs_path}access.log /usr/local/nginx/data/access_$(date -d "now" +"%Y-%m-%d-%I:%M:%S").log
# Signal the nginx master process to reopen its logs:
# USR1 tells nginx to start a fresh access.log after the old one has been moved away
kill -USR1 `cat ${pid_path}`
[root@HadoopNode00 nginx]# chmod u+x logsplit.sh
[root@HadoopNode00 nginx]# ll
total 44
drwx------. 2 nobody root 4096 Dec 20 22:11 client_body_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 conf
drwxr-xr-x. 2 root root 4096 Dec 20 22:19 data
drwx------. 2 nobody root 4096 Dec 20 22:11 fastcgi_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 html
drwxr-xr-x. 2 root root 4096 Dec 20 22:11 logs
-rwxr--r--. 1 root root 493 Dec 20 22:27 logsplit.sh
drwx------. 2 nobody root 4096 Dec 20 22:11 proxy_temp
drwxr-xr-x. 2 root root 4096 Dec 20 22:10 sbin
drwx------. 2 nobody root 4096 Dec 20 22:11 scgi_temp
drwx------. 2 nobody root 4096 Dec 20 22:11 uwsgi_temp
Scheduled task
Trigger the log rotation periodically, for example every day at 0:00. Linux provides a command for defining scheduled jobs: crontab.
# 1. Enter crontab's edit mode (works like the vi editor)
[root@HadoopNode00 nginx]# crontab -e
# 2. Add a job that periodically triggers the log-rotation script
# Schedule fields (five time units): minute hour day-of-month month day-of-week
# 0 0 * * *   rotate logs every day at 0:00
# */1 * * * * rotate logs every minute
*/1 * * * * /usr/local/nginx/logsplit.sh
Configure the Flume agent
[root@HadoopNode00 apache-flume-1.7.0-bin]# vi conf/nginx.properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Collect the rotated log files from the nginx data directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/nginx/data
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = nginx
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.rollInterval = 15
# Round down the event timestamp in the HDFS path to 10-minute buckets
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Collect the data
[root@HadoopNode00 apache-flume-1.7.0-bin]# start-dfs.sh
[root@HadoopNode00 apache-flume-1.7.0-bin]# bin/flume-ng agent --conf-file conf/nginx.properties --name a1 -Dflume.root.logger=INFO,console
Data cleaning
# 1. Raw data
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET /favicon.ico HTTP/1.1" 404 571 "http://hadoopnode00/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
# 2. Five fields to extract: client IP address, request time, request method, URI, response status code
Regular expression
- See the regular-expression study notes
- Highly recommended: https://regex101.com/
The regular expression:
^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(.*)\]\s"(\w+)\s(.*)\sHTTP\/1.1"\s(\d{3})\s.*$
192.168.197.1 - - [20/Dec/2019:22:14:00 +0800] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
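Before wiring this pattern into a regex_extractor interceptor, it can be checked against the sample line above. A small Java sketch (backslashes doubled for the Java string literal):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sanity check for the access-log regex: extracts client IP, request time,
// method, URI, and status code from one nginx access-log line.
public class NginxLogParser {
    static final Pattern LOG = Pattern.compile(
            "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*\\[(.*)\\]\\s\"(\\w+)\\s(.*)\\sHTTP/1.1\"\\s(\\d{3})\\s.*$");

    // Returns {ip, time, method, uri, status}, or null if the line doesn't match.
    static String[] parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String line = "192.168.197.1 - - [20/Dec/2019:22:14:00 +0800] \"GET / HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0\"";
        String[] f = parse(line);
        System.out.println(f[0] + " | " + f[1] + " | " + f[2] + " | " + f[3] + " | " + f[4]);
    }
}
```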
Computation (omitted)
Data visualization
Present the computed results visually with charts and graphs.