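1. Flume Internals

1.1 Flume Basic Architecture

1.2 Flume Components

1.2.1 Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent consists of three main parts: Source, Channel, and Sink.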

1.2.2 Source

A Source is the component that receives data into the Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

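1.2.3 Sink

A Sink continuously polls the Channel for events, removes them in batches, and either writes those batches to a storage or indexing system or sends them on to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.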

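1.2.4 Channel

A Channel is a buffer that sits between the Source and the Sink, which lets the Source and the Sink run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with several Channel types, including Memory Channel, File Channel, and Kafka Channel.
Memory Channel is an in-memory queue. It is suitable when data loss is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose the buffered data.
File Channel writes all events to disk, so no data is lost when the process shuts down or the machine goes down.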
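
As a rough illustration of the durable option, a File Channel could be declared as follows. This is a minimal sketch rather than a configuration from this article: the agent name a1, the channel name c1, and both directory paths are assumptions chosen for the example.

```properties
# Hypothetical agent a1 with a single channel c1
a1.channels = c1
# Persist events on disk instead of holding them in memory
a1.channels.c1.type = file
# Directory for checkpoint files (assumed path)
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
# Comma-separated list of data directories (assumed path)
a1.channels.c1.dataDirs = /opt/module/flume/data
```

Swapping `file` back to `memory` trades this durability for lower latency.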

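1.2.5 Event

An Event is the basic unit of data transfer in Flume; data travels from source to destination as Events.
An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.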

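1.3 Flume Transactions

1.4 Flume Agent Internals

The ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing.
The Replicating selector sends the same Event to every Channel, while the Multiplexing selector routes different Events to different Channels according to configured rules.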
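
A multiplexing selector might be configured along these lines. This is only a sketch: the header name `type`, its values `error` and `access`, and the component names r1 and c1 to c3 are hypothetical and do not come from this article.

```properties
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
# Route each event by the value of one of its headers
a1.sources.r1.selector.type = multiplexing
# Header to inspect (hypothetical name)
a1.sources.r1.selector.header = type
# Events whose header value is "error" go to c1, "access" to c2
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.access = c2
# Anything else falls through to c3
a1.sources.r1.selector.default = c3
```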
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor works with a single Sink, whereas LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover.
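
For the failover case, a Sink Group could be declared roughly as below. The group name g1, the sink names k1 and k2, and the priority and penalty values are illustrative assumptions, not settings from this article.

```properties
a1.sinks = k1 k2
# Put both sinks into one group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Use the failover processor for this group
a1.sinkgroups.g1.processor.type = failover
# The sink with the larger priority value is tried first (k2 here)
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Upper bound, in milliseconds, on the backoff for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
```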

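1.5 Flume Topologies

1.5.1 Simple Chaining

This topology connects multiple Flume agents in sequence, from the first source through to the storage system that the final sink writes to. Chaining too many agents is not recommended: a long chain lowers throughput, and if any agent in the chain goes down, the whole pipeline is affected.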
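
Agents are usually chained by pointing an avro sink on one agent at an avro source on the next. The fragment below is a sketch under assumptions: the agent names a1 and a2, the host name hadoop103, and port 4141 are hypothetical, and each agent would normally have its own configuration file on its own machine.

```properties
# Upstream agent a1: forward events to the next agent over Avro
a1.sinks.k1.type = avro
# Host and port of the downstream agent (assumed values)
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

# Downstream agent a2: receive the events sent by a1
a2.sources.r1.type = avro
# Listen on all interfaces, on the port a1 sends to
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
```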

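1.5.2 Replicating and Multiplexing

Flume supports routing an event flow to one or more destinations. In this topology, the same data can be replicated to several channels, or different data can be distributed to different channels, and each sink can deliver to a different destination.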
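
The replicating case can be sketched with one source feeding two channels, each drained by its own sink. All component names here (r1, c1, c2, k1, k2) are assumptions for illustration; the sink types are left out because they depend on the chosen destinations.

```properties
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# Copy every event from r1 into both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
# Each sink reads from its own channel and can target a different destination
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```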

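1.5.3 Load Balancing and Failover

Flume supports grouping multiple sinks logically into a sink group. Combined with different SinkProcessors, a sink group can provide load balancing and failover.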
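
A load-balancing sink group might look like the sketch below. The group name g1 and the sink names k1 and k2 are hypothetical, and round_robin is just one of the available selection strategies.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Spread events across the sinks in the group
a1.sinkgroups.g1.processor.type = load_balance
# Pick sinks in turn (random is the other built-in choice)
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily skip sinks that have failed
a1.sinkgroups.g1.processor.backoff = true
```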

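1.5.4 Aggregation

This is the most common topology and a very practical one. Everyday web applications are often spread across hundreds of servers, and large ones across thousands or even tens of thousands, so the logs they produce are cumbersome to handle. This arrangement addresses the problem well: each server runs a Flume agent that collects its logs and forwards them to a central collecting Flume agent, which then uploads them to HDFS, Hive, HBase, and so on for log analysis.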

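2. Practical Flume Examples (Common in Big Data)

2.1 The First Official Example

Requirement: use Flume to listen on a port, collect the data arriving on that port, and print it to the console.
Analysis: use the netcat tool to send data to port 44444 on the local machine while Flume monitors port 44444. The source reads the data, and the sink writes it to the console.
Implementation steps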

## Install netcat
sudo yum install -y nc
## Check whether port 44444 is already in use
sudo netstat -tunlp | grep 44444
## Configuration file (job/flume-netcat-logger.conf)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

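a1: the name of the agent
r1: the name of a1's Source
k1: the name of a1's Sink
c1: the name of a1's Channel
a1.sources.r1.type = netcat: a1's input source is of the netcat (port) type
a1.sources.r1.bind = localhost: the host that a1 listens on
a1.sources.r1.port = 44444: the port that a1 listens on
a1.sinks.k1.type = logger: a1 writes its output to the console through the logger sink
a1.channels.c1.type = memory: a1's channel is an in-memory channel
a1.channels.c1.capacity = 1000: the channel holds at most 1000 events
a1.channels.c1.transactionCapacity = 100: each transaction on the channel carries at most 100 events
a1.sources.r1.channels = c1: connects r1 to c1
a1.sinks.k1.channel = c1: connects k1 to c1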

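## Start the agent so that Flume listens on the port
## Option 1
bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

## Option 2
bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
## Parameter reference
--conf/-c: the configuration files are stored in the conf/ directory
--name/-n: the name of the agent, a1, which must match the name used in the configuration file
--conf-file/-f: the configuration file Flume reads for this run, here flume-netcat-logger.conf in the job folder
-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at runtime, here setting the console log level to INFO. Log levels include debug, info, warn, and error.
## Send data with netcat (from a second terminal)
nc localhost 44444
123
hello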

2.2 Reading a Single File into HDFS in Real Time

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll directories by time
a2.sinks.k2.hdfs.round = true
# Number of time units after which a new directory is created
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rolling directories
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type (compressed types are also supported)
a2.sinks.k2.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a2.sinks.k2.hdfs.rollInterval = 30
# File size, in bytes, that triggers a roll
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 makes file rolling independent of the event count
a2.sinks.k2.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Note: the HDFS sink requires the Hadoop JARs to be available to Flume.

2.3 Reading Multiple Appended Files into HDFS in Real Time

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/file.*
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories by time
a3.sinks.k3.hdfs.round = true
# Number of time units after which a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type (compressed types are also supported)
a3.sinks.k3.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 makes file rolling independent of the event count
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

2.4 Reading a Directory into HDFS in Real Time

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/file.*
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories by time
a3.sinks.k3.hdfs.round = true
# Number of time units after which a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rolling directories
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type (compressed types are also supported)
a3.sinks.k3.hdfs.fileType = DataStream
# How often, in seconds, to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size, in bytes, that triggers a roll (roughly 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 makes file rolling independent of the event count
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
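
The configuration above tails files with the TAILDIR source. For picking up completed files that are dropped into a directory, Flume also provides a spooling directory source. The sketch below shows only the source lines that would differ; the directory path and the ignore pattern are assumptions for illustration, not values from this article.

```properties
# Watch a directory for new, completed files
a3.sources.r3.type = spooldir
# Directory to watch (assumed path)
a3.sources.r3.spoolDir = /opt/module/flume/upload
# Suffix appended to a file once it has been fully ingested
a3.sources.r3.fileSuffix = .COMPLETED
# Add a header carrying the absolute path of the source file
a3.sources.r3.fileHeader = true
# Skip files still being written, here anything ending in .tmp (assumed convention)
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
```

Files placed in a spooling directory should not be modified after they arrive, which is why appended log files are better served by TAILDIR.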

2.5 Monitoring a Single Appended File into Kafka in Real Time

log-monitoring.sources = exec-source
log-monitoring.sinks = kafka-sink
log-monitoring.channels = memory-channel

log-monitoring.sources.exec-source.type = exec
log-monitoring.sources.exec-source.command = tail -F /root/lastTask/access.log
log-monitoring.sources.exec-source.shell = /bin/sh -c

log-monitoring.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
log-monitoring.sinks.kafka-sink.brokerList = bigdata1:9092,bigdata2:9092,bigdata3:9092
log-monitoring.sinks.kafka-sink.topic = log_monitoring
log-monitoring.sinks.kafka-sink.batchSize = 5
log-monitoring.sinks.kafka-sink.requiredAcks = 1

log-monitoring.channels.memory-channel.type = memory
log-monitoring.channels.memory-channel.capacity = 1000
log-monitoring.channels.memory-channel.transactionCapacity = 100

log-monitoring.sources.exec-source.channels = memory-channel
log-monitoring.sinks.kafka-sink.channel = memory-channel
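
The sink properties brokerList, topic, batchSize, and requiredAcks used above are the older names; from Flume 1.7 onward the Kafka sink documents replacements for them. A sketch of the same sink with the newer names, assuming a Flume version that supports them, might read:

```properties
log-monitoring.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
# Replaces brokerList
log-monitoring.sinks.kafka-sink.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092
# Replaces topic
log-monitoring.sinks.kafka-sink.kafka.topic = log_monitoring
# Replaces batchSize
log-monitoring.sinks.kafka-sink.flumeBatchSize = 5
# Replaces requiredAcks
log-monitoring.sinks.kafka-sink.kafka.producer.acks = 1
```

The source, channel, and binding lines stay as they are.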
