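Flume Configuration and Notes

What Is Flume

A distributed framework for collecting and aggregating streams of event data.
Typically used for log data.
Its advantages over ad-hoc solutions:
- Reliable, scalable, manageable, customizable, and high-performance
- Declarative configuration that can be updated dynamically
- Contextual routing
- Load balancing and failover support
- Feature-rich
- Fully extensible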

Core Concepts

Event
Client
Agent
- Sources, Channels, Sinks
- Other components: Interceptors, Channel Selectors, Sink Processors

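Core Concept: Event

An Event is the basic unit of data that Flume transports. Flume carries data from its origin to its final destination in the form of events. An Event consists of optional headers and a byte array that holds the payload.

The payload is opaque to Flume.
Headers are an unordered collection of key-value string pairs; each key is unique within the collection.
Headers can be used for contextual routing.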

import java.util.Map;

public interface Event {
    // Headers: key-value string pairs attached to the event
    Map<String, String> getHeaders();
    void setHeaders(Map<String, String> headers);

    // Body: the payload bytes, opaque to Flume
    byte[] getBody();
    void setBody(byte[] body);
}
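As a rough illustration of the interface above, here is a minimal, self-contained sketch of an in-memory event. The Event interface is repeated locally so the snippet compiles without Flume on the classpath; DemoEvent and EventDemo are hypothetical names used only for this example and are not Flume classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Local copy of the interface shown above, so the sketch needs no Flume jar.
interface Event {
    Map<String, String> getHeaders();
    void setHeaders(Map<String, String> headers);
    byte[] getBody();
    void setBody(byte[] body);
}

// Hypothetical minimal implementation: a header map plus an opaque byte array.
class DemoEvent implements Event {
    private Map<String, String> headers = new HashMap<>();
    private byte[] body = new byte[0];

    public Map<String, String> getHeaders() { return headers; }
    public void setHeaders(Map<String, String> headers) { this.headers = headers; }
    public byte[] getBody() { return body; }
    public void setBody(byte[] body) { this.body = body; }
}

class EventDemo {
    // Wrap one raw log line as an event and tag it with a "host" header.
    static Event wrap(String logLine, String host) {
        Event e = new DemoEvent();
        e.setBody(logLine.getBytes(StandardCharsets.UTF_8));
        e.getHeaders().put("host", host);
        return e;
    }

    public static void main(String[] args) {
        Event e = wrap("GET /index.html 200", "web01");
        System.out.println(e.getHeaders() + " " + e.getBody().length + " bytes");
    }
}
```

The body is stored as raw bytes and never parsed, which is what "opaque to Flume" means in practice; only the headers are meant to be inspected for routing.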

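Core Concept: Client

A Client is an entity that wraps raw log data into events and sends them to one or more agents.

Examples:
- Flume log4j Appender
- A custom client built with the Client SDK (org.apache.flume.api)
The purpose is to decouple Flume from the system that produces the data.
A client is not a required part of a Flume topology.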

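Core Concept: Agent

An Agent contains Sources, Channels, Sinks, and other components, and uses them to move events from one node to the next or to the final destination.

The agent is the basic building block of a Flume flow.
Flume provides configuration, lifecycle management, and monitoring support for these components.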

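Core Concept: Source

A Source receives events, or generates them through some specific mechanism, and puts them in batches into one or more Channels. There are two kinds of Source: event-driven and pollable.

Types of Source:
- Sources that integrate with well-known systems: Syslog, Netcat
- Sources that generate events themselves: Exec, SEQ
- IPC sources for agent-to-agent communication: Avro
A Source must be associated with at least one channel.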

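Core Concept: Channel

A Channel sits between a Source and a Sink and buffers incoming events. Once the Sink has successfully delivered events to the next-hop channel or to the final destination, they are removed from the Channel.

Different Channels offer different levels of durability:
- Memory Channel: volatile
- File Channel: backed by a WAL (Write-Ahead Log)
- JDBC Channel: backed by an embedded database
Channels are transactional.
They provide weak ordering guarantees.
A Channel can work with any number of Sources and Sinks.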
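The transactional behaviour described above can be sketched with a toy in-memory queue. This is only a simplified model under stated assumptions, not Flume's channel implementation: ToyChannel and its methods are hypothetical names, and a single consumer is assumed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy channel: events taken in a transaction are only
// removed for good on commit; rollback puts them back at the head.
class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> inFlight = new ArrayList<>();

    // Source side: buffer one event.
    void put(String event) { queue.addLast(event); }

    // Sink side: take up to batchSize events; they stay "in flight"
    // until the sink calls commit() or rollback().
    List<String> take(int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.pollFirst());
        }
        inFlight.addAll(batch);
        return batch;
    }

    // Delivery succeeded: forget the in-flight events.
    void commit() { inFlight.clear(); }

    // Delivery failed: return in-flight events to the head, preserving order.
    void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            queue.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    int size() { return queue.size(); }

    public static void main(String[] args) {
        ToyChannel ch = new ToyChannel();
        ch.put("a");
        ch.put("b");
        ch.take(2);
        ch.rollback(); // simulated delivery failure: nothing is lost
        System.out.println(ch.size());
    }
}
```

The point of the sketch is that an event leaves the channel only after the downstream side confirms delivery, which is what lets a flow survive a failed hop.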

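Core Concept: Sink

A Sink delivers events to the next hop or to the final destination, and removes them from the channel once delivery succeeds.

Types of Sink:
- Terminal sinks that store events at the final destination, e.g. HDFS, HBase
- Sinks that simply consume events, e.g. Null Sink
- IPC sinks for agent-to-agent communication: Avro
A Sink must be attached to exactly one channel.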

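Flow Reliability

Reliability is based on:
- Transactional exchange between agents
- The durability of the Channels in the flow
Availability:
- Built-in load balancing support
- Built-in failover support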

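Core Concept: Interceptor

A chain of Interceptors can be attached to a Source; they are applied in a preset order to decorate and filter events where needed.

Built-in interceptors can add headers to an event, such as a timestamp, the hostname, or a static tag.
Custom interceptors can inspect the event payload (read the raw log) and create specific headers where needed.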
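The decorate-and-filter idea can be sketched as a small chain of functions. The types below (Ev, Interceptor, InterceptorDemo) are hypothetical stand-ins written for this example, not Flume's own interceptor API, and the "datacenter" header is an assumed tag name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InterceptorDemo {
    // Hypothetical event: a header map plus a text body.
    static class Ev {
        final Map<String, String> headers = new HashMap<>();
        final String body;
        Ev(String body) { this.body = body; }
    }

    // An interceptor returns the event (possibly decorated) or null to drop it.
    interface Interceptor { Ev intercept(Ev e); }

    // Decorator: stamp the event with a timestamp header.
    static Interceptor timestamp(long nowMillis) {
        return e -> { e.headers.put("timestamp", Long.toString(nowMillis)); return e; };
    }

    // Decorator: add a fixed key-value header.
    static Interceptor staticHeader(String key, String value) {
        return e -> { e.headers.put(key, value); return e; };
    }

    // Filter: drop events whose body contains the given text.
    static Interceptor dropIfContains(String text) {
        return e -> e.body.contains(text) ? null : e;
    }

    // Run the interceptors in their configured order; stop if one drops the event.
    static Ev run(List<Interceptor> chain, Ev e) {
        for (Interceptor i : chain) {
            e = i.intercept(e);
            if (e == null) return null;
        }
        return e;
    }

    public static void main(String[] args) {
        List<Interceptor> chain = Arrays.asList(
            timestamp(System.currentTimeMillis()),
            staticHeader("datacenter", "dc1"),
            dropIfContains("DEBUG"));
        Ev kept = run(chain, new Ev("INFO started"));
        System.out.println(kept.headers);
        System.out.println(run(chain, new Ev("DEBUG noise")) == null);
    }
}
```

Order matters in such a chain: a filter placed first saves the work of decorating events that are about to be dropped.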

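Core Concept: Channel Selector

A Channel Selector lets a Source pick one or more Channels, out of all the Channels configured for it, according to preset criteria.

Built-in Channel Selectors:
- Replicating: the event is copied to every associated channel
- Multiplexing: the event is routed to specific channels based on a header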
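The difference between the two built-in selectors can be sketched as two small functions. SelectorDemo, the channel names, and the "level" header are hypothetical and exist only for this illustration; Flume's real selectors are configured through properties rather than called like this.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SelectorDemo {
    // Replicating: every configured channel receives a copy of the event.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: look up the value of one header in a mapping table;
    // fall back to a default channel list when there is no match.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaults) {
        String value = headers.get(headerName);
        List<String> target = (value == null) ? null : mapping.get(value);
        return (target != null) ? target : defaults;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("error", Arrays.asList("alertChannel"));

        Map<String, String> headers = new HashMap<>();
        headers.put("level", "error");

        System.out.println(replicating(Arrays.asList("c1", "c2")));
        System.out.println(multiplexing(headers, "level", mapping, Arrays.asList("c1")));
    }
}
```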

Core Concept: Sink Processor

Several Sinks can form a Sink Group. A Sink Processor is responsible for activating one Sink from a given Sink Group. It can balance load across all the Sinks in the group, or fail over to another Sink when one fails.

Flume implements load balancing and failover through Sink Processors.
Built-in Sink Processors:
- Load Balancing Sink Processor: uses RANDOM, ROUND_ROBIN, or a custom selection algorithm
- Failover Sink Processor
- Default Sink Processor (single Sink)
All Sinks fetch events from their Channel by polling. The polling is driven by the Sink Runner.
The Sink Processor acts as a proxy for the Sinks.
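The two selection strategies can be sketched as follows. This is a simplified model, not Flume's processor code: SinkProcessorDemo, the sink names, and the health check passed in as a predicate are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class SinkProcessorDemo {
    private int next = 0; // round-robin cursor

    // Round-robin load balancing: rotate through the sinks in the group.
    String roundRobin(List<String> sinks) {
        String chosen = sinks.get(next % sinks.size());
        next = (next + 1) % sinks.size();
        return chosen;
    }

    // Failover: sinks are listed from highest to lowest priority;
    // use the first one that is currently healthy, or null if none is.
    static String failover(List<String> sinksByPriority, Predicate<String> isHealthy) {
        for (String sink : sinksByPriority) {
            if (isHealthy.test(sink)) return sink;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("k1", "k2");
        SinkProcessorDemo lb = new SinkProcessorDemo();
        System.out.println(lb.roundRobin(group) + " " + lb.roundRobin(group));
        // Pretend k1 is down: failover picks k2.
        System.out.println(failover(group, s -> !s.equals("k1")));
    }
}
```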

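Summary

Flume Installation and Deployment

Watch out for permissions when collecting and receiving logs with Flume. On Ubuntu the default user is not root, and some logs, such as the system logs, can only be read with root privileges.

Flume log collection:

Starting from a script:

nohup flume-ng agent -n agent -c conf -f flume/conf/flume-node.conf & (start in the background using the agent name and the configuration file)

nohup flume-ng agent -n agent -c conf -f flume/conf/flume-master.conf & (the first "agent" is the flume-ng subcommand; the "agent" after -n is the agent name used in the configuration file)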

The following configures multiple machines to ship logs to a Hadoop cluster:

Master configuration:

Final parameters:

agent.sources = source1
agent.channels = memoryChannel
agent.sinks = sink1

# For each one of the sources, the type is defined
agent.sources.source1.type = avro
# Listen on this machine's IP and port to receive logs
agent.sources.source1.bind = 172.28.0.61
agent.sources.source1.port = 23004
# Use the memory channel
agent.sources.source1.channels = memoryChannel

# Each sink's type must be defined
#agent.sinks.loggerSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 10000
agent.channels.memoryChannel.keep-alive = 1000

agent.sinks.sink1.type = hdfs
#agent.sinks.sink1.hdfs.path=hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d/%H%M%S
# Write to the Hadoop cluster
agent.sinks.sink1.hdfs.path = hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = TEXT
agent.sinks.sink1.hdfs.round = true
agent.sinks.sink1.hdfs.roundValue = 5
agent.sinks.sink1.hdfs.roundUnit = minute
agent.sinks.sink1.hdfs.rollInterval = 300
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.callTimeout = 100000
agent.sinks.sink1.hdfs.request-timeout = 100000
agent.sinks.sink1.hdfs.connect-timeout = 80000
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.channel = memoryChannel
agent.sinks.sink1.hdfs.filePrefix = ats-
#agent.sinks.k1.hdfs.fileSuffix=.log
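To make the hdfs.path escapes and the hdfs.round settings above more concrete, here is a rough sketch of how a timestamp could be rounded down and substituted into the path. It only models the behaviour for the escapes used in this article (%y, %m, %d, %H, %M, %S) and is not Flume's own code; PathEscapeDemo and its method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

class PathEscapeDemo {
    // Round a timestamp down to a multiple of roundValue minutes,
    // modelling hdfs.round=true with hdfs.roundUnit=minute.
    static LocalDateTime roundDownMinutes(LocalDateTime t, int roundValue) {
        LocalDateTime toMinute = t.truncatedTo(ChronoUnit.MINUTES);
        return toMinute.minusMinutes(toMinute.getMinute() % roundValue);
    }

    // Expand the escapes: %y is the two-digit year, %m the month, %d the day,
    // %H the hour (00-23), %M the minute, %S the second.
    static String expand(String path, LocalDateTime t) {
        return path
            .replace("%y", t.format(DateTimeFormatter.ofPattern("yy")))
            .replace("%m", t.format(DateTimeFormatter.ofPattern("MM")))
            .replace("%d", t.format(DateTimeFormatter.ofPattern("dd")))
            .replace("%H", t.format(DateTimeFormatter.ofPattern("HH")))
            .replace("%M", t.format(DateTimeFormatter.ofPattern("mm")))
            .replace("%S", t.format(DateTimeFormatter.ofPattern("ss")));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        LocalDateTime rounded = roundDownMinutes(now, 5);
        System.out.println(expand("/hmbbs/%y-%m-%d/%H%M%S", rounded));
    }
}
```

With a path that contains only %y-%m-%d, as in the active setting above, rounding to five minutes does not change the resulting directory; it matters for the commented-out variant that also uses %H%M%S.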

Node configuration:

agent.sources = exec-source
agent.sinks = sink1
agent.channels = memoryChannel

agent.sources.exec-source.type = exec
# File to monitor
agent.sources.exec-source.command = tail -F /home/mike/flumelog/tt.log
agent.sources.exec-source.channels = memoryChannel

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.keep-alive = 1000
# Duplicate key in the original; left active, a later duplicate would normally override the memory type above
#agent.channels.memoryChannel.type=file

agent.sinks.sink1.type = avro
# Address and port of the receiving end, i.e. the master
agent.sinks.sink1.hostname = 172.28.0.61
agent.sinks.sink1.port = 23004
agent.sinks.sink1.channel = memoryChannel
#agent.sinks.sink1.rollInterval = 1000

#agent.sinks.hdfs-sink.type=hdfs
#agent.sinks.hdfs-sink.hdfs.path=hdfs://<Host-Name of name node>/
#agent.sinks.hdfs-sink.hdfs.filePrefix=apacheaccess
#agent.channels.ch1.type=memory
#agent.channels.ch1.capacity=1000
#agent.sources.exec-source.channels=ch1
#agent.sinks.hdfs-sink.channel=ch1

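Main parameters:

agent.sinks.sink1.hdfs.rollInterval=30 (roll to a new file based on elapsed time; unit: seconds)

agent.sinks.sink1.hdfs.rollSize=0 (roll based on file size; unit: bytes; 0 disables this trigger)

agent.sinks.sink1.hdfs.rollCount=0 (roll based on the number of events, e.g. lines; 0 disables this trigger)

hdfs.request-timeout=80000 (unit: milliseconds)

agent.sinks.sink1.hdfs.connect-timeout=80000 (unit: milliseconds)

connect-timeout: Amount of time (ms) to allow for the first (handshake) request.

request-timeout: Amount of time (ms) to allow for requests after the first. (This one can be set higher.)

These two descriptions are the ones the Flume documentation gives for the Avro sink options of the same names.

The three roll parameters control how often a new file is started. For changes to take effect, remember to delete the files under .flume/ on the client and then restart the client.

Set the timeouts generously; otherwise errors are likely.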
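How the three roll parameters combine can be sketched as a single decision function. This is a simplified model that assumes a threshold of 0 disables the corresponding trigger, as in the settings above; it is not the HDFS sink's actual code, and RollPolicyDemo and its parameter names are hypothetical.

```java
class RollPolicyDemo {
    // Decide whether the current file should be closed and a new one started.
    // A threshold of 0 disables that trigger.
    static boolean shouldRoll(long ageSeconds, long sizeBytes, long eventCount,
                              long rollInterval, long rollSize, long rollCount) {
        boolean byTime = rollInterval > 0 && ageSeconds >= rollInterval;
        boolean bySize = rollSize > 0 && sizeBytes >= rollSize;
        boolean byCount = rollCount > 0 && eventCount >= rollCount;
        return byTime || bySize || byCount;
    }

    public static void main(String[] args) {
        // rollInterval=30, rollSize=0, rollCount=0: only elapsed time triggers a roll.
        System.out.println(shouldRoll(10, 5_000_000, 90_000, 30, 0, 0));
        System.out.println(shouldRoll(30, 100, 1, 30, 0, 0));
    }
}
```

Setting rollSize and rollCount to 0 while keeping rollInterval positive, as the configurations in this article do, means files are closed purely on a time schedule regardless of how much data they hold.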

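Using Flume in a real environment:

ATS log processing:

ATS writes its logs in a fixed format and produces a new file in that format at regular intervals. A scheduled job copies each file into the directory that Flume monitors and then moves the log file to a backup directory, where the raw log stays available for other uses. This works around the fact that Flume's log collection is not fully real-time.

Notes:

2. Set the OPTS in the flume-ng startup script under /bin to a larger value; otherwise an out-of-memory error is reported. The default is 20 MB, as shown here:

JAVA_OPTS="-Xmx20m"

3. On the server side, the memory channel's capacity and transactionCapacity must be set larger than the client's; otherwise an error like the following is reported:

1] (org.apache.flume.source.AvroSource.appendBatch:261) - Avro source r1: Unable to process event batch. Exception follows.

org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: ..}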
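Why an undersized server-side channel rejects a batch can be sketched with a toy check. This is a deliberately simplified model of the two limits, not the memory channel's real logic (which, among other things, waits up to keep-alive before giving up); CapacityCheckDemo and its parameters are hypothetical.

```java
class CapacityCheckDemo {
    // Simplified model: a put of batchSize events fails if the batch exceeds
    // the per-transaction limit or if the channel lacks free space.
    static boolean canPut(int batchSize, int queued, int capacity, int transactionCapacity) {
        if (batchSize > transactionCapacity) return false;
        return queued + batchSize <= capacity;
    }

    public static void main(String[] args) {
        // A client sending batches of 1000 events to a server channel whose
        // transactionCapacity is only 100 is rejected even when the channel is empty.
        System.out.println(canPut(1000, 0, 10000, 100));
        // With capacity and transactionCapacity both at 10000 the same batch fits.
        System.out.println(canPut(1000, 0, 10000, 10000));
    }
}
```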

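ERROR hdfs.BucketWriter: Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication

This error appears when parameters are changed or right after startup, because some small files are generated at that point; after that the agent runs normally.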

Latest configuration: replicate events to both HDFS and a local file

# Master configuration, flume-master.conf
# Define the source, channels, and sinks
agent.sources = source1
agent.channels = memoryChannel1 memoryChannel2
agent.sinks = sink1 sink2

# Source parameters: this machine's IP address and port; adjust to your environment
agent.sources.source1.type = avro
agent.sources.source1.selector.type = replicating
agent.sources.source1.bind = *.*.*.*
agent.sources.source1.port = 23004
agent.sources.source1.channels = memoryChannel1 memoryChannel2

# Add a timestamp interceptor; otherwise an exception is thrown at runtime
agent.sources.source1.interceptors = i1
agent.sources.source1.interceptors.i1.type = timestamp

# Channel parameters
agent.channels.memoryChannel1.type = memory
agent.channels.memoryChannel1.capacity = 10000
agent.channels.memoryChannel1.transactionCapacity = 10000
agent.channels.memoryChannel1.keep-alive = 1000

agent.channels.memoryChannel2.type = memory
agent.channels.memoryChannel2.capacity = 10000
agent.channels.memoryChannel2.transactionCapacity = 10000
agent.channels.memoryChannel2.keep-alive = 1000

# Sink parameters; change the Hadoop cluster address as needed
#agent.sinks.sink1.hdfs.path=hdfs://172.28.0.61:9000/hmbbs/%y-%m-%d/%H%M%S
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = hdfs://*.*.*.*:9000/sdg
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = TEXT
agent.sinks.sink1.hdfs.round = true
agent.sinks.sink1.hdfs.roundValue = 5
agent.sinks.sink1.hdfs.roundUnit = minute
agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.callTimeout = 100000
agent.sinks.sink1.hdfs.request-timeout = 100000
agent.sinks.sink1.hdfs.connect-timeout = 80000
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.channel = memoryChannel1
agent.sinks.sink1.hdfs.filePrefix = ats-
#agent.sinks.k1.hdfs.fileSuffix=.log

# Write to a local file
agent.sinks.sink2.type = file_roll
agent.sinks.sink2.channel = memoryChannel2
agent.sinks.sink2.sink.rollInterval = 0
#agent.sinks.sink2.sink.serializer=TEXT
#agent.sinks.sink2.sink.batchSize=1000
agent.sinks.sink2.sink.directory = /home/hadoop/atslog/
