Flume Learning (3): Using Flume Interceptors

Original

xinlangtianxia

2020-02-20 22:29

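Guiding questions
1. How should Flume interceptors be understood?
2. How can the regex_filter and timestamp interceptors be combined to build something more powerful?
3. How do you add two interceptors to source1?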

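My understanding of Flume interceptors: they are attached to a source and intercept the application's log events as they pass through it. That is, after the source receives an event and before the event is written to the channel, an interceptor can wrap it (for example, by adding headers), clean it, or filter it out.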

The interceptors that ship with the official distribution include:

Timestamp Interceptor
Host Interceptor
Static Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor


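Like the interceptors in many open-source Java projects such as Spring MVC, Flume interceptors form a chain: you can assign several interceptors to one source, and they are applied in the order in which they are listed.
Timestamp Interceptor: adds a key named timestamp to the event header, whose value is the current time in milliseconds. This interceptor is very useful when the sink is HDFS, as the example later in this article shows.
Host Interceptor: adds a key named host to the event header, whose value is the hostname or IP address of the current machine.
Static Interceptor: adds a user-defined key and value to the event header.
Regex Filtering Interceptor: uses a regular expression to either exclude or include the events that match it.
Regex Extractor Interceptor: uses a regular expression to add specified keys to the header, with the matched groups as their values.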
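
To make the chain idea concrete, here is a minimal sketch in plain Java. It does not use Flume's real classes: SimpleEvent, SimpleInterceptor, and the helper methods are hypothetical stand-ins invented for this illustration, assuming only that an interceptor can either return a (possibly modified) event or drop it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified model of an interceptor chain; NOT Flume's actual API.
class InterceptorChainSketch {

    /** Hypothetical minimal event: string headers plus a byte[] body. */
    static final class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SimpleEvent(String body) {
            this.body = body.getBytes(StandardCharsets.UTF_8);
        }

        String bodyAsString() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical interceptor: returns the event to keep it, or null to drop it. */
    interface SimpleInterceptor {
        SimpleEvent intercept(SimpleEvent event);
    }

    /** Keeps only events whose body contains a match for the regex. */
    static SimpleInterceptor regexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return event -> pattern.matcher(event.bodyAsString()).find() ? event : null;
    }

    /** Adds a "timestamp" header holding the current time in epoch milliseconds. */
    static SimpleInterceptor timestamp() {
        return event -> {
            event.headers.put("timestamp", Long.toString(System.currentTimeMillis()));
            return event;
        };
    }

    /** Adds a fixed, user-defined key/value pair to the header. */
    static SimpleInterceptor staticHeader(String key, String value) {
        return event -> {
            event.headers.put(key, value);
            return event;
        };
    }

    /** Runs every event through the interceptors in order; a null result drops the event. */
    static List<SimpleEvent> applyChain(List<SimpleInterceptor> chain, List<SimpleEvent> events) {
        List<SimpleEvent> kept = new ArrayList<>();
        for (SimpleEvent event : events) {
            SimpleEvent current = event;
            for (SimpleInterceptor interceptor : chain) {
                current = interceptor.intercept(current);
                if (current == null) {
                    break; // dropped: later interceptors never see this event
                }
            }
            if (current != null) {
                kept.add(current);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleInterceptor> chain = Arrays.asList(
                regexFilter("\\{.*\\}"), timestamp(), staticHeader("env", "test"));
        List<SimpleEvent> events = Arrays.asList(
                new SimpleEvent("1405499314238"),
                new SimpleEvent("{\"requestTime\":1405499314238}"));
        for (SimpleEvent event : applyChain(chain, events)) {
            System.out.println(event.headers.keySet() + " " + event.bodyAsString());
        }
    }
}
```

Only the second event survives the filter, and it leaves the chain carrying both the timestamp and env headers. Order matters in this model: an event dropped by an early interceptor is never touched by the later ones.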

The examples below show how these interceptors are used. First, we adjust the WriteLog class from the first article:

import java.util.Date;

// Log and LogFactory are assumed to come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class WriteLog {
    protected static final Log logger = LogFactory.getLog(WriteLog.class);

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // Line 1: a bare timestamp (the line that will be filtered out later).
            logger.info(new Date().getTime());
            // Line 2: a JSON-formatted record (the line we want to collect).
            logger.info("{\"requestTime\":"
                    + System.currentTimeMillis()
                    + ",\"requestParams\":{\"timestamp\":1405499314238,\"phone\":\"02038824941\",\"cardName\":\"測試商家名稱\",\"provinceCode\":\"440000\",\"cityCode\":\"440106\"},\"requestUrl\":\"/reporter-api/reporter/reporter12/init.do\"}");
            // Emit one pair of lines every two seconds.
            Thread.sleep(2000);
        }
    }
}


The class now writes one extra log line, so every loop iteration produces two lines: the first is a timestamp and the second is a JSON-formatted string.

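Next, we use the regex_filter and timestamp interceptors to implement the following:
1 Filter out the first line that LOG4J writes (the timestamp line) and collect only the JSON-formatted log lines.
2 Save the collected logs to HDFS, with each day's logs in a directory named after that day. For example, the logs for 2014-07-25 are saved under /flume/events/14-07-25.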

The modified flume.conf looks like this:

tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=avro
tier1.sources.source1.bind=0.0.0.0
tier1.sources.source1.port=44444
tier1.sources.source1.channels=channel1
tier1.sources.source1.interceptors=i1 i2
tier1.sources.source1.interceptors.i1.type=regex_filter
tier1.sources.source1.interceptors.i1.regex=\\{.*\\}
tier1.sources.source1.interceptors.i2.type=timestamp
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.channels.channel1.keep-alive=30
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%y-%m-%d
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.writeFormat=Text
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.rollSize=10240
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.idleTimeout=60


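We added two interceptors, i1 and i2, to source1. i1 is a regex_filter whose regex is \\{.*\\}. Note the escape characters in the regex: without them source1 cannot start and reports an error.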
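
The escaping can be checked outside Flume with a small Java sketch. It assumes that the configuration file is parsed with java.util.Properties rules and that the filter keeps an event when the regex is found anywhere in the body; the class and method names (RegexFilterCheck, loadRegex, keeps, compiles) are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Stand-alone check of the regex used by i1; not part of Flume itself.
class RegexFilterCheck {

    /** Parses one "regex=..." line with java.util.Properties and returns the value. */
    static String loadRegex(String propertiesLine) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesLine));
        return props.getProperty("regex");
    }

    /** True if the body contains a match, i.e. an including filter would keep it. */
    static boolean keeps(String regex, String body) {
        return Pattern.compile(regex).matcher(body).find();
    }

    /** True if Java can compile the regex at all. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // File text: regex=\\{.*\\}  (each backslash is doubled in the file)
        String escaped = loadRegex("regex=\\\\{.*\\\\}");
        System.out.println(escaped);                                           // \{.*\}
        System.out.println(keeps(escaped, "1405499314238"));                   // false
        System.out.println(keeps(escaped, "{\"requestTime\":1405499314238}")); // true

        // File text: regex=\{.*\}  (single backslashes are consumed by the parser)
        String unescaped = loadRegex("regex=\\{.*\\}");
        System.out.println(unescaped);                                         // {.*}
        System.out.println(compiles(unescaped));                               // false
    }
}
```

Under these assumptions, the doubled backslashes in the file reach the regex engine as \{ and \}, which match literal braces. With single backslashes the parser drops them, leaving a bare { that Java rejects as an illegal repetition.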
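i2 is a timestamp interceptor, which adds a timestamp key to the header. We then changed sink1.hdfs.path by appending /%y-%m-%d to it. These placeholders require the event header to contain a timestamp key, which is why the timestamp interceptor is needed: without it (and unless the sink is told to use local time through hdfs.useLocalTimeStamp), the placeholders cannot be resolved and the sink reports an error. Many other placeholders are available; see the official documentation.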
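
As a rough illustration of how a date-based directory can be derived from the timestamp header, the sketch below formats the header value with the yy-MM-dd pattern. It is a simplified model rather than the HDFS sink's own code: HdfsPathSketch and resolve are hypothetical names, and the time zone is passed in explicitly so the result is reproducible.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Collections;
import java.util.Map;

// Simplified model of resolving a %y-%m-%d style directory; NOT the HDFS sink's code.
class HdfsPathSketch {

    /** Builds basePath/yy-MM-dd from the "timestamp" header (epoch milliseconds). */
    static String resolve(String basePath, Map<String, String> headers, ZoneId zone) {
        String timestamp = headers.get("timestamp");
        if (timestamp == null) {
            // Mirrors the situation described above: no timestamp header, no date directory.
            throw new IllegalStateException("event header has no timestamp key");
        }
        DateTimeFormatter dayFormat = DateTimeFormatter.ofPattern("yy-MM-dd").withZone(zone);
        return basePath + "/" + dayFormat.format(Instant.ofEpochMilli(Long.parseLong(timestamp)));
    }

    public static void main(String[] args) {
        Map<String, String> headers = Collections.singletonMap("timestamp", "1405499314238");
        // 1405499314238 ms is 2014-07-16 08:28:34 UTC
        System.out.println(resolve("/flume/events", headers, ZoneOffset.UTC)); // /flume/events/14-07-16
    }
}
```

Calling resolve with an empty header map throws, which corresponds to the error seen when the timestamp interceptor is missing.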

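Then run WriteLog and look at the files in the corresponding directory on HDFS. They contain only the JSON log lines, which matches the behavior described above.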
