Flume Setup and Debugging

Installing CDH3

https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation

wget http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo -O /etc/yum.repos.d/cloudera.repo
 
yum search hadoop
yum -y install hadoop-0.20
 
yum -y install hadoop-0.20-namenode
yum -y install hadoop-0.20-datanode
#yum -y install hadoop-0.20-secondarynamenode
yum -y install hadoop-0.20-jobtracker
yum -y install hadoop-0.20-tasktracker

Installing CDH3 Components

https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation#CDH3Installation-InstallingCDH3Components

Install each with yum install <package>:

Component       Package
---------------------------------
Flume           flume
Sqoop           sqoop
Hue             hue
Pig             hadoop-pig
Hive            hadoop-hive
HBase           hadoop-hbase
ZooKeeper       hadoop-zookeeper
Oozie server    oozie
Oozie client    oozie-client
Whirr           whirr
Snappy          hadoop-0.20-native
Mahout          mahout

Flume is split into several packages:

flume           the core
flume-node      service init script for running a node
flume-master    service init script for running a master

yum install flume*

[root@flume-hadoop-node-1 ~]# flume
usage: flume command [args...]
commands include: 
  dump            Takes a specified source and dumps to console
  source          Takes a specified source and dumps to console
  node            Start a Flume node/agent (with watchdog)
  master          Start a Flume Master server (with watchdog)
  version         Dump flume build version information 
  node_nowatch    Start a flume node/agent (no watchdog)
  master_nowatch  Start a Flume Master server (no watchdog)
  class <class>   Run specified fully qualified class using Flume environment (no watchdog)
                   ex: flume com.cloudera.flume.agent.FlumeNode 
  classpath       Dump the classpath used by the java executables
  shell           Start the flume shell
  killmaster      Kill a running master
  dumplog         Takes a specified WAL/DFO log file and dumps to console
  sink            Start a one-shot flume node with console source and specified sink 
 
 
 
cd /etc/flume/conf
mv flume-site.xml.template flume-site.xml
vi flume-site.xml
 
# Change the master host to your own host
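The property to set is flume.master.servers. A minimal sketch (substitute your own master's hostname):

<property>
  <name>flume.master.servers</name>
  <value>flume-hadoop-node-1</value>
</property>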
 
 
/etc/init.d/flume-master  start
/etc/init.d/flume-node start
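
If the services should also come up on boot, the usual SysV tooling applies (optional):

chkconfig flume-master on
chkconfig flume-node on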

Flume documentation

http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html

Overall, Flume is stream-oriented by design: a "source" produces events and a "sink" consumes them. Both push and pull are supported, and it can be extended with all kinds of data sources and processing steps, which makes it very flexible.

First stop the services and run Flume in the foreground, which makes it easy to watch the output and get an intuitive feel for it.

/etc/init.d/flume-master stop && /etc/init.d/flume-node stop

Start Flume:

flume dump console

Once it is running, type any characters and Flume will echo them back. The console argument configures Flume's source as console input; the default sink is also the console.

Using a file as the source:
flume dump 'text("/etc/services")'

Tailing the end of a file:
flume dump 'tail("testfile")'

testfile does not have to exist yet; that is fine. In another console, create the file and add some content:

[root@flume-hadoop-node-1 tmp]# echo "test flume">testfile
[root@flume-hadoop-node-1 tmp]# echo "test flume 123">testfile
[root@flume-hadoop-node-1 tmp]# echo "test flume 123">>testfile
[root@flume-hadoop-node-1 tmp]# echo "test flume 1234">>testfile
[root@flume-hadoop-node-1 tmp]# echo "test flume 12345\r\n123456">>testfile

On the Flume side, you can see the feedback in real time:

2012-01-06 20:42:55,818 [main] INFO agent.LogicalNodeManager: Loading node name with FlumeConfigData: {srcVer:'Thu Jan 01 08:00:00 CST 1970' snkVer:'Thu Jan 01 08:00:00 CST 1970'  ts='Thu Jan 01 08:00:00 CST 1970' flowId:'null' source:'tail( "testfile" )' sink:'console' }
2012-01-06 20:42:55,836 [main] INFO agent.LogicalNode: Node config successfully set to FlumeConfigData: {srcVer:'Thu Jan 01 08:00:00 CST 1970' snkVer:'Thu Jan 01 08:00:00 CST 1970'  ts='Thu Jan 01 08:00:00 CST 1970' flowId:'null' source:'tail( "testfile" )' sink:'console' }
2012-01-06 20:42:55,920 [logicalNode dump-10] INFO debug.ConsoleEventSink: ConsoleEventSink( debug ) opened
2012-01-06 20:42:55,973 [main] INFO agent.FlumeNode: Hadoop Security enabled: false
flume-hadoop-node-1 [INFO Fri Jan 06 20:43:21 CST 2012] { tailSrcFile : (long)8387236824819002469  (string) 'testfile' (double)4.914663849160389E252 } test flume
flume-hadoop-node-1 [INFO Fri Jan 06 20:43:36 CST 2012] { tailSrcFile : (long)8387236824819002469  (string) 'testfile' (double)4.914663849160389E252 } 123
flume-hadoop-node-1 [INFO Fri Jan 06 20:43:48 CST 2012] { tailSrcFile : (long)8387236824819002469  (string) 'testfile' (double)4.914663849160389E252 } test flume 123
flume-hadoop-node-1 [INFO Fri Jan 06 20:43:56 CST 2012] { tailSrcFile : (long)8387236824819002469  (string) 'testfile' (double)4.914663849160389E252 } test flume 1234
flume-hadoop-node-1 [INFO Fri Jan 06 20:44:11 CST 2012] { tailSrcFile : (long)8387236824819002469  (string) 'testfile' (double)4.914663849160389E252 } test flume 12345\\r\\n123456

Multiple files work too:

flume dump 'multitail("test1", "test2")'

By default, tail turns each line of the file into a separate event. The default delimiter is "\n", and the delimiter itself is not excluded. You can also define a custom delimiter (as a regular expression); three delimiter modes are supported:
"prev": the delimiter belongs to the previous event
"next": the delimiter belongs to the next event
"exclude": the delimiter is discarded

tail("file", delim="\n\n+", delimMode="exclude")
tail("file", delim="</a>", delimMode="prev")

Start a UDP service listening on port 5140:

 flume dump 'syslogUdp(5140)'
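
To exercise it, a syslog-style line can be sent over UDP, e.g. with nc (a sketch; the message text is arbitrary):

echo '<13>Feb 5 17:32:18 localhost test: hello flume' | nc -u -w1 127.0.0.1 5140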

flume web console

http://10.129.8.125:35871/flumemaster.jsp

Cloudera Manager Free Edition

https://ccp.cloudera.com/display/express37/Cloudera+Manager+Free+Edition+Documentation

wget http://archive.cloudera.com/cloudera-manager/installer/latest/cloudera-manager-installer.bin
chmod a+x cloudera-manager-installer.bin 
./cloudera-manager-installer.bin

Before installing, disable SELinux:

vi /etc/selinux/config 
--
SELINUX=disabled
--
 
 
setenforce 0

./cloudera-manager-installer.bin
The install failed. The logs showed that the packages could not be downloaded, so the only option was to download and install them manually.

Manually install the JDK:

wget http://archive.cloudera.com/cloudera-manager/redhat/5/x86_64/cloudera-manager/3/RPMS/jdk-6u21-linux-amd64.rpm
rpm -Uhv jdk-6u21-linux-amd64.rpm

http://archive.cloudera.com/cloudera-manager/redhat/5/x86_64/cloudera-manager/3/RPMS/cloudera-manager-daemons-3.7.2.143-1.noarch.rpm

---------------------------- (a positively gorgeous divider) ----------------------------

Two machines: 125 and 126.
Configuration on 125:

vi /etc/flume/conf/flume-site.xml
 
<property>
  <name>flume.collector.event.host</name>
  <value>collector</value>
  <description>This is the host name of the default "remote" collector.</description>
</property>
<property>
  <name>flume.collector.port</name>
  <value>35853</value>
  <description>This is the default TCP port that the collector listens to in order to receive events it is collecting.</description>
</property>
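
Note that "collector" in flume.collector.event.host is a hostname; it has to resolve on the agent machines, e.g. with an /etc/hosts entry (a sketch, assuming 125 runs the collector):

10.129.8.125    collector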

Start the Flume nodes:

 flume node -n collector

HDFS server setup (fresh configuration)

hdfs://10.129.8.126/

 
cp  /usr/lib/hadoop/example-confs/conf.pseudo/*  /etc/hadoop/conf/
 
 
mkdir /var/lib/hadoop-0.20/cache/hadoop/dfs/name -p
chmod 777 -R /var/lib/hadoop-0.20/
 
sudo -u hdfs hadoop namenode -format   # note: confirm the prompt with an uppercase Y
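
Then bring up the daemons (the same init scripts used later in this walkthrough):

/etc/init.d/hadoop-0.20-namenode start
/etc/init.d/hadoop-0.20-datanode start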
 
 
 
[root@cloudera-node-1 logs]# hadoop fs -ls hdfs://127.0.0.1/
ls: Wrong FS: hdfs://127.0.0.1/, expected: hdfs://cloudera-node-1
Usage: java FsShell [-ls <path>]
[root@cloudera-node-1 logs]# hadoop fs -ls hdfs://cloudera-node-1
ls: Pathname  from hdfs://cloudera-node-1 is not a valid DFS filename.
Usage: java FsShell [-ls <path>]
[root@cloudera-node-1 logs]# hadoop fs -ls hdfs://cloudera-node-1/
[root@cloudera-node-1 logs]# hadoop fs -mkdir  hdfs://cloudera-node-1/test
[root@cloudera-node-1 logs]# hadoop fs -ls hdfs://cloudera-node-1/
Found 1 items
drwxr-xr-x   - root supergroup          0 2012-02-03 00:54 /test
[root@cloudera-node-1 logs]#

Modify the Hadoop configuration to use the external IP:

 
vi /etc/hadoop/conf/core-site.xml
 
<property>
    <name>fs.default.name</name>
    <value>hdfs://10.129.8.126:8020</value>
  </property>
 
/etc/init.d/hadoop-0.20-namenode restart
 
[root@cloudera-node-1 logs]# hadoop fs -ls hdfs://10.129.8.126/
Found 1 items
drwxr-xr-x   - root supergroup          0 2012-02-03 00:54 /test

Set access permissions:

 
hadoop dfs -chmod 777  hdfs://10.129.8.126/flume/
hadoop dfs -chmod 777  hdfs://10.129.8.126/flume/*
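
If /flume does not exist yet, create it first (a sketch):

hadoop fs -mkdir hdfs://10.129.8.126/flume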

On node 126, start Flume:

flume node_nowatch

Open the Flume master:

http://10.129.8.125:35871/flumemaster.jsp

cloudera-node-1 : text("/etc/services") | agentSink("10.129.8.125",35853);
collector : collectorSource(35853) | collectorSink("hdfs://10.129.8.126/flume/","srcdata");

From the user guide's "Flume's Tiered Event Sources" section:

collectorSource[(port)]
Collector source. Listens for data from agentSinks forwarding to port "port". If the port is not specified, the node default collector TCP port, 35853, is used.

hadoop dfs -ls hdfs://10.129.8.126/flume/

The error on 125:

 
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /flume/srcdata20120203-013616957+0800.2438481505068540.00000021.tmp could only be replicated to 0 nodes, instead of 1
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
 
	at org.apache.hadoop.ipc.Client.call(Client.java:1107)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
	at $Proxy6.addBlock(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy6.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3178)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3047)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1900(DFSClient.java:2305)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2500)

vi /etc/hadoop/conf/hdfs-site.xml
Setting the replication factor to 0 did not help either.
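
For reference, the replication setting in question is presumably dfs.replication; for a single datanode, 1 rather than 0 is the value that makes sense (a sketch):

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>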

 
<delete>
In vi /etc/hadoop/conf/hdfs-site.xml, set:
<property>
    <name>dfs.thrift.address</name>
    <value>10.129.8.126:10090</value>
</property>
</delete>

vi /etc/hadoop/conf/masters
Replace localhost with the IP 10.129.8.126.

Still not working. Manually run the upload from 125:

 
vi a.txt
hadoop dfs -put a.txt hdfs://10.129.8.126/flume/srcdata20120203-014405668+0800.2438950215947540.00000019.tmp.1

Same error.

Running the same upload on 126 hits the same error. Damn it.

It looks like the datanode is down even though the service reports it as running. Try restarting it.

 
[root@cloudera-node-1 ~]# /etc/init.d/hadoop-0.20-datanode status
datanode (pid  4866) is running...
[root@cloudera-node-1 ~]# /etc/init.d/hadoop-0.20-datanode restart
Stopping Hadoop datanode daemon (hadoop-datanode): stopping datanode
datanode is stopped                                        [  OK  ]
Starting Hadoop datanode daemon (hadoop-datanode): starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-cloudera-node-1.out
datanode (pid  8570) is running...                         [  OK  ]
[root@cloudera-node-1 ~]# vi /usr/lib/hadoop/logs/hadoop-hadoop-datanode-cloudera-node-1.log 
[root@cloudera-node-1 ~]# hadoop dfs -put a.txt hdfs://10.129.8.126/flume/srcdata20120203-014405668+0800.2438950215947540.00000019.tmp.12
put: Target hdfs://10.129.8.126/flume/srcdata20120203-014405668+0800.2438950215947540.00000019.tmp.12 already exists
[root@cloudera-node-1 ~]# hadoop dfs -put a.txt hdfs://10.129.8.126/flume/srcdata20120203-014405668+0800.2438950215947540.00000019.tmp.123
[root@cloudera-node-1 ~]#

OK, it works now.

If the namenode complains about safe mode:

 2012-02-03 01:42:17,467 [logicalNode collector-19] INFO rolling.RollSink: closing RollSink 'escapedCustomDfs("hdfs://10.129.8.126/flume/","srcdata%{rolltag}" )'
2012-02-03 01:42:17,467 [logicalNode collector-19] INFO rolling.RollSink: opening RollSink  'escapedCustomDfs("hdfs://10.129.8.126/flume/","srcdata%{rolltag}" )'
2012-02-03 01:42:17,468 [logicalNode collector-19] INFO debug.InsistentOpenDecorator: Opened MaskDecorator on try 0
2012-02-03 01:42:17,469 [pool-7-thread-1] INFO hdfs.EscapedCustomDfsSink: Opening hdfs://10.129.8.126/flume/srcdata20120203-014217467+0800.2438842015436540.00000019
2012-02-03 01:42:17,476 [logicalNode collector-19] INFO debug.InsistentAppendDecorator: append attempt 3 failed, backoff (8000ms): org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/flume/srcdata20120203-014217467+0800.2438842015436540.00000019.tmp. Name node is in safe mode.
The number of live datanodes 0 needs an additional 1 live datanodes to reach the minimum number 1. Safe mode will be turned off automatically.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1182)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1150)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:597)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:576)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

Run:
hadoop dfsadmin -safemode leave
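
You can check the current state before and after (standard dfsadmin subcommand):

hadoop dfsadmin -safemode get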

OK, once more from the top.

On 125:
flume node_nowatch -n collector

On 126:
flume node_nowatch

OK, done.

[root@flume-hadoop-node-1 log]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 2 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup          0 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023.tmp
[root@flume-hadoop-node-1 log]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 2 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
[root@flume-hadoop-node-1 log]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 2 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
[root@flume-hadoop-node-1 log]# hadoop fs -tail hdfs://10.129.8.126/flume/srcdata20120203-021531413+0800.2440835961446540.00000021
\t\t3881/udp\t\t\t# Data Acquisition and Control","timestamp":1328205987177,"pri":"INFO","nanos":100778763829457,"host":"cloudera-node-1","fields":{"AckTag":"20120203-020626329+0800.100777916519457.00000019","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u0000陋賂\u0010婁","rolltag":"20120203-021531413+0800.2440835961446540.00000021"}}
 {"body":"msdts1\t\t3882/tcp\t\t\t# DTS Service Port","timestamp":1328205987177,"pri":"INFO","nanos":100778763863457,"host":"cloudera-node-1","fields":{"AckTag":"20120203-020626329+0800.100777916519457.00000019","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u0000?隆w?","rolltag":"20120203-021531413+0800.2440835961446540.00000021"}}
 {"body":"msdts1\t\t3882/udp\t\t\t# DTS Service Port","timestamp":1328205987177,"pri":"INFO","nanos":100778763897457,"host":"cloudera-node-1","fields":{"AckTag":"20120203-020626329+0800.100777916519457.00000019","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u00005=?\u0002","rolltag":"20120203-021531413+0800.2440835961446540.00000021"}}

Adding a new Flume node
On 126:

flume node_nowatch -n agentAB

flume-master頁面上面添加配置
agentAB : text("/var/log/dmesg") | agentSink("10.129.8.125",35853);

OK, that works. Next, try the default configuration (no port given to agentSink):

flume node_nowatch -n agentABC
agentABC : text("/tmp/medcl") | agentSink("10.129.8.125");

At this point, node status shows:
agentABC agentABC flume-hadoop-node-1 OPENING Fri Feb 03 02:31:11 CST 2012 3 Fri Feb 03 02:32:49 CST 2012

The console reports an error:

 2012-02-03 02:31:14,823 [logicalNode agentABC-22] INFO connector.DirectDriver: Connector logicalNode agentABC-22 exited with error: /tmp/medcl (No such file or directory)
java.io.FileNotFoundException: /tmp/medcl (No such file or directory)
	at java.io.RandomAccessFile.open(Native Method)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
	at com.cloudera.flume.handlers.debug.TextFileSource.open(TextFileSource.java:75)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:87)
Exception in thread "logicalNode agentABC-22" java.lang.NullPointerException
	at com.cloudera.flume.handlers.debug.TextFileSource.close(TextFileSource.java:69)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.ensureClosed(DirectDriver.java:183)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.errorCleanup(DirectDriver.java:204)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:92)

Create the file:
echo "hello world" > /tmp/medcl

It keeps failing and cannot recover on its own; the only fix is to restart the node.

 [root@cloudera-node-1 log]# hadoop dfs -tail hdfs://10.129.8.126/flume/srcdata20120203-023644240+0800.2442108787815540.00000021
{"body":"hello world","timestamp":1328207806233,"pri":"INFO","nanos":2442110780978540,"host":"flume-hadoop-node-1","fields":{"AckTag":"20120203-023646196+0800.2442110743929540.00000022","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u0000\rJ\u0011?","rolltag":"20120203-023644240+0800.2442108787815540.00000021"}}
 
flume node_nowatch -n agentABCD
agentABCD : text("/tmp/medcl") | agentSink("10.129.8.125");

The text source reads the file only once; later changes to the file are not processed.

tail, on the other hand, does keep watching:

flume node_nowatch -n collector   # if the collector was shut down earlier, it has to be restarted; its config is given above
flume node_nowatch -n agentABCDE
agentABCDE : tail("/tmp/medcl") | agentSink("10.129.8.125");

The collector writes to Hadoop every 30 seconds, and each write creates a new HDFS file:

[root@flume-hadoop-node-1 tmp]# echo "happy new year">>medcl
[root@flume-hadoop-node-1 tmp]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 7 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
-rw-r--r--   3 root supergroup     197377 2012-02-03 02:25 /flume/srcdata20120203-022338788+0800.2441323335749540.00000021
-rw-r--r--   3 root supergroup        318 2012-02-03 02:38 /flume/srcdata20120203-023644240+0800.2442108787815540.00000021
-rw-r--r--   3 root supergroup     761621 2012-02-07 19:00 /flume/srcdata20120207-185754757+0800.2846579304755540.00000021
-rw-r--r--   3 root supergroup        336 2012-02-07 19:02 /flume/srcdata20120207-185954947+0800.2846699494856540.00000021
-rw-r--r--   3 root supergroup        329 2012-02-07 19:09 /flume/srcdata20120207-190658071+0800.2847122618653540.00000021
[root@flume-hadoop-node-1 tmp]# 
[root@flume-hadoop-node-1 tmp]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 8 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
-rw-r--r--   3 root supergroup     197377 2012-02-03 02:25 /flume/srcdata20120203-022338788+0800.2441323335749540.00000021
-rw-r--r--   3 root supergroup        318 2012-02-03 02:38 /flume/srcdata20120203-023644240+0800.2442108787815540.00000021
-rw-r--r--   3 root supergroup     761621 2012-02-07 19:00 /flume/srcdata20120207-185754757+0800.2846579304755540.00000021
-rw-r--r--   3 root supergroup        336 2012-02-07 19:02 /flume/srcdata20120207-185954947+0800.2846699494856540.00000021
-rw-r--r--   3 root supergroup        329 2012-02-07 19:09 /flume/srcdata20120207-190658071+0800.2847122618653540.00000021
-rw-r--r--   3 root supergroup        337 2012-02-07 19:12 /flume/srcdata20120207-190929343+0800.2847273890577540.00000021
[root@flume-hadoop-node-1 tmp]# hadoop fs -get hdfs://10.129.8.126/flume/srcdata20120207-190929343+0800.2847273890577540.00000021 /tmp/lo2

If the file content is replaced rather than appended to, the first record is lost. Watch out for this (a bug?):

[root@flume-hadoop-node-1 tmp]# echo "who is your daddy?">medcl
[root@flume-hadoop-node-1 tmp]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 8 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
-rw-r--r--   3 root supergroup     197377 2012-02-03 02:25 /flume/srcdata20120203-022338788+0800.2441323335749540.00000021
-rw-r--r--   3 root supergroup        318 2012-02-03 02:38 /flume/srcdata20120203-023644240+0800.2442108787815540.00000021
-rw-r--r--   3 root supergroup     761621 2012-02-07 19:00 /flume/srcdata20120207-185754757+0800.2846579304755540.00000021
-rw-r--r--   3 root supergroup        336 2012-02-07 19:02 /flume/srcdata20120207-185954947+0800.2846699494856540.00000021
-rw-r--r--   3 root supergroup        329 2012-02-07 19:09 /flume/srcdata20120207-190658071+0800.2847122618653540.00000021
-rw-r--r--   3 root supergroup        337 2012-02-07 19:12 /flume/srcdata20120207-190929343+0800.2847273890577540.00000021

Append another line:

[root@flume-hadoop-node-1 tmp]# echo "here is a new line">>medcl
[root@flume-hadoop-node-1 tmp]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 9 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
-rw-r--r--   3 root supergroup     197377 2012-02-03 02:25 /flume/srcdata20120203-022338788+0800.2441323335749540.00000021
-rw-r--r--   3 root supergroup        318 2012-02-03 02:38 /flume/srcdata20120203-023644240+0800.2442108787815540.00000021
-rw-r--r--   3 root supergroup     761621 2012-02-07 19:00 /flume/srcdata20120207-185754757+0800.2846579304755540.00000021
-rw-r--r--   3 root supergroup        336 2012-02-07 19:02 /flume/srcdata20120207-185954947+0800.2846699494856540.00000021
-rw-r--r--   3 root supergroup        329 2012-02-07 19:09 /flume/srcdata20120207-190658071+0800.2847122618653540.00000021
-rw-r--r--   3 root supergroup        337 2012-02-07 19:12 /flume/srcdata20120207-190929343+0800.2847273890577540.00000021
-rw-r--r--   3 root supergroup          0 2012-02-07 19:19 /flume/srcdata20120207-191702865+0800.2847727413000540.00000021.tmp
 
[root@flume-hadoop-node-1 tmp]# hadoop fs -ls hdfs://10.129.8.126/flume/
Found 9 items
-rw-r--r--   3 root supergroup   11829304 2012-02-03 02:17 /flume/srcdata20120203-021531413+0800.2440835961446540.00000021
-rw-r--r--   3 root supergroup    7080210 2012-02-03 02:18 /flume/srcdata20120203-021605232+0800.2440869780410540.00000023
-rw-r--r--   3 root supergroup     197377 2012-02-03 02:25 /flume/srcdata20120203-022338788+0800.2441323335749540.00000021
-rw-r--r--   3 root supergroup        318 2012-02-03 02:38 /flume/srcdata20120203-023644240+0800.2442108787815540.00000021
-rw-r--r--   3 root supergroup     761621 2012-02-07 19:00 /flume/srcdata20120207-185754757+0800.2846579304755540.00000021
-rw-r--r--   3 root supergroup        336 2012-02-07 19:02 /flume/srcdata20120207-185954947+0800.2846699494856540.00000021
-rw-r--r--   3 root supergroup        329 2012-02-07 19:09 /flume/srcdata20120207-190658071+0800.2847122618653540.00000021
-rw-r--r--   3 root supergroup        337 2012-02-07 19:12 /flume/srcdata20120207-190929343+0800.2847273890577540.00000021
-rw-r--r--   3 root supergroup        341 2012-02-07 19:19 /flume/srcdata20120207-191702865+0800.2847727413000540.00000021
[root@flume-hadoop-node-1 tmp]# hadoop fs -tail hdfs://10.129.8.126/flume/srcdata20120207-191702865+0800.2847727413000540.00000021
{"body":"here is a new line","timestamp":1328613446703,"pri":"INFO","nanos":2847751251273540,"host":"flume-hadoop-node-1","fields":{"AckTag":"20120207-191720960+0800.2847745508415540.00000025","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u0000/rN?","tailSrcFile":"medcl","rolltag":"20120207-191702865+0800.2847727413000540.00000021"}}

Sure enough, one record was lost.

OK. As mentioned earlier, Flume offers three working modes that trade off reliability and availability of the data:

1. End-to-end: acknowledged at both ends, with automatic retries on failure (how many retries, and what happens when they run out, still needs investigating)
agentE2ESink[("machine"[,port])]

2. Disk failover: on failure, write to local disk and check periodically; when the collector comes back, the work is redone automatically
agentDFOSink[("machine"[,port])]

3. Best effort: if the collector fails, the logs are simply dropped. Cold-blooded.
agentBESink[("machine"[,port])]

The agentSink used earlier is an alias for the first, end-to-end, sink and behaves identically.
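
For example, the earlier tail agent could be pointed at the collector with any of the three guarantees (a sketch; the node names here are made up for illustration):

agentE2E : tail("/tmp/medcl") | agentE2ESink("10.129.8.125",35853);
agentDFO : tail("/tmp/medcl") | agentDFOSink("10.129.8.125",35853);
agentBE  : tail("/tmp/medcl") | agentBESink("10.129.8.125",35853);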

Configuring multiple collectors

Multiple collectors increase throughput, since collection runs in parallel. As noted before, for reliability an agent whose collector has died must write to local disk and periodically retry the connection; meanwhile collection has stopped, and all the downstream log processing and analysis stalls with it, which is unacceptable. Multiple collectors solve this. Phew!

Moreover, with several collectors, if one dies the agent should fail over automatically. How is that configured?

With failover chains:

agentA : src | agentE2EChain("collectorA:35853","collectorB:35853");
agentB : src | agentE2EChain("collectorA:35853","collectorC:35853");
agentC : src | agentE2EChain("collectorB:35853","collectorA:35853");
agentD : src | agentE2EChain("collectorB:35853","collectorC:35853");
agentE : src | agentE2EChain("collectorC:35853","collectorA:35853");
agentF : src | agentE2EChain("collectorC:35853","collectorB:35853");
collectorA : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorB : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorC : collectorSource(35853) | collectorSink("hdfs://...","src");

In the configuration above, each chain lists two collectors: when the first fails, the agent automatically switches to the second.

Automatic failover chains work by using special source and sink names (not available with multiple masters).

For the source:
autoCollectorSource

For the sink, one of:
autoE2EChain, autoDFOChain, or autoBEChain

The configuration becomes:
agentA : src | autoE2EChain ;
agentB : src | autoE2EChain ;
agentC : src | autoE2EChain ;
agentD : src | autoE2EChain ;
agentE : src | autoE2EChain ;
agentF : src | autoE2EChain ;
collectorA : autoCollectorSource | collectorSink("hdfs://...", "src");
collectorB : autoCollectorSource | collectorSink("hdfs://...", "src");
collectorC : autoCollectorSource | collectorSink("hdfs://...", "src");

Logical Configurations

A physical node hosts some number of logical nodes; logical nodes in turn have logical sources and logical sinks, and flows are used to isolate and group nodes.

Logical nodes let one JVM instance host several of them, i.e. run multiple source and sink threads inside a single JVM.

Every logical node's name must be unique; it may not collide with any physical node name or host name either.

Defining logical nodes takes two steps, with a third to tear them down:

1. Define the node types:
agent1 : _source_ | autoBEChain ;
collector1 : autoCollectorSource | collectorSink("hdfs://....") ;

2. Map the logical nodes onto physical nodes:
map host1 agent1
map host2 collector1

3. Decommission a logical node:
decommission agent1

Let's try it.

On 125:

 1004  cd /tmp/
 1005  ls
 1006  rm -rif flume-*
 1007  /etc/init.d/flume-master restart
 1008  /etc/init.d/flume-node start

On 126:

/etc/init.d/flume-node start

On the Flume master page, set the config:

agent1 : tail("/tmp/medcl") | autoBEChain ;
collector1 : autoCollectorSource | collectorSink("hdfs://10.129.8.126/flume/","medcl") ;

Note the hostname-to-IP mapping:
cloudera-node-1: 10.129.8.126
flume-hadoop-node-1: 10.129.8.125

raw command:

command: map
arguments: 10.129.8.125 agent1
# i.e. flume-hadoop-node-1 agent1

command: map
arguments: 10.129.8.126 collector1
# i.e. cloudera-node-1 collector1

Try decommissioning:

map 10.129.8.125 agent2

decommission agent2

(Mind the whitespace: there must be no stray spaces around the decommission command.)

Alternatively, move a logical node with unmap and map:

unmap host2 collector1
map host3 collector1


A packet capture shows the underlying request:
curl -XPOST http://10.129.8.125:35871/mastersubmit.jsp -d'cmd=unmap&args=10.129.8.125+agent1'
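
Presumably map goes through the same endpoint with cmd=map (an untested sketch, mirroring the capture above):

curl -XPOST http://10.129.8.125:35871/mastersubmit.jsp -d'cmd=map&args=10.129.8.125+agent1'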

Note: logical sources and logical sinks do not work with multiple masters.

Logical sources and logical sinks let you configure a flow before knowing the concrete physical nodes. Flume has a translation mechanism that automatically replaces logical node names with the actual host names and ports; in fact, autoSinks and auto-chains are implemented on top of it.

Flow isolation (note: also unavailable with multiple masters, sadly)

Suppose you want to collect several kinds of data from one physical machine and store them in different places. One approach is to tag all the data, ship it through a single pipe, and split it apart again in post-processing.

The other is to keep the different kinds of data isolated from each other throughout transport, so no post-processing is needed.

Flume supports both, with very low latency, by introducing the concept of a flow to group nodes. Configure it on the Flume master page, under raw commands:

Command: config
Arguments: [logical node] [flow name] fooSrc autoBEChain

A real example:

config AgentC myflow tail("/tmp/medcl") autoBEChain
config CollectorC myflow autoCollectorSource collectorSink("hdfs://10.129.8.126/flume/","medcl_flow")
 
map 10.129.8.125 AgentC 
map 10.129.8.126 CollectorC

!!!!

------------
1. Problem:
fail( "logical node not mapped to physical node yet" )

Fixes:
1. Map using the hostname: whatever name node status displays is the name to pass to map.
2. Map the logical node first, then update the config.

A configuration that works:

map cloudera-node-1  agent1
map flume-hadoop-node-1 collector1
 
agent1 : tail("/tmp/medcl") | agentSink("10.129.8.125",35853);
collector1 : collectorSource(35853) | collectorSink("hdfs://10.129.8.126/flume/","medcl");

!!!!

Multi-master configuration

Masters synchronize with each other automatically; if one master dies, the nodes under it automatically move to the other masters.

The Flume master has two working modes: standalone and distributed. How are they configured?

<property>
<name>flume.master.servers</name>
<value>hostA,hostB</value>
</property>

A single host means standalone mode; multiple hosts mean distributed mode. (In distributed mode, every master must have an identical configuration file.)
In addition, each master must be given a distinct serverid, like this:

MasterA:
<property>
<name>flume.master.serverid</name>
<value>0</value>
</property>
MasterB:
<property>
<name>flume.master.serverid</name>
<value>1</value>
</property>

(The number just matches the host's index in the server list configured above.)
In a distributed deployment, at least 3 servers are needed to tolerate one failure; to tolerate two simultaneous failures you need at least 5. If the surviving masters are not a majority of the total, the whole Flume master cluster blocks and configuration can be neither read nor written.

The place where the Flume master keeps its configuration is called the configuration store. It is pluggable, and two implementations ship with Flume: a memory-backed one (MBCS) and a ZooKeeper-backed one (ZBCS).
ZBCS is the default. Flume embeds ZooKeeper, but can also be configured against an existing ZooKeeper cluster:

<property>
<name>flume.master.store</name>
<value>zookeeper</value>
</property>

(Valid values: zookeeper or memory.)

ZBCS configuration

flume.master.zk.logdir: where configuration data, update logs, and failure information are stored
flume.master.zk.server.quorum.port: default 3182; the port the local ZooKeeper server listens on
flume.master.zk.server.election.port: default 3183; the port ZooKeeper servers use to find each other
flume.master.zk.client.port: default 3181; the port used to talk to the ZooKeeper servers
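
For example, to relocate the ZBCS log directory (a sketch; the path is arbitrary):

<property>
  <name>flume.master.zk.logdir</name>
  <value>/var/lib/flume/zk-logs</value>
</property>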

Gossip between Flume masters is configured with:

<property>
<name>flume.master.gossip.port</name>
<value>57890</value>
</property>

In distributed mode, the Flume node configuration also changes: instead of connecting to a single master, nodes connect to all of them:

<property>
<name>flume.master.servers</name>
<value>masterA,masterB,masterC</value>
</property>

Flume nodes heartbeat periodically against a master port. As soon as the connection to a master fails, the node automatically switches at random to one of the remaining reachable masters. (The heartbeat port is configured on the masters via flume.master.heartbeat.port.)
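
A sketch of that heartbeat port override (the value shown is only an example):

<property>
  <name>flume.master.heartbeat.port</name>
  <value>35872</value>
</property>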

To use an external ZooKeeper instead, configure conf/flume-site.xml as follows:

<property>
  <name>flume.master.zk.use.external</name>
  <value>true</value>
</property>
 
<property>
  <name>flume.master.zk.servers</name>
  <value>zkServerA:2181,zkServerB:2181,zkServerC:2181</value>
</property>

Integrating Flume with data sources
Flume's real strength is flexibility: it supports all kinds of data sources — structured, unstructured, semi-structured, and so on.
There are three approaches:
pushing, polling, and embedding (compiling Flume components into your own application).

Push sources:
syslogTcp, syslogUdp: the syslog and syslog-ng log protocols
scribe: the protocol of the Scribe logging system

Polling sources:
tail, multitail: watch files for appended content
exec: good for pulling data out of existing systems (see the sketch below)
poller: collects information from the Flume node itself
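
The exec source can be tried directly with dump, like the tail examples earlier (a sketch; see the source catalog for the exact variants and options):

flume dump 'exec("uptime")'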

The Flume event data model
Six main fields:
Unix timestamp
Nanosecond timestamp
Priority
Source host
Body
Metadata table with an arbitrary number of attribute-value pairs

Every event carries all of these fields, though the body may have zero length and the metadata table may be empty.

priority: one of TRACE, DEBUG, INFO, WARN, ERROR, or FATAL
body: raw bytes, at most 32KB by default; anything beyond is truncated. The limit is set with flume.event.max.size.bytes.

Event fields can be used to customize output locations:
collectorSink("hdfs://namenode/flume/webdata/%H00/", "%{host}-")
%H is the hour from the event's timestamp field; host is the host name from the fields.

Quick reference:

%{host}       host
%{nanos}      nanos
%{priority}   priority string
%{body}       body
%%            a % character
%t            Unix time in millis

Time escapes are special: they are used directly, with no braces:
collectorSink("hdfs://namenode/flume/webdata/%Y-%m-%d/%H00/", "web-")

Quick reference:

%a   locale's short weekday name (Mon, Tue, …)
%A   locale's full weekday name (Monday, Tuesday, …)
%b   locale's short month name (Jan, Feb, …)
%B   locale's long month name (January, February, …)
%c   locale's date and time (Thu Mar 3 23:05:25 2005)
%d   day of month (01)
%D   date; same as %m/%d/%y
%H   hour (00..23)
%I   hour (01..12)
%j   day of year (001..366)
%k   hour ( 0..23)
%l   hour ( 1..12)
%m   month (01..12)
%M   minute (00..59)
%P   locale's equivalent of am or pm
%s   seconds since 1970-01-01 00:00:00 UTC
%S   second (00..60)
%y   last two digits of year (00..99)
%Y   year (2010)
%z   +hhmm numeric timezone (for example, -0400)

Output file formats

The format can be set in two ways: one is a default in flume-site.xml, the other is decided by the specific sink.

1. In flume-site.xml:
flume.collector.output.format

Format quick reference:

avro       Avro native file format. Default currently is uncompressed.
avrodata   Binary encoded data written in the Avro binary format.
avrojson   JSON encoded data generated by Avro.
default    A debugging format.
json       JSON encoded data.
log4j      A log4j pattern similar to that used by the CDH output pattern.
raw        Event body only. Most similar to copying a file, but does not preserve any uniquifying metadata like host/timestamp/nanos.
syslog     A syslog-like text output format.
seqfile    The binary Hadoop SequenceFile format, with WritableEventKeys keys and WritableEvent values.

2. Decided per sink:

collectorSink( "dfsdir","prefix"[, rollmillis[, format]])
text("file"[,format])
formatDfs("hdfs://nn/file" [, format])
escapedFormatDfs("hdfs://nn/file" [, format])

A compressed seqfile:
formatDfs("hdfs://nn/dir/file", seqfile("bzip2"))

Handling HDFS's many-small-files and high-latency problems
Flume offers two strategies:
1. Merge the small files into larger ones
2. Use CombinedFileInputFormat

<property>
    <name>flume.collector.dfs.compress.codec</name>
    <value>None</value>
    <description>Writes formatted data compressed in specified codec to
    dfs. Value is None, GzipCodec, DefaultCodec (deflate), BZip2Codec,
    or any other Codec Hadoop is aware of </description>
  </property>

seqfile and avrodata also support internal compression; to be looked into further.

The dataflow definition language

Fan out, writing to every sink:
[ console, collectorSink ]

Failover: if the current sink fails, move on and try the next candidate sink:

< logicalSink("collector1") ? logicalSink("collector2") >

Example configuration:

agent1 : source | < logicalSink("collector1") ? logicalSink("collector2") > ;

Roll sink: at a fixed interval, close the current sink instance and open a new one, writing a new, separate file each time:
roll(millis) sink

Example configuration:

roll(1000) [ console, escapedCustomDfs("hdfs://namenode/flume/file-%{rolltag}") ]

Sink decorators
Fan-out and failover affect where messages go but do not modify the data. To filter or transform the data, use a sink decorator.

Sink decorators can do a great deal: add attributes to the stream, guarantee reliability with write-ahead logging, improve network throughput with batching and compression, take samples, or even do lightweight analysis:

flumenode: source | intervalSampler(10) sink;
flumenode: source | batch(100) sink;
flumenode: source | batch(100) gzip sink;
collector(15000) { escapedCustomDfs("xxx","yyy-%{rolltag}") }
collector(15000) { [ escapedCustomDfs("xxx","yyy-%{rolltag}"), hbase("aaa", "bbb-%{rolltag}"), elasticSearch("eeee","ffff") ] }
(the last writes to three sinks at once; some may be durable and some transient, and success is acknowledged only after all of them succeed)

node1 : tail("foo") | ackedWriteAhead batch(100) gzip lazyOpen stubbornAppend logicalSink("bar");
(write-ahead log, batches of 100, gzip compression)

Metadata can be extracted with regular expressions and filtered with a select-like syntax.

thriftSink and thriftSource

Extensions and plugins

http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_semantics_of_flume_extensions

The appendix is excellent:

http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_flume_source_catalog

 
map cloudera-node-1 agent2
agent2 : syslogTcp(2012) | agentSink("10.129.8.125",35853);
 
flume node_nowatch -n medcl
agent2 : syslogTcp(2012) | agentSink("10.129.8.125",35853);

Testing syslog messages

1. Connect with nc:
nc 10.129.8.126 2012

2. Type a syslog message, following the format described at http://blog.csdn.net/xcj0535/article/details/4158624:

<165>Aug 24 05:34:00 CST 1987 mymachine myproc[10]: %% It's
time to make the do-nuts. %% Ingredients: Mix=OK, Jelly=OK #
Devices: Mixer=OK, Jelly_Injector=OK, Frier=OK # Transport:
Conveyer1=OK, Conveyer2=OK # %%

<1> medcl is back
 
 
The syslog format

An example syslog message:
<30>Oct 9 22:33:20 hlfedora auditd[1787]: The audit daemon is exiting.
Here "<30>" is the PRI part, "Oct 9 22:33:20 hlfedora" is the HEADER part, and "auditd[1787]: The audit daemon is exiting." is the MSG part.
 
 
 
[root@cloudera-node-1 ~]# hadoop fs -cat /flume/medcl20120209-221925655+0800.3031470203471540.00000026
{"body":"medcl is back","timestamp":1328797314800,"pri":"INFO","nanos":692106386851457,"host":"cloudera-node-1","fields":{"AckTag":"20120209-222148285+0800.692099872659457.00000037","syslogfacility":"\u0001","AckType":"msg","AckChecksum":"\u0000\u0000\u0000\u0000qu錨茫","syslogseverity":"\u0003","rolltag":"20120209-221925655+0800.3031470203471540.00000026"}}
 
 
The files uploaded to HDFS carry far too much wrapping. Switch to raw output:
 
collector2 : syslogTcp( 2013)	 | collectorSink( "hdfs://10.129.8.126/flume/", "medcl_raw",3000,raw  );
 
 
C:\Windows\system32>nc 10.129.8.125 2013
<1> i will be back
<1> i will be back2
<1> i will be back3
<1> i will be back4
 
[root@cloudera-node-1 ~]# hadoop fs -cat /flume/medcl_raw20120209-235000888+0800.3036905435701540.00000069
 i will be back
 i will be back2
 i will be back3
 i will be back4

A .NET agent with 25 threads brought it to its knees. (Later testing also showed frequent, unexplained socket disconnects: the server-side socket simply dies and Flume reports an error.)
2012-02-10 21:29:44,154 ERROR com.cloudera.flume.core.connector.DirectDriver: Exiting driver logicalNode collector2-20 in error state SyslogTcpSourceThreads | Collector because null

syslogTcp is unstable, so I switched the source to Thrift RPC without hesitation; testing confirmed it is rock solid.

 
thrift-0.6.0.exe -r -gen csharp flume.thrift
 
2012-02-13 23:36:30,574 [pool-4-thread-1] ERROR server.TSaneThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:213)
	at com.cloudera.flume.handlers.thrift.ThriftFlumeEventServer$Processor.process(ThriftFlumeEventServer.java:224)
	at org.apache.thrift.server.TSaneThreadPoolServer$WorkerProcess.run(TSaneThreadPoolServer.java:280)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)

This exception is likely caused by the server and client using mismatched Thrift transports, e.g. framed on one side and buffered on the other.

 
collector3 : thriftSource( 2014 )| collectorSink( "hdfs://10.129.8.126/flume/", "medcl_thrift",60000,raw  );
collector4 : thriftSource( 2015 )| collectorSink( "hdfs://10.129.8.126/flume/", "medcl_thrift",60000,raw  );
collector5 : thriftSource( 2016 )| collectorSink( "hdfs://10.129.8.126/flume/", "medcl_thrift",60000,raw  );
collector6 : thriftSource( 2017 )| collectorSink( "hdfs://10.129.8.126/flume/", "medcl_thrift",60000,raw  );
collector7 : thriftSource( 2018 )| collectorSink( "hdfs://10.129.8.126/flume/", "medcl_thrift2",30000);
 
map cloudera-node-1	 collector7

Edit flume-site.xml to add compression and a default roll interval:

<property>
    <name>flume.collector.dfs.compress.gzip</name>
    <value>true</value>
    <description>Writes compressed output in gzip format to dfs. value is
     boolean type, i.e. true/false</description>
  </property>
 
<property>
    <name>flume.collector.roll.millis</name>
    <value>60000</value>
    <description>The time (in milliseconds)
    between when hdfs files are closed and a new file is opened
    (rolled).
    </description>
  </property>

Testing file name templates

collector8 : thriftSource( 2019 )| collectorSink("hdfs://10.129.8.126/flume/app/%{host}/%Y-%m-%d/", "%H%M%S-test1-%t",5000);
map cloudera-node-1	 collector8
 
[root@flume-hadoop-node-1 ~]# hadoop fs -lsr hdfs://10.129.8.126/flume/app
drwxr-xr-x   - flume supergroup          0 2012-02-17 00:49 /flume/app/MEDCL-THINK
drwxr-xr-x   - flume supergroup          0 2012-02-17 00:49 /flume/app/MEDCL-THINK/4113221-02-12
-rw-r--r--   1 flume supergroup        219 2012-02-17 00:49 /flume/app/MEDCL-THINK/4113221-02-12/203942-test1-12973855419598268720120217-004946767+0800.1305778353827457.00006891

Update:

 
collector8 : thriftSource( 2019 )| collectorSink("hdfs://10.129.8.126/flume/%{catalog}/2012-%m/%d/", "%a-%{host}-",5000,raw());

The result:

/flume/FileTemplateRaw/2012-11/19/Fri-MEDCL-THINK-20120217-013005302+0800.1308196889416457.00007109
collector8 : thriftSource( 2019 )| collectorSink("hdfs://10.129.8.126/flume/%{catalog}/2012", "",5000,raw());
