For this project we wanted Flume to deliver data directly into a Hive table instead of onto HDFS, using the Hive sink; the Flume version is 1.9.0.
Starting the agent initially hit one error after another:
NoClassDefFoundError: org/apache/hadoop/hive/ql/session/SessionState
NoClassDefFoundError: org/apache/hadoop/hive/cli/CliSessionState
NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
NoClassDefFoundError: org/apache/hadoop/conf/Configuration
NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer
java.lang.ClassNotFoundException: com.esotericsoftware.kryo.Serializer
NoClassDefFoundError: org/antlr/runtime/RecognitionException
Fix:
Copy all of the relevant jars over in one go.
For example, the jar directory on CDH is:
/data/cloudera/parcels/CDH-5.11.2-1.cdh5.11.2.p0.4/jars
From inside that directory:
scp hive-* [email protected]:/usr/local/flume/lib
scp hadoop-* [email protected]:/usr/local/flume/lib
scp antlr-* [email protected]:/usr/local/flume/lib
scp kryo-2.22.jar [email protected]:/usr/local/flume/lib
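As a sanity check after copying, you can verify that each jar family named in the errors above actually landed in Flume's lib directory. This is a hedged sketch: the helper function is mine, not part of the original setup, and the example path is the one used in this post.

```shell
# Check that each required jar family is present in a lib directory.
check_jars() {
  libdir=$1
  missing=""
  for prefix in hive- hadoop- antlr- kryo-; do
    # any jar starting with this prefix counts as present
    ls "${libdir}/${prefix}"*.jar >/dev/null 2>&1 || missing="${missing} ${prefix}*"
  done
  if [ -z "${missing}" ]; then
    echo "all jar families present"
  else
    echo "missing:${missing}"
  fi
}

# e.g.: check_jars /usr/local/flume/lib
```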
Then set up the Flume configuration file:
# example.conf: A single-node Flume configuration
# Name the components on this agent
video_hive.sources = r1
video_hive.sinks = k1
video_hive.channels = c1
# Describe/configure the source
video_hive.sources.r1.type = netcat
video_hive.sources.r1.bind = localhost
video_hive.sources.r1.port = 44444
# Describe the sink
video_hive.sinks.k1.type = hive
video_hive.sinks.k1.channel = c1
video_hive.sinks.k1.hive.metastore = thrift://dev07.hadoop.openpf:9083
#video_hive.sinks.k1.hive.metastore = thrift://172.28.23.21:9083
video_hive.sinks.k1.hive.database = recommend_video
video_hive.sinks.k1.hive.table = video_test
#video_hive.sinks.k1.hive.table = user_video_action_log
video_hive.sinks.k1.hive.partition = %Y-%m-%d
#video_hive.sinks.k1.autoCreatePartitions = false
video_hive.sinks.k1.useLocalTimeStamp = true
video_hive.sinks.k1.batchSize = 1500
#video_hive.sinks.k1.round = true
#video_hive.sinks.k1.roundValue = 10
#video_hive.sinks.k1.roundUnit = minute
video_hive.sinks.k1.serializer = DELIMITED
video_hive.sinks.k1.serializer.delimiter = ","
video_hive.sinks.k1.serializer.serdeSeparator = ','
video_hive.sinks.k1.serializer.fieldnames = timestamp,userid,videoid,totaltime,playtime,hits,rate,praise
# Use a channel which buffers events in memory
video_hive.channels.c1.type = memory
video_hive.channels.c1.capacity = 2000
video_hive.channels.c1.transactionCapacity = 1500
# Bind the source and sink to the channel
video_hive.sources.r1.channels = c1
video_hive.sinks.k1.channel = c1
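With the agent running, the netcat source above can be smoke-tested with a hand-built event. The sample values below are made up; what matters is that the field count (8) lines up with serializer.fieldnames. The nc line is commented out because it needs a live agent listening on 44444.

```shell
# A made-up sample event; field order must match serializer.fieldnames:
# timestamp,userid,videoid,totaltime,playtime,hits,rate,praise
event="1558000000,u001,v123,300,120,1,5,1"
fields=$(echo "${event}" | awk -F',' '{print NF}')
echo "field count: ${fields}"   # should be 8
# echo "${event}" | nc localhost 44444
```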
Note the requirements on the table DDL (CLUSTERED BY bucketing, transactional table property, ORC storage format):
create table if not exists video_test (
  `timestamp` string,
  `userid` string,
  `videoid` string,
  `totaltime` string,
  `playtime` string,
  `rate` string,
  `hits` string,
  `praise` string
) COMMENT 'user video log'
partitioned by (`date` string)
clustered by (userid) into 5 buckets
row format delimited fields terminated by ','
stored as orc
tblproperties("transactional"='true');
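For a transactional table to work at all, ACID support also has to be switched on on the Hive side. These are the standard properties to check in hive-site.xml; the values shown are the usual minimal ones, not taken from this cluster:

```properties
hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1
```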
Last resort:
If it still doesn't work after all of the above, copy hdfs-site.xml, hive-conf.properties, hive-site.xml, and hive-env.sh into Flume's conf directory, e.g. /usr/local/flume/conf.
Done!
Here is the script to start Flume in the background:
#!/bin/sh
FLUME_HOME=/usr/local/flume
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file ${FLUME_HOME}/conf/flume-hive.properties --name video_hive -Dflume.root.logger=INFO,LOGFILE &
Key issue with the Hive sink:
The Hive table Flume writes to must be stored as ORC. After log data was loaded into the table, queries with a WHERE clause failed (queries without a filter worked fine), and I found no configuration option for any other storage format, so in the end I gave up on the Hive sink and switched to HDFS.
Screenshot of the query error:
Pointing Flume directly at a TEXTFILE-format Hive table also fails:
Overall, the Flume-to-Hive integration does not feel mature yet; there are all kinds of small problems.
Discussion and corrections welcome!
------------------------------------------------------------------------------------------------------------------------------------------------------------
Flume to HDFS:
# example.conf: A single-node Flume configuration
# Name the components on this agent
video_hdfs.sources = s1
video_hdfs.sinks = k1
video_hdfs.channels = c1
# Describe/configure the source
##netcat
#video_hdfs.sources.s1.type = netcat
#video_hdfs.sources.s1.bind = localhost
#video_hdfs.sources.s1.port = 44444
##exec
#video_hdfs.sources = s1
#video_hdfs.sources.s1.type = exec
#video_hdfs.sources.s1.command = tail -F /home/recommend/recom-video/logs/user-video.log
#video_hdfs.sources.s1.channels = c1
##TAILDIR
video_hdfs.sources.s1.type = TAILDIR
# where the taildir position metadata is kept
video_hdfs.sources.s1.positionFile = /usr/local/flume/conf/taildir_position.json
# the file(s) to monitor
video_hdfs.sources.s1.filegroups = f1
video_hdfs.sources.s1.filegroups.f1 = /home/recommend/recom-video/logs/user-video.log
video_hdfs.sources.s1.fileHeader = true
# Describe the sink
video_hdfs.sinks.k1.type = hdfs
video_hdfs.sinks.k1.channel = c1
video_hdfs.sinks.k1.hdfs.path = hdfs://nameservice1/user/hive/warehouse/recommend_video.db/video_test/dayid=%Y%m%d
video_hdfs.sinks.k1.hdfs.fileType = DataStream
video_hdfs.sinks.k1.hdfs.writeFormat=TEXT
video_hdfs.sinks.k1.hdfs.filePrefix = events-
video_hdfs.sinks.k1.hdfs.fileSuffix = .log
video_hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
video_hdfs.sinks.k1.hdfs.round = true
video_hdfs.sinks.k1.hdfs.roundValue = 1
video_hdfs.sinks.k1.hdfs.roundUnit = hour
# Use a channel which buffers events in memory
video_hdfs.channels.c1.type = memory
video_hdfs.channels.c1.capacity = 20000
video_hdfs.channels.c1.transactionCapacity = 15000
# Bind the source and sink to the channel
video_hdfs.sources.s1.channels = c1
video_hdfs.sinks.k1.channel = c1
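The %Y%m%d escape in hdfs.path is expanded per event; the sketch below builds today's partition path the same way so you can list it on the cluster. The hdfs call is commented out since it needs cluster access.

```shell
# Build today's partition path as the sink's %Y%m%d escape would:
dayid=$(date +%Y%m%d)
path="hdfs://nameservice1/user/hive/warehouse/recommend_video.db/video_test/dayid=${dayid}"
echo "${path}"
# hdfs dfs -ls "${path}"
```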
This writes the log files directly into the Hive table's partition directory on HDFS. However, Hive queries don't see the data right away: in testing, it became queryable only after the corresponding partition was created manually.
For this table the default TEXTFILE format is fine:
create table if not exists video_test (
  `timestamp` string,
  userid string,
  videoid string,
  totaltime string,
  playtime string,
  rate string,
  hits string,
  praise string
) COMMENT 'user video log'
partitioned by (dayid string)
row format delimited fields terminated by ',';
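Flume writes its files straight into dayid=... directories, but the metastore only sees partitions that have been registered. Assuming the directory names follow that dayid=YYYYMMDD convention, MSCK can register everything already on disk in one shot; the explicit date in the second statement is just an example value:

```sql
USE recommend_video;
-- pick up every dayid=... directory Flume has already written:
MSCK REPAIR TABLE video_test;
-- or register a single day by hand (example date):
ALTER TABLE video_test ADD IF NOT EXISTS PARTITION (dayid='20190520');
```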
More to come.
To keep the Hive table in sync with the incoming logs, its partitions need to be created on a schedule. First, the script:
create_partitions_video_log.sh
#!/bin/bash
#start
now=`date +%Y%m%d%H%M%S`
echo "now: "${now}
echo "START... NOW DATE:" $(date +"%Y-%m-%d %H:%M:%S")
# compute tomorrow's date
#todayday=`date -d now +%Y%m%d`
tomorrowday=`date -d "+1 day" +%Y%m%d`
echo "the date is: ${tomorrowday}..."
#echo "the date is: ${todayday}..."
#
#hive -e "USE recommend_video; ALTER TABLE t_di_user_action_log_id ADD IF NOT EXISTS PARTITION (dayid='${tomorrowday}');"
/usr/bin/hive -e "USE recommend_video; ALTER TABLE video_test ADD IF NOT EXISTS PARTITION (dayid='${tomorrowday}');"
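To confirm the script did its job, list the registered partitions; tomorrow's dayid should appear in the output:

```sql
USE recommend_video;
SHOW PARTITIONS video_test;
```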
Set up the crontab entry on the server as follows:
06 16 * * * /bin/bash /root/lpz/create_partitions_video_log.sh >> /root/lpz/video_log.log 2>&1
Along the way you may hit various errors (they can differ from server to server):
Unable to determine Hadoop version information.
'hadoop version' returned:
Error: JAVA_HOME is not set and could not be found.
Fix:
Add to /etc/profile: export HADOOP_VERSION=2.6.0-cdh5.11.2
Add to /usr/bin/hive: export JAVA_HOME=/data/jdk1.8.0_171
It isn't obvious why these have to be added again when /etc/profile already sets them; the likely reason is that cron runs jobs in a minimal, non-login shell, so /etc/profile is never sourced and its exports are invisible to the script.
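Because cron jobs run with a minimal environment that never sources /etc/profile, a cleaner fix than patching /usr/bin/hive is to export the environment at the top of the cron-invoked script itself. A sketch, reusing the paths from this post (adjust to your hosts):

```shell
#!/bin/bash
# Set the environment inside the cron-invoked script itself, since
# cron's non-login shell does not source /etc/profile.
# (Paths taken from the examples above; adjust to your hosts.)
export JAVA_HOME=/data/jdk1.8.0_171
export HADOOP_VERSION=2.6.0-cdh5.11.2
export PATH="${JAVA_HOME}/bin:${PATH}"
echo "JAVA_HOME=${JAVA_HOME}"
```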