HCE is short for Hadoop C++ Extension.
It is said to run more than 20% faster than stock Hadoop; I plan to benchmark it with an inverted-index job in a few days, tentatively on three nodes with 16 CPU cores each.
It took a day and a half to learn Hadoop and HCE deployment; I successfully set up pseudo-distributed HCE on CentOS 5.4, submitted my own compiled MapReduce program (wordcount), and got correct results.
Configuration steps and the problems encountered:
After downloading the HCE source, the build failed with the following errors:
1. Redundant name qualification: HCE::Compressor. Fix: remove the HCE:: qualifier in the code.
Code location: src/c++/hce/impl/Compressor
2. Undefined symbol: htons. Fix: change the headers being included. Do not use the system-specific headers under linux/; comment out
#include <linux/in.h>
#include <linux/in6.h>
and add #include <netinet/in.h>
At link time you may hit a "cannot find -lncurses" error.
Install ncurses-devel; on CentOS it is available through yum.
A successful build leaves its output under the build directory.
Next comes configuration and startup:
Edit core-site.xml, mapred-site.xml, and hdfs-site.xml under conf/.
These mainly set the IP address and port of each service; the Hadoop daemons will listen on the configured addresses.
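For a pseudo-distributed setup this comes down to a handful of properties; a minimal sketch (the host names and ports below are placeholders, adjust to your nodes):

```xml
<!-- conf/core-site.xml: where the NameNode listens -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single replica is enough for pseudo-distributed mode -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```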
During startup it is common for some daemon to fail to come up, and the possible causes are many. A tedious but reliable approach is to start the services one at a time, in order.
First format HDFS: bin/hadoop namenode -format
Then start the daemons in order. Hadoop has four main daemons: namenode, datanode, jobtracker, tasktracker.
Start them in sequence:
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
Watch the files under logs/ as each daemon starts to confirm it came up successfully.
Once everything is running, use the bin/hadoop fs family of commands to create the input/output directories (input/output) and upload the input file to HDFS.
Now it is time to write our C++ MapReduce program, wordcount. The code is as follows:
#include <ctype.h>   // isspace
#include <stdio.h>   // sscanf, snprintf
#include <stdint.h>  // int64_t
#include "hadoop/Hce.hh"
class WordCountMap: public HCE::Mapper {
public:
  HCE::TaskContext::Counter* inputWords;

  int64_t setup() {
    inputWords = getContext()->getCounter("WordCount", "Input Words");
    return 0;
  }

  int64_t map(HCE::MapInput &input) {
    int64_t size = 0;
    const void* value = input.value(size);
    if ((size > 0) && (NULL != value)) {
      char* text = (char*)value;
      const int n = (int)size;
      for (int i = 0; i < n;) {
        // Skip past leading whitespace
        while ((i < n) && isspace(text[i])) i++;
        // Find word end
        int start = i;
        while ((i < n) && !isspace(text[i])) i++;
        if (start < i) {
          emit(text + start, i - start, "1", 1);
          getContext()->incrementCounter(inputWords, 1);
        }
      }
    }
    return 0;
  }

  int64_t cleanup() {
    return 0;
  }
};
const int INT64_MAXLEN = 25;

// Parse a decimal string into an int64_t; returns 0 if parsing fails.
int64_t toInt64(const char *val) {
  int64_t result;
  char trash;
  int num = sscanf(val, "%ld%c", &result, &trash);
  return (num >= 1) ? result : 0;
}
class WordCountReduce: public HCE::Reducer {
public:
  HCE::TaskContext::Counter* outputWords;

  int64_t setup() {
    outputWords = getContext()->getCounter("WordCount", "Output Words");
    return 0;
  }

  int64_t reduce(HCE::ReduceInput &input) {
    int64_t keyLength;
    const void* key = input.key(keyLength);
    int64_t sum = 0;
    while (input.nextValue()) {
      int64_t valueLength;
      const void* value = input.value(valueLength);
      sum += toInt64((const char*)value);
    }
    char str[INT64_MAXLEN];
    int str_len = snprintf(str, INT64_MAXLEN, "%ld", sum);
    getContext()->incrementCounter(outputWords, 1);
    emit(key, keyLength, str, str_len);
    return 0;  // the original listing was missing this return
  }

  int64_t cleanup() {
    return 0;
  }
};
int main(int argc, char *argv[]) {
  return HCE::runTask(
    // TemplateFactory sequence is Mapper, Reducer, Partitioner,
    // Combiner, Committer, RecordReader, RecordWriter
    HCE::TemplateFactory<WordCountMap, WordCountReduce,
                         void, void, void, void, void>()
  );
}
The Makefile is as follows:
HADOOP_HOME = ../hadoop-0.20.3/build
JAVA_HOME = ../java6

CXX = g++
RM = rm -f

INCLUDEDIR = -I${HADOOP_HOME}/c++/Linux-amd64-64/include
LIBDIR = -L${HADOOP_HOME}/c++/Linux-amd64-64/lib \
         -L${JAVA_HOME}/jre/lib/amd64/server

CXXFLAGS = ${INCLUDEDIR} -g -Wextra -Werror \
           -Wno-unused-parameter -Wformat \
           -Wconversion -Wdeprecated
LDLIBS = ${LIBDIR} -lhce -lhdfs -ljvm
all : wordcount-demo

wordcount-demo : wordcount-demo.o
	$(CXX) -o $@ $^ $(LDLIBS) $(CXXFLAGS)

clean:
	$(RM) *.o wordcount-demo
Once the build succeeds, you can submit the HCE job:
bin/hadoop hce -input /input/test -output /output/out1 -program wordcount-demo -file wordcount-demo -numReduceTasks 1
The input file input/test used here contains:
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
After submitting the job you may hit the error: job not successful.
The logs contain the following:
stderr logs:
..........
HCE_FATAL 08-10 12:13:51 [/home/shengeng/hce/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/MapRed/Hce.cc][176][runTask] error when parsing UgiInfo at /home/shengeng/hce/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/MapRed/HadoopCommitter.cc:247 in virtual bool HCE::HadoopCommitter::needsTaskCommit()

syslog logs:
.......................
2011-08-10 12:13:51,450 ERROR org.apache.hadoop.mapred.hce.BinaryProtocol: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,450 ERROR org.apache.hadoop.mapred.hce.Application: Aborting because of java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,450 INFO org.apache.hadoop.mapred.hce.BinaryProtocol: Sent abort command
2011-08-10 12:13:51,496 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: hce child exception
    at org.apache.hadoop.mapred.hce.Application.abort(Application.java:325)
    at org.apache.hadoop.mapred.hce.HceMapRunner.run(HceMapRunner.java:87)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:369)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,500 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
The log leads us to the code:
In HadoopCommitter.cc,
bool HadoopCommitter::needsTaskCommit()
string ugiInfo = taskContext->getJobConf()->get("hadoop.job.ugi"); // looks up hadoop.job.ugi, but the default HCE configuration does not define this property
words = HadoopUtils::splitString(ugiInfo, ",");
HADOOP_ASSERT(words.size() == 2, "error when parsing UgiInfo"); // so the assertion throws here
The fix is to add the property to hdfs-site.xml:
<property>
  <name>hadoop.job.ugi</name>
  <value>hadoop,supergroup</value>
</property>
Reading the code further suggests that this property never actually takes effect in HCE: needsTaskCommit() merely reads it and validates its format, without ever using the value.