RCFile Storage and Read Operations

At work I have been using RCFile to store and read data in the RCFile format, so I am writing down my notes here.
RCFile, developed by Facebook, combines the advantages of row-oriented and column-oriented storage in one format: it compresses better and reads individual columns faster, and it plays an important role in large-scale data processing in MapReduce environments.
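Throughout the code below, one logical row travels as a BytesRefArrayWritable holding one BytesRefWritable per column; that is the value type RCFileInputFormat hands to the mapper on the read side and the value type the write-side mapper emits. A minimal sketch of building such a row (the column values are made-up examples):

// One row with three example columns. BytesRefArrayWritable / BytesRefWritable
// are the Hive serde2 columnar classes used by both mappers below.
byte[][] cols = { "1001".getBytes("UTF-8"), "2012-06-28".getBytes("UTF-8"), "ok".getBytes("UTF-8") };
BytesRefArrayWritable row = new BytesRefArrayWritable(cols.length);
for (int i = 0; i < cols.length; i++) {
    row.set(i, new BytesRefWritable(cols[i], 0, cols[i].length));
}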
Read operation:

Job setup:
Job job = new Job();
job.setJarByClass(YourDriver.class);   // replace with your driver class
// use RCFile as the input format
job.setInputFormatClass(RCFileInputFormat.class);
// write plain text output
job.setOutputFormatClass(TextOutputFormat.class);
// input path
RCFileInputFormat.addInputPath(job, new Path(srcpath));
//MultipleInputs.addInputPath(job, new Path(srcpath), RCFileInputFormat.class);
// output path
TextOutputFormat.setOutputPath(job, new Path(respath));
// output key type
job.setOutputKeyClass(Text.class);
// output value type
job.setOutputValueClass(NullWritable.class);
// mapper class
job.setMapperClass(ReadTestMapper.class);
// No reducer is set here: the mapper already converts each RCFile row into plain text, so the job can run map-only and write Text output directly.

code = (job.waitForCompletion(true)) ? 0 : 1;
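None of the RCFile classes ship with plain Hadoop; in my environment they come from the Hive jars. Roughly, the driver above needs imports along these lines (note that Hive's own org.apache.hadoop.hive.ql.io.RCFileInputFormat is written against the old mapred API, so a job that calls job.setInputFormatClass needs a new-API port of it; check which variant your jar provides):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
// columnar row classes shipped with Hive (package names may differ across versions):
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;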


// Mapper class

public class ReadTestMapper extends Mapper<LongWritable, BytesRefArrayWritable, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, BytesRefArrayWritable value, Context context) throws IOException, InterruptedException {
        Text txt = new Text();
        // Each incoming value is one RCFile row, i.e. a collection of column values.
        // Walk the columns and join them with tabs.
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < value.size(); i++) {
            BytesRefWritable v = value.get(i);
            txt.set(v.getData(), v.getStart(), v.getLength());
            if (i == value.size() - 1) {
                sb.append(txt.toString());
            } else {
                sb.append(txt.toString() + "\t");
            }
        }
        context.write(new Text(sb.toString()), NullWritable.get());
    }
}
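A big part of the "faster column reads" claim is that an RCFile reader can skip the columns it never touches. If the RCFileInputFormat you use honours Hive's column-projection settings, you can ask for a subset of columns before submitting the read job. A sketch, assuming Hive's ColumnProjectionUtils helper (older Hive versions expose setReadColumnIDs; verify the method and property names against your version):

// Only materialize columns 0 and 2 of each row group (hypothetical choice of columns).
java.util.ArrayList<Integer> wantedColumns = new java.util.ArrayList<Integer>();
wantedColumns.add(0);
wantedColumns.add(2);
ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), wantedColumns);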



Writing compressed RCFile output:

Job setup:
Job job = new Job();
Configuration conf = job.getConfiguration();
// number of columns per output row (must match the number of fields the mapper emits, four here)
RCFileOutputFormat.setColumnNumber(conf, 4);
job.setJarByClass(YourDriver.class);   // replace with your driver class

FileInputFormat.setInputPaths(job, new Path(srcpath));
RCFileOutputFormat.setOutputPath(job, new Path(respath));

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(RCFileOutputFormat.class);

job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(BytesRefArrayWritable.class);

job.setMapperClass(OutPutTestMapper.class);

conf.set("date", line.getOptionValue(DATE));   // "date" filter passed in from the command-line options
// enable gzip compression of the output
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");

code = (job.waitForCompletion(true)) ? 0 : 1;
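The two mapred.* keys above are the old-style property names. The same settings can also be made through the FileOutputFormat helpers, which just write the corresponding configuration properties; a sketch (assuming the RCFileOutputFormat in use respects the standard FileOutputFormat compression settings):

// Equivalent compression setup via the new-API FileOutputFormat helpers.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);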

Mapper class:
public class OutPutTestMapper extends Mapper<LongWritable, Text, LongWritable, BytesRefArrayWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String day = context.getConfiguration().get("date");
        if (!line.equals("")) {
            String[] lines = line.split(" ", -1);
            if (lines.length > 3) {
                String time_temp = lines[1];
                String times = timeStampDate(time_temp);
                String d = times.substring(0, 10);
                // keep only the records whose day matches the "date" parameter
                if (day.equals(d)) {
                    byte[][] record = {lines[0].getBytes("UTF-8"), lines[1].getBytes("UTF-8"),
                            lines[2].getBytes("UTF-8"), lines[3].getBytes("UTF-8")};

                    BytesRefArrayWritable bytes = new BytesRefArrayWritable(record.length);

                    // wrap each field as one column of the output row
                    for (int i = 0; i < record.length; i++) {
                        BytesRefWritable cu = new BytesRefWritable(record[i], 0, record[i].length);
                        bytes.set(i, cu);
                    }
                    context.write(key, bytes);
                }
            }
        }
    }
}
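The timeStampDate helper is omitted above; it turns the timestamp field of the log line into a string whose first ten characters are the day (e.g. "yyyy-MM-dd HH:mm:ss"), which is what the substring(0, 10) comparison relies on. A minimal sketch of such a helper, assuming the field is a Unix timestamp in seconds:

// Hypothetical implementation of the helper referenced in the mapper above:
// convert a Unix timestamp in seconds to "yyyy-MM-dd HH:mm:ss".
private static String timeStampDate(String timestampSeconds) {
    java.text.SimpleDateFormat sdf = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    return sdf.format(new java.util.Date(Long.parseLong(timestampSeconds.trim()) * 1000L));
}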




If you repost this, please link back to the original source and author.
Permalink: http://smallboby.iteye.com/blog/1592531.
Thanks. Corrections and suggestions are welcome.