Hadoop Multiple File Output: A Deep Dive into MultipleOutputFormat and MultipleOutputs

Original blog post: http://www.iteblog.com/archives/848

 

Because this article is quite long, it has been split into two parts. Before reading this part, please read Part One: "Hadoop Multiple File Output: A Deep Dive into MultipleOutputFormat and MultipleOutputs (Part One)". Apologies for any inconvenience.

  Unlike MultipleOutputFormat, MultipleOutputs can produce different output types for different outputs. The MultipleOutputs discussed here is still the old-API version; the enhanced new-API MultipleOutputs is covered later. Let's use the old MultipleOutputs to show how it produces different types for different outputs. Rather than asking for a file name for each record, MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and its own key/value types, and the MapReduce program decides what to write to each OutputCollector (see the English documentation quoted above). If that sounds confusing, the code will make it concrete. The program below stores geography-related information in files prefixed with geo, and time-related information in files prefixed with chrono:

 
package com.wyp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * Date: 13-11-27
 * Time: 10:32 PM
 */
public class OldMulOutput {
    public static class MapClass
            extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;
        private OutputCollector<NullWritable, Text> collector;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, Text> output,
                        Reporter reporter) throws IOException {
            String[] arr = value.toString().split(",", -1);
            String chrono = arr[1] + "," + arr[2];
            String geo = arr[4] + "," + arr[5];
            collector = mos.getCollector("chrono", reporter);
            collector.collect(NullWritable.get(), new Text(chrono));
            collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), new Text(geo));
        }

        public void close() throws IOException {
            mos.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String[] remainingArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();

        if (remainingArgs.length != 2) {
            System.err.println("Error!");
            System.exit(1);
        }

        JobConf job = new JobConf(conf, OldMulOutput.class);
        Path in = new Path(remainingArgs[0]);
        Path out = new Path(remainingArgs[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        MultipleOutputs.addNamedOutput(job, "chrono",
                TextOutputFormat.class, NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "geo",
                TextOutputFormat.class, NullWritable.class, Text.class);

        JobClient.runJob(job);
    }
}
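The mapper above only does simple comma splitting of each patent record. Stripped of everything Hadoop-specific, the chrono/geo field extraction can be sketched in plain Java; the sample record below is hypothetical, shaped like the apat63_99.txt patent data (PATENT,GYEAR,GDATE,APPYEAR,COUNTRY,POSTATE,...):

```java
public class FieldExtractDemo {
    // Mirrors the old-API mapper: split a patent record on commas and
    // keep the time-related (chrono) and geography-related (geo) fields.
    static String chrono(String record) {
        String[] arr = record.split(",", -1); // -1 keeps trailing empty fields
        return arr[1] + "," + arr[2];         // grant year, grant date
    }

    static String geo(String record) {
        String[] arr = record.split(",", -1);
        return arr[4] + "," + arr[5];         // country, state
    }

    public static void main(String[] args) {
        // Hypothetical record in the apat63_99.txt CSV layout
        String record = "3070801,1963,1096,,\"BE\",\"\"";
        System.out.println(chrono(record)); // 1963,1096
        System.out.println(geo(record));    // "BE",""
    }
}
```

Note the `split(",", -1)`: without the `-1` limit, Java would drop trailing empty fields, which would break records whose final columns are blank.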

The program above comes from Hadoop in Action. As before, package it into a jar file (packaging details omitted) and run it on Hadoop 2.2.0 (the test data can be downloaded here: http://pan.baidu.com/s/1td8xN):

 
/home/q/hadoop-2.2.0/bin/hadoop jar \
    /export1/tmp/wyp/OutputText.jar com.wyp.OldMulOutput \
    /home/wyp/apat63_99.txt \
    /home/wyp/out5

After the job finishes, inspect the results in /home/wyp/out5:

 
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out5
Found 7 items
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/_SUCCESS
-rw-r--r--   3 wyp sg  31243 2013-11-26 15:57 /home/wyp/out5/chrono-m-00000
-rw-r--r--   3 wyp sg  22719 2013-11-26 15:57 /home/wyp/out5/chrono-m-00001
-rw-r--r--   3 wyp sg  29922 2013-11-26 15:57 /home/wyp/out5/geo-m-00000
-rw-r--r--   3 wyp sg  20429 2013-11-26 15:57 /home/wyp/out5/geo-m-00001
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/part-m-00000
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/part-m-00001

  Notice that the output still contains files whose names begin with part, but they are empty. These are the job's default output files. A named output collector cannot be called part, because that name is already reserved for the default output.
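The restriction is actually broader than just the word part: named outputs registered with MultipleOutputs must consist of letters and digits only, so they can never collide with the part-NNNNN default names. Below is a rough, illustrative re-implementation of that naming rule in plain Java (the method name isValidNamedOutput is made up here; it is not part of the Hadoop API):

```java
public class NamedOutputCheckDemo {
    // Illustrative sketch of the naming rule MultipleOutputs enforces:
    // letters and digits only, and never the reserved name "part".
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty() || name.equals("part")) {
            return false;
        }
        for (char c : name.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false; // e.g. '-' or '_' is rejected
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidNamedOutput("chrono")); // true
        System.out.println(isValidNamedOutput("part"));   // false: reserved
        System.out.println(isValidNamedOutput("geo-1"));  // false: '-' not allowed
    }
}
```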
  As the program shows, the old MultipleOutputs can split output by column, but if you want row-based splitting you have to implement it yourself, which is quite inconvenient. To address this, the new-API MultipleOutputs merges the functionality of the old MultipleOutputs and MultipleOutputFormat; in other words, the new MultipleOutputs has the capabilities of both old classes. Accordingly, MultipleOutputFormat no longer exists in the new API: since MultipleOutputs now covers everything it did, there is no reason to keep it. Here is what the official documentation says:

  The MultipleOutputs class simplifies writing output data to multiple outputs
  Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.
  Case two: to write data to different files provided by user

And here is what Hadoop: The Definitive Guide (3rd Edition, Early Release), p. 251, has to say:

  In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API.

  In other words, the new MultipleOutputs merges the functionality of the old MultipleOutputs and MultipleOutputFormat, and the new API lives in the mapreduce package. So, given that the new MultipleOutputs includes the old MultipleOutputFormat's capabilities, how do we reproduce the old MultipleOutputFormat-style multi-file output (the first program in Part One) with the new MultipleOutputs? See the code below.

 
package com.wyp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * Date: 13-11-26
 * Time: 2:27 PM
 */
public class MulOutput {
    public static class MapClass
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            super.setup(context);
            mos = new MultipleOutputs(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            mos.write(NullWritable.get(), value, generateFileName(value));
        }

        private String generateFileName(Text value) {
            String[] split = value.toString().split(",", -1);
            String country = split[4].substring(1, 3);
            return country + "/";
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            super.cleanup(context);
            mos.close();
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] remainingArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();

        if (remainingArgs.length != 2) {
            System.err.println("Error!");
            System.exit(1);
        }

        Job job = Job.getInstance(conf, "MulOutput");
        Path in = new Path(remainingArgs[0]);
        Path out = new Path(remainingArgs[1]);

        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJarByClass(MulOutput.class);
        job.setMapperClass(MapClass.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program above initializes the MultipleOutputs object in setup(Context context), and in the map function calls MultipleOutputs' write method to send each record to a different directory depending on its value (the directory name is computed by generateFileName). MultipleOutputs has several write overloads, with the following signatures:
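The generateFileName logic deserves a note: the COUNTRY field in the test data is quoted (e.g. "US"), so substring(1, 3) skips the opening quote and keeps the two-letter country code, which then becomes the output subdirectory. Here is a Hadoop-free sketch of that logic (the sample record is hypothetical, shaped like the apat63_99.txt data):

```java
public class GenerateFileNameDemo {
    // Mirrors the mapper's generateFileName(): pull the quoted country
    // field out of a CSV record and turn it into a subdirectory prefix.
    static String generateFileName(String line) {
        String[] split = line.split(",", -1);
        String country = split[4].substring(1, 3); // "US" -> US (skip the quote)
        return country + "/";
    }

    public static void main(String[] args) {
        // Hypothetical record in the apat63_99.txt CSV layout
        String record = "3070801,1963,1096,,\"US\",\"TX\"";
        System.out.println(generateFileName(record)); // US/
    }
}
```

Because the returned baseOutputPath ends with "/", Hadoop treats US, BE, and so on as directories and writes the usual part-style files inside each one, which is exactly the directory layout shown in the listing below.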

 
public <K, V> void write(String namedOutput, K key, V value)
        throws IOException, InterruptedException

public <K, V> void write(String namedOutput, K key, V value,
        String baseOutputPath) throws IOException, InterruptedException

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
        throws IOException, InterruptedException

We can call whichever write overload fits our needs.
Now let's run the program and look at the results:

 
/home/q/hadoop-2.2.0/bin/hadoop jar \
    /export1/tmp/wyp/OutputText.jar com.wyp.MulOutput \
    /home/wyp/apat63_99.txt \
    /home/wyp/out11

The test data is the same as before. Here is what /home/wyp/out11 contains:

 
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out11
............................. (many entries omitted) .............................
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/VN
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:41 /home/wyp/out11/VU
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/YE
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/YU
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/ZA
............................. (many entries omitted) .............................
-rw-r--r--   3 wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/_SUCCESS
-rw-r--r--   3 wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/part-m-00000
-rw-r--r--   3 wyp supergroup  0 2013-11-26 19:42 /home/wyp/out11/part-m-00001

  The output now looks very similar to what the old MultipleOutputFormat produced, but there are still two empty part-prefixed files in the output. What is going on? As in the previous test program, these are the job's default output files. Can we keep them out of the output entirely? Of course. It is simple: add the following line to the main function of the code above:

 
LazyOutputFormat.setOutputFormatClass(job,
        TextOutputFormat.class);

If you add that line, also comment out the following line in your code (if present):

 
job.setOutputFormatClass(TextOutputFormat.class);

Now check the output again:

 
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out12
............................. (many entries omitted) .............................
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/VU
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/YE
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/YU
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/ZA
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/ZM
drwxr-xr-x   - wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/ZW
............................. (many entries omitted) .............................
-rw-r--r--   3 wyp supergroup  0 2013-11-26 19:44 /home/wyp/out12/_SUCCESS

The result is exactly the same as in the first example. (End of article)
