Original post: http://www.iteblog.com/archives/848
Because this article is fairly long, it has been split into two parts. Before reading this part, please read part one, 《Hadoop多文件輸出:MultipleOutputFormat和MultipleOutputs深究(一)》. Apologies for any inconvenience this causes.
Unlike MultipleOutputFormat, MultipleOutputs can produce different types for different outputs. At this point we are still talking about the old-API MultipleOutputs; the enhanced new-API version is covered later. Let's use the old MultipleOutputs class to show how it emits different types to different outputs. Instead of asking for a file name for every record, MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and its own key/value types, and the MapReduce program decides what to write to each collector (see the English documentation quoted above). If that sounds confusing, look at the code: the program below writes geography-related fields to files whose names start with geo, and time-related fields to files whose names start with chrono:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class OldMulOutput {
    public static class MapClass
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;
        private OutputCollector<NullWritable, Text> collector;

        public void configure(JobConf conf) {
            mos = new MultipleOutputs(conf);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, Text> output,
                        Reporter reporter) throws IOException {
            String[] arr = value.toString().split(",", -1);
            String chrono = arr[1] + "," + arr[2];
            String geo = arr[4] + "," + arr[5];
            collector = mos.getCollector("chrono", reporter);
            collector.collect(NullWritable.get(), new Text(chrono));
            collector = mos.getCollector("geo", reporter);
            collector.collect(NullWritable.get(), new Text(geo));
        }

        public void close() throws IOException {
            mos.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String[] remainingArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();

        if (remainingArgs.length != 2) {
            System.err.println("Error!");
            System.exit(1);
        }

        JobConf job = new JobConf(conf, OldMulOutput.class);
        Path in = new Path(remainingArgs[0]);
        Path out = new Path(remainingArgs[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MultiFile");
        job.setMapperClass(MapClass.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        MultipleOutputs.addNamedOutput(job, "chrono",
                TextOutputFormat.class,
                NullWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "geo",
                TextOutputFormat.class,
                NullWritable.class, Text.class);
        JobClient.runJob(job);
    }
}
The program above comes from Hadoop in Action. As before, package it into a jar file (packaging details are omitted here) and run it on Hadoop 2.2.0 (the test data can be downloaded from http://pan.baidu.com/s/1td8xN):
/home/q/hadoop-2.2.0/bin/hadoop jar \
    /export1/tmp/wyp/OutputText.jar com.wyp.OldMulOutput \
    /home/wyp/apat63_99.txt \
    /home/wyp/out5
After the job finishes, check the results under /home/wyp/out5:
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out5
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/_SUCCESS
-rw-r--r--   3 wyp sg  31243 2013-11-26 15:57 /home/wyp/out5/chrono-m-00000
-rw-r--r--   3 wyp sg  22719 2013-11-26 15:57 /home/wyp/out5/chrono-m-00001
-rw-r--r--   3 wyp sg  29922 2013-11-26 15:57 /home/wyp/out5/geo-m-00000
-rw-r--r--   3 wyp sg  20429 2013-11-26 15:57 /home/wyp/out5/geo-m-00001
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/part-m-00000
-rw-r--r--   3 wyp sg      0 2013-11-26 14:57 /home/wyp/out5/part-m-00001
Notice that the output still contains files whose names start with part, but they are empty. These are the job's default output files. A named output cannot be called part, because that name is already reserved for the default output.
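MultipleOutputs also validates named-output names at configuration time: the name must be a non-empty run of letters and digits, and in practice you must avoid the reserved default base name part. A minimal standalone sketch of such a check (the method name isValidNamedOutput is mine, not a Hadoop API):

```java
public class NamedOutputCheck {
    // Mirrors the constraint MultipleOutputs places on named outputs:
    // letters and digits only, and never the reserved default name "part".
    public static boolean isValidNamedOutput(String name) {
        return name != null
                && !name.isEmpty()
                && !name.equals("part")
                && name.matches("[A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidNamedOutput("geo"));    // a valid named output
        System.out.println(isValidNamedOutput("part"));   // reserved default name
        System.out.println(isValidNamedOutput("geo-1"));  // '-' is not allowed
    }
}
```

This is why the named outputs in the program above are called chrono and geo rather than, say, chrono-file.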
As the program shows, the old MultipleOutputs can split output by column, but if you want to split by row you have to implement that yourself, which is quite inconvenient. To address this, the new-API MultipleOutputs merges the functionality of the old MultipleOutputs and MultipleOutputFormat; in other words, the new MultipleOutputs has the capabilities of both old classes. Accordingly, the MultipleOutputFormat class no longer exists in the new API: since MultipleOutputs now covers its functionality, there is no need for it. Here is what the official documentation says:
The MultipleOutputs class simplifies writing output data to multiple outputs
Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.
Case two: to write data to different files provided by user
And here is what Hadoop: The Definitive Guide (3rd edition, Early Release), p. 251, says:
In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory
structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API.
In other words, the new MultipleOutputs combines the functionality of the old MultipleOutputs and MultipleOutputFormat, and the new API lives in the mapreduce package. Given that the new MultipleOutputs now has the old MultipleOutputFormat's capabilities, how do we reproduce the old MultipleOutputFormat-style multi-file output (the first program in part one) with the new MultipleOutputs? See the code below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class MulOutput {
    public static class MapClass
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs mos;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            super.setup(context);
            mos = new MultipleOutputs(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            mos.write(NullWritable.get(), value, generateFileName(value));
        }

        private String generateFileName(Text value) {
            String[] split = value.toString().split(",", -1);
            String country = split[4].substring(1, 3);
            // Return reconstructed from the output layout below,
            // e.g. "VN/part" -> <outdir>/VN/part-m-00000
            return country + "/part";
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();
            super.cleanup(context);
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MulOutput");
        String[] remainingArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();

        if (remainingArgs.length != 2) {
            System.err.println("Error!");
            System.exit(1);
        }

        Path in = new Path(remainingArgs[0]);
        Path out = new Path(remainingArgs[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJarByClass(MulOutput.class);
        job.setMapperClass(MapClass.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
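The path-derivation logic in generateFileName can be exercised on its own. Below is a standalone sketch using a made-up record shaped like a row of apat63_99.txt (country code quoted in the fifth field); the sample record and the "/part" suffix are illustrative assumptions, not actual test data:

```java
public class FileNameSketch {
    // Same parsing as generateFileName above: split the CSV row, take
    // the fifth field (a quoted country code such as "BE"), strip the
    // opening quote and keep two characters.
    static String generateFileName(String line) {
        String[] split = line.split(",", -1);
        String country = split[4].substring(1, 3);
        return country + "/part"; // baseOutputPath: one subdirectory per country
    }

    public static void main(String[] args) {
        // hypothetical record in the apat63_99.txt layout
        String record = "3070801,1963,1096,,\"BE\",\"\",,1,,269,6,69";
        System.out.println(generateFileName(record)); // BE/part
    }
}
```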
The program above initializes the MultipleOutputs object in setup(Context context), and in the map function it calls MultipleOutputs.write to send each record to a directory determined by the record's value (computed by generateFileName). MultipleOutputs has several write overloads, with the following prototypes:
public <K, V> void write(String namedOutput, K key, V value)
        throws IOException, InterruptedException

public <K, V> void write(String namedOutput, K key, V value,
        String baseOutputPath) throws IOException, InterruptedException

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
        throws IOException, InterruptedException
We can call whichever write variant fits our needs, depending on whether records should go to a named output, a custom path, or both.
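As a side note on where these writes end up: the resulting file names follow the pattern base-taskType-partition, where the base is a named output name ("geo") or a baseOutputPath ("VN/part"). A standalone sketch of this naming convention as observed in the listings in this post (the helper method is mine, not a Hadoop API):

```java
public class OutputNameSketch {
    // Builds <base>-<taskType>-<5-digit partition>, matching file names
    // like geo-m-00000 (named output) and VN/part-m-00001 (baseOutputPath).
    static String outputFileName(String base, String taskType, int partition) {
        return String.format("%s-%s-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        System.out.println(outputFileName("geo", "m", 0));     // geo-m-00000
        System.out.println(outputFileName("VN/part", "m", 1)); // VN/part-m-00001
    }
}
```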
Now let's run the program and look at the results:
/home/q/hadoop-2.2.0/bin/hadoop jar \
    /export1/tmp/wyp/OutputText.jar com.wyp.MulOutput \
    /home/wyp/apat63_99.txt \
    /home/wyp/out11
The test data is the same as above. Let's see what is in /home/wyp/out11:
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out11
...(many entries omitted here)...
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/VN
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:41 /home/wyp/out11/VU
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/YE
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/YU
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/ZA
...(many entries omitted here)...
-rw-r--r--   3 wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/_SUCCESS
-rw-r--r--   3 wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/part-m-00000
-rw-r--r--   3 wyp supergroup      0 2013-11-26 19:42 /home/wyp/out11/part-m-00001
The output is now very similar to what the old MultipleOutputFormat produced, but it still contains two empty files whose names start with part. What is going on? As in the second test program, these are the job's default output files. Can we keep them out of the output? Of course. It is very simple: just add the following line to the main function of the code above:
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
If you add the line above, also comment out the following line in your code (if present):
job.setOutputFormatClass(TextOutputFormat.class);
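LazyOutputFormat works by wrapping the real output format in a record writer that only creates the underlying file when the first record arrives, so a mapper that writes everything through MultipleOutputs and nothing through the default collector produces no empty part file. A standalone sketch of that "create on first write" idea (the class and method names are mine, not Hadoop's):

```java
import java.util.ArrayList;
import java.util.List;

public class LazyWriterSketch {
    // Stand-in for a real RecordWriter that opens its file immediately.
    static class EagerWriter {
        final List<String> lines = new ArrayList<>();
        void write(String record) { lines.add(record); }
    }

    // Lazy wrapper: defers creating the real writer until the first record,
    // which is what keeps empty part files out of the output directory.
    static class LazyWriter {
        EagerWriter real; // null until the first write -> no empty file
        void write(String record) {
            if (real == null) real = new EagerWriter();
            real.write(record);
        }
        boolean fileCreated() { return real != null; }
    }

    public static void main(String[] args) {
        LazyWriter idle = new LazyWriter(); // never written to
        LazyWriter used = new LazyWriter();
        used.write("record");
        System.out.println(idle.fileCreated()); // false: no empty part file
        System.out.println(used.fileCreated()); // true
    }
}
```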
Now look at the output again:
[wyp@l-datalogm1.data.cn1 bin]$ /home/q/hadoop-2.2.0/bin/hadoop fs \
    -ls /home/wyp/out12
...(many entries omitted here)...
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/VU
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/YE
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/YU
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/ZA
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/ZM
drwxr-xr-x   - wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/ZW
...(many entries omitted here)...
-rw-r--r--   3 wyp supergroup      0 2013-11-26 19:44 /home/wyp/out12/_SUCCESS
The result now matches example one exactly. (End of article.)