Several common ways to run MapReduce jobs on Hadoop

Preliminaries:

Test file:

echo -e "aa\tbb \tcc\nbb\tcc\tdd" > 3.txt
hadoop fs -put 3.txt /tmp/3.txt

All the examples in this article use this file as the test case: counting how many times each word occurs (WordCount). Note that some later sections reference the same file at /data/3.txt rather than /tmp/3.txt; adjust the path to wherever you actually put it.

1. The native approach: compile the Java source, package it into a jar, and submit it with the hadoop launcher script. For example:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        /**
         * LongWritable, IntWritable and Text are Hadoop classes that wrap the
         * corresponding Java types. They implement the WritableComparable
         * interface, so they can be serialized for data exchange in a
         * distributed environment; think of them as replacements for
         * long, int and String.
         */
        // IntWritable one is the equivalent of the Java int literal 1
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // map() is called once per input record; here every line gets tokenized
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                // emit each word with a count of 1, e.g. <word1,1><word2,1><word1,1>
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // pairs with the same key are routed to the same reduce call,
            // e.g. <word1,<1,1>> is handled by one reducer and <word2,<1>> by another
            int sum = 0;
            // accumulate the occurrence counts of the same key (word)
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // once a key has been processed, emit the key (word) and its total count
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // on a multi-queue Hadoop cluster, pick the queue to submit to
        conf.set("mapred.job.queue.name", "regular");
        // GenericOptionsParser is used instead of reading args[1] directly so that
        // cluster-side options (such as the queue setting) are stripped out and
        // only the user's own arguments, e.g. the paths, remain
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        // mapper, combiner and reducer classes, plus the output key/value types
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // waitForCompletion blocks until the job finishes, so a driver that
        // submits several jobs this way runs them serially
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
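
Before running, the source has to be compiled and packaged into a jar. A minimal sketch (not from the original article), assuming a Hadoop 1.x layout where the core classes ship as $HADOOP_HOME/hadoop-core-0.20.203.0.jar (adjust the jar name to match your version):

# compile against the Hadoop core jar, then bundle the classes into wordcount.jar
mkdir -p wordcount_classes
javac -classpath $HADOOP_HOME/hadoop-core-0.20.203.0.jar -d wordcount_classes WordCount.java
jar -cvf /tmp/wordcount.jar -C wordcount_classes/ .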

Run:

bin/hadoop jar /tmp/wordcount.jar WordCount /tmp/3.txt /tmp/5

Result:

hadoop fs -cat /tmp/5/*
aa      1
bb      2
cc      2
dd      1

References:

Hadoop Map/Reduce: understanding the changes in the new Hadoop API through the evolution of the WordCount example

http://blog.csdn.net/derekjiang/article/details/6836209

Running and dissecting the Hadoop WordCount example program

http://samuschen.iteye.com/blog/763940

The official WordCount v1.0 example

http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html#Example%3A+WordCount+v1.0

2. Pig: a SQL-like dataflow scripting language built on MapReduce

A1 = load '/data/3.txt';
A = stream A1 through `sed "s/\t/ /g"`;
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
dump E;

Note: the load delimiter determines how each line is split, and therefore what $0 refers to later. A plain load defaults to PigStorage with tab as the field delimiter, so $0 would be only the first column; the sed stream above turns the tabs into spaces so that the whole line ends up in a single field.
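
To see the delimiter effect directly, a quick check (an illustrative one-liner, not from the original post; it assumes pig is on the PATH and a running cluster):

# with the default PigStorage loader tab is the field delimiter, so each line of
# 3.txt is already split into three fields and $0 refers only to the first column
pig -e "A1 = load '/data/3.txt'; dump A1;"
# the dumped tuples should look like (aa,bb ,cc) and (bb,cc,dd)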

For details, see:
https://gist.github.com/186460
http://www.slideshare.net/erikeldridge/a-brief-handson-introduction-to-hadoop-pig

3. Hive: a SQL-like language for building data warehouses

create table textlines(text string);
load data inpath '/data/3.txt' overwrite into table textlines;
SELECT wordColumn, count(1) FROM textlines LATERAL VIEW explode(split(text,'\t+')) wordTable AS wordColumn GROUP BY wordColumn;
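
The same statements can also be run non-interactively from the shell. A small sketch (the script name wordcount.hql is illustrative, not from the original post):

# save the three statements above as wordcount.hql and feed it to the Hive CLI
hive -f wordcount.hql
# or run just the query inline
hive -e "SELECT wordColumn, count(1) FROM textlines LATERAL VIEW explode(split(text,'\t+')) wordTable AS wordColumn GROUP BY wordColumn;"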

For details, see:

http://my.oschina.net/leejun2005/blog/83045
http://blog.csdn.net/techdo/article/details/7433222

4. A cross-platform scripting language: Python (run through Hadoop Streaming)

map:

#!/usr/bin/python
import sys

# each input line is one record; split it on tabs and emit one word per line
for line in sys.stdin:
    for word in line.strip().split("\t"):
        print word

reduce:

#!/usr/bin/python
import sys

# Hadoop Streaming sorts the mapper output by key, so all occurrences of the
# same word reach the same reducer; count them in a dict and print the totals
counts = {}
for line in sys.stdin:
    word = line.strip()
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1
for word, count in counts.items():
    print str(word) + ": " + str(count)

Finally, run it from the shell:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.203.0.jar -file map.py -file reduce.py -mapper map.py -reducer reduce.py -input /data/3.txt -output /data/py

Note: the script must explicitly name its interpreter on the first (shebang) line and must be given execute permission.
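
For example (a small illustrative check, not part of the original post):

# the shebang on the first line picks the interpreter; the scripts must also be executable
chmod +x map.py reduce.py
head -n 1 map.py    # should print the interpreter line, i.e. #!/usr/bin/python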

For details, see:
http://blog.csdn.net/jiedushi/article/details/7390015

5. The Swiss Army knife of Linux: shell scripts

map:

#!/bin/bash
# turn every tab into a newline so that each word is emitted on its own line
tr '\t' '\n'

reduce:

#!/bin/bash
# the framework delivers the mapper output sorted; sort again to be safe, then
# let uniq -c prepend the count of each distinct word
sort | uniq -c
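
Because Hadoop Streaming simply pipes records through the mapper, a sort phase and the reducer, the logic can be smoke-tested locally first. A small sketch (it assumes the two scripts above were saved as map.sh and reduce.sh; the transcript below actually reuses the map.py/reduce.py names):

chmod +x map.sh reduce.sh
cat 3.txt | ./map.sh | sort | ./reduce.sh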

Finally, run it from the shell (the transcript below reuses the map.py and reduce.py file names from the previous section; with Hadoop Streaming it is the shebang line, not the file extension, that decides which interpreter runs the script):

june@deepin:~/hadoop/hadoop-0.20.203.0/tmp>
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.203.0.jar -file map.py -file reduce.py  -mapper map.py -reducer reduce.py -input /data/3.txt -output /data/py
packageJobJar: [map.py, reduce.py, /home/june/data_hadoop/tmp/hadoop-unjar2676221286002400849/] [] /tmp/streamjob8722854685251202950.jar tmpDir=null
12/10/14 21:57:00 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/14 21:57:00 INFO streaming.StreamJob: getLocalDirs(): [/home/june/data_hadoop/tmp/mapred/local]
12/10/14 21:57:00 INFO streaming.StreamJob: Running job: job_201210141552_0041
12/10/14 21:57:00 INFO streaming.StreamJob: To kill this job, run:
12/10/14 21:57:00 INFO streaming.StreamJob: /home/june/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201210141552_0041
12/10/14 21:57:00 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201210141552_0041
12/10/14 21:57:01 INFO streaming.StreamJob:  map 0%  reduce 0%
12/10/14 21:57:13 INFO streaming.StreamJob:  map 67%  reduce 0%
12/10/14 21:57:19 INFO streaming.StreamJob:  map 100%  reduce 0%
12/10/14 21:57:22 INFO streaming.StreamJob:  map 100%  reduce 22%
12/10/14 21:57:31 INFO streaming.StreamJob:  map 100%  reduce 100%
12/10/14 21:57:37 INFO streaming.StreamJob: Job complete: job_201210141552_0041
12/10/14 21:57:37 INFO streaming.StreamJob: Output: /data/py
june@deepin:~/hadoop/hadoop-0.20.203.0/tmp>
hadoop fs -cat /data/py/part-00000
      1 aa 
      1 bb 
      1 bb 
      2 cc 
      1 dd 
june@deepin:~/hadoop/hadoop-0.20.203.0/tmp>


Special note: some of the methods above strip the trailing space after a field, while others count "bb " and "bb" as different words, so examine the results carefully.
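
For instance, this illustrative check (not from the original post) exposes the trailing space hiding after the second field of the test data; cat -A marks each line end with $:

echo -e "aa\tbb \tcc" | tr '\t' '\n' | cat -A
# prints:
# aa$
# bb $
# cc$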


Closing remarks: the point of listing all these approaches is to offer different ways of thinking about a problem. Development efficiency and execution efficiency both matter when solving it, so don't lock yourself into any single method.