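Hadoop multi-table join: a map side join example

Without Pig or Hive, implementing a join by hand in MapReduce is a real pain. MapReduce joins come in several flavors; the most common are the reduce side join, the map side join, and the semi join. This post covers the second one, the map side join. It is very useful when several small tables have to be joined to one large table, whereas a reduce side join is awkward for multi-table joins because it can only handle one table at a time.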

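1. Principle:

A reduce side join exists because the map phase cannot see all the fields the join needs: records with the same key may sit in different map tasks. But a reduce side join is very inefficient, because the shuffle phase has to transfer a large amount of data. A map side join is an optimization for the following case: of the two tables to be joined, one is very large and the other is small enough to fit in memory. We can then replicate the small table so that every map task holds a copy in memory (in a hash table, for example) and scan only the large table: for each key/value record of the large table, look up the key in the hash table and, if a match exists, emit the joined record. To support this file replication, Hadoop provides the DistributedCache class, which is used as follows: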

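(1) The user calls the static method DistributedCache.addCacheFile() to register a file to replicate. Its argument is the file's URI. For a file on HDFS the URI has the form hdfs://<namenode-host>:<port>/home/XXX/file, where the host and port are those of the NameNode (the fs.default.name setting), not the JobTracker; port 50030 is the JobTracker's web UI. Before any task of the job runs on a node, the framework reads this URI list and copies the files to that TaskTracker's local disk.

(2) The user calls DistributedCache.getLocalCacheFiles() to get the local paths of the cached files, then reads them with the standard file I/O API.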
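The idea can be tried without a cluster. The sketch below is plain Java with no Hadoop dependency, so it only illustrates the technique and is not the job itself: it loads a small tab-separated file into a hash table, the way a map task would load its local copy, then scans the "large" table once and emits only the records that find a match. The class name MapSideJoinSketch, the method names, and the sample rows are all made up for illustration, and Path here is java.nio.file.Path, not Hadoop's Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; plain Java, no Hadoop dependency.
class MapSideJoinSketch {

    // Loads a tab-separated file into memory: column 0 is the join key, column 1 the value.
    static Map<String, String> loadSmallTable(Path file) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1);
                if (cols.length >= 2) { // skip lines without both a key and a value
                    table.put(cols[0], cols[1]);
                }
            }
        }
        return table;
    }

    // Scans the large table once and keeps only the records whose key is in the small table.
    static List<String> join(Iterable<String> largeTable, Map<String, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (String record : largeTable) {
            String key = record.split("\t", -1)[0];
            String match = smallTable.get(key);
            if (match != null) {
                joined.add(record + "\t" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: two known keys, plus a large-table record (key 9) with no match.
        Path small = Files.createTempFile("small-table", ".tsv");
        Files.write(small, Arrays.asList("1\talice", "3\tcarol"), StandardCharsets.UTF_8);
        try {
            Map<String, String> lookup = loadSmallTable(small);
            for (String row : join(Arrays.asList("1\t0", "9\t0", "3\t1"), lookup)) {
                System.out.println(row);
            }
        } finally {
            Files.delete(small);
        }
    }
}
```

Because the lookup table lives in memory, the large table is never sorted or shuffled for the join itself; the price is that every task must be able to hold the whole small table.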

2. Environment:

The test files this example needs, and the HDFS directory that holds them, are listed below:

hadoop fs -ls /test/decli
Found 4 items
-rw-r--r--   2 root supergroup        152 2013-03-06 02:05 /test/decli/login
drwxr-xr-x   - root supergroup          0 2013-03-06 02:45 /test/decli/output
-rw-r--r--   2 root supergroup         12 2013-03-06 02:12 /test/decli/sex
-rw-r--r--   2 root supergroup         72 2013-03-06 02:44 /test/decli/user

The test files contain:

root@master 192.168.120.236 02:58:03 ~/test/table >
cat login # login table: check that the uid column is valid, then get the matching user name, sex, and visit count
1       0       20121213
2       0       20121213
3       1       20121213
4       1       20121213
1       0       20121114
2       0       20121114
3       1       20121114
4       1       20121114
1       0       20121213
1       0       20121114
9       0       20121114
root@master 192.168.120.236 02:58:08 ~/test/table >
cat sex # sex table (男 = male, 女 = female)
0       男
1       女
root@master 192.168.120.236 02:58:13 ~/test/table >
cat user # user attribute table
1       張三    hubei
3       王五    tianjin
4       趙六    guangzhou
2       李四    beijing
root@master 192.168.120.236 02:58:16 ~/test/table >

Hadoop version of the test environment:

echo $HADOOP_HOME
/work/hadoop-0.20.203.0

Enough talk, here is the code:

3. Code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiTableJoin extends Configured implements Tool {

    public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

        // In-memory copies of the user and sex files
        private Map<String, String> userMap = new HashMap<String, String>();
        private Map<String, String> sexMap = new HashMap<String, String>();

        private Text oKey = new Text();
        private Text oValue = new Text();

        // Runs once per map task, before any call to map()
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Local paths of the files this job registered in the distributed cache
            Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (paths == null) {
                throw new IOException("no files found in the distributed cache");
            }
            for (Path path : paths) {
                // Match on the file name only, so that a parent directory
                // named "user" cannot cause a false match
                String name = path.getName();
                if (name.contains("user")) {
                    loadTable(path, userMap);
                } else if (name.contains("sex")) {
                    loadTable(path, sexMap);
                }
            }
        }

        // Reads a tab-separated file and stores column 0 -> column 1.
        // Errors are propagated instead of swallowed: a failed load would
        // otherwise leave the maps empty and silently produce no output.
        private static void loadTable(Path path, Map<String, String> table) throws IOException {
            // Decode as UTF-8 explicitly rather than with the platform default charset
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(path.toString()), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split("\t", -1);
                    if (cols.length >= 2) {
                        table.put(cols[0], cols[1]);
                    }
                }
            } finally {
                in.close();
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] kv = value.toString().split("\t");
            // Map join: drop the records we do not need already in the map phase
            if (kv.length >= 2 && userMap.containsKey(kv[0]) && sexMap.containsKey(kv[1])) {
                oKey.set(userMap.get(kv[0]) + "\t" + sexMap.get(kv[1]));
                oValue.set("1");
                context.write(oKey, oValue);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        private Text oValue = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sumCount = 0;
            for (Text val : values) {
                sumCount += Integer.parseInt(val.toString());
            }
            oValue.set(String.valueOf(sumCount));
            context.write(key, oValue);
        }
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "MultiTableJoin");
        job.setJarByClass(MultiTableJoin.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // ToolRunner has already stripped the generic options from args.
        // Expected paths, in order: sex file, user file, login input, output directory.
        // They are read from the end of args because `hadoop jar` also passes the
        // class name through as args[0] when the jar's manifest declares a Main-Class.
        if (args.length < 4) {
            System.err.println("Usage: MultiTableJoin <sex> <user> <login> <output>");
            return 2;
        }
        int base = args.length - 4;

        // The first two paths are the files to cache
        DistributedCache.addCacheFile(new Path(args[base]).toUri(), job.getConfiguration());
        DistributedCache.addCacheFile(new Path(args[base + 1]).toUri(), job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[base + 2]));
        FileOutputFormat.setOutputPath(job, new Path(args[base + 3]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MultiTableJoin(), args);
        System.exit(res);
    }
}

Run command:

hadoop jar MultiTableJoin.jar MultiTableJoin /test/decli/sex /test/decli/user /test/decli/login /test/decli/output

4. Result:

Output of the run:

root@master 192.168.120.236 02:47:18 ~/test/table >
hadoop fs -cat /test/decli/output/*|column -t
cat: File does not exist: /test/decli/output/_logs
張三  男  4
李四  男  2
王五  女  2
趙六  女  2
root@master 192.168.120.236 02:47:26 ~/test/table >
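
The four counts above can be cross-checked locally. The sketch below is plain Java with no Hadoop dependency and only mirrors the job's logic on the sample data: it looks each login row up in both small tables, skips rows without a match (uid 9), and sums the remaining rows per joined key. The class name LocalJoinCheck and its method names are made up for illustration, and the names and sexes are romanized stand-ins for the values in the user and sex files, chosen only to keep the source ASCII. Under those assumptions it should give 4 for the first user and 2 for each of the other three.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; plain Java, no Hadoop dependency.
class LocalJoinCheck {

    // Joins each login row against both lookup tables and counts the rows per joined key.
    static Map<String, Integer> countLogins(String[][] login,
                                            Map<String, String> users,
                                            Map<String, String> sexes) {
        Map<String, Integer> counts = new TreeMap<>(); // kept sorted by key
        for (String[] row : login) {
            String name = users.get(row[0]);
            String sex = sexes.get(row[1]);
            if (name == null || sex == null) {
                continue; // same filter as the map step: no match in one of the tables
            }
            counts.merge(name + "\t" + sex, 1, Integer::sum); // same sum as the reduce step
        }
        return counts;
    }

    // uid -> name, romanized stand-ins for the four rows of the user file
    static Map<String, String> sampleUsers() {
        Map<String, String> users = new HashMap<>();
        users.put("1", "zhangsan");
        users.put("2", "lisi");
        users.put("3", "wangwu");
        users.put("4", "zhaoliu");
        return users;
    }

    // sex id -> label, English stand-ins for the two rows of the sex file
    static Map<String, String> sampleSexes() {
        Map<String, String> sexes = new HashMap<>();
        sexes.put("0", "male");
        sexes.put("1", "female");
        return sexes;
    }

    // First two columns (uid, sex id) of the eleven login rows; the third column is not used by the join.
    static String[][] sampleLogin() {
        return new String[][] {
            {"1", "0"}, {"2", "0"}, {"3", "1"}, {"4", "1"},
            {"1", "0"}, {"2", "0"}, {"3", "1"}, {"4", "1"},
            {"1", "0"}, {"1", "0"}, {"9", "0"}
        };
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countLogins(sampleLogin(), sampleUsers(), sampleSexes());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```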

TIPS:

For more on joins in Hadoop MapReduce, see the earlier post:

A brief overview of several approaches to two-table joins in MapReduce

http://my.oschina.net/leejun2005/blog/95186
