Learning Big Data with A-jun (4): Implementing a Table Join with MapReduce

Original

2019-03-23 20:08

Preface

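Earlier posts used MapReduce for word counting, word deduplication, number sorting, and so on. In a database setting, though, how do we implement a table join?
MapReduce is more like an algorithm puzzle: how do we use just the two steps, Map and Reduce, to perform the join and get the data we need?
For example, suppose a table has two columns, child and parent, and you are asked to find every grandChild and grandParent pair in it.
With MySQL, a single SQL statement solves it: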

select a.child,b.parent 
from child_parent a, child_parent b
where a.parent=b.child
order by a.child desc
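To make the target concrete, here is a minimal plain-Java sketch of what that self-join computes, written as a nested loop over an in-memory list. The class name `NestedLoopSelfJoin`, the method `grandPairs`, and the sample names are illustrative assumptions, not part of the original post, and the `order by` clause is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedLoopSelfJoin {
    /** Each row is {child, parent}. Returns {grandChild, grandParent} pairs. */
    static List<String[]> grandPairs(List<String[]> rows) {
        List<String[]> result = new ArrayList<>();
        for (String[] a : rows) {            // alias a in the SQL
            for (String[] b : rows) {        // alias b in the SQL
                if (a[1].equals(b[0])) {     // where a.parent = b.child
                    result.add(new String[] {a[0], b[1]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the original post
        List<String[]> rows = Arrays.asList(
                new String[] {"Tom", "Lucy"},
                new String[] {"Lucy", "Mary"},
                new String[] {"Lucy", "Ben"});
        for (String[] pair : grandPairs(rows)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

With these sample rows the loop prints `Tom Mary` and `Tom Ben`. The nested loop compares every row with every other row, which is exactly the work MapReduce avoids by grouping records on the join key first.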

So how should the Map and Reduce functions be designed from a MapReduce point of view?

Design

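The parent column of the left table must be joined to the child column of the right table.
In Map, emit each record with parent as the key and child as the value; this is the left table.
Then emit the same child-parent pair again with child as the key and parent as the value; this is the right table.
Add a tag to every emitted value to mark whether it belongs to the left or the right table.
In Reduce, the value list received for each key then contains the grandchild and grandparent relationships.
Parse the values for each key: put the child of every left-table record into one array and the parent of every right-table record into another, then take the Cartesian product of the two with a nested loop (just like the Cartesian product in the SQL statement).
Because Reduce receives the list of values that share one key, this is equivalent to where a.parent=b.child in the SQL above.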
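The design above can be tried out without a cluster. The following is a rough in-memory sketch, assuming a `TreeMap` stands in for Hadoop's shuffle and sort; `JoinSimulation`, `mapAndShuffle`, `reduce`, and `join` are hypothetical names used only for illustration, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinSimulation {
    /** Map + shuffle: emit every row twice with a tag and group the values by key. */
    static Map<String, List<String>> mapAndShuffle(List<String[]> rows) {
        Map<String, List<String>> grouped = new TreeMap<>();  // sorted keys, like the shuffle
        for (String[] row : rows) {
            String child = row[0];
            String parent = row[1];
            // Left table: key = parent, tag "1"
            grouped.computeIfAbsent(parent, k -> new ArrayList<>()).add("1+" + child + "+" + parent);
            // Right table: key = child, tag "2"
            grouped.computeIfAbsent(child, k -> new ArrayList<>()).add("2+" + child + "+" + parent);
        }
        return grouped;
    }

    /** Reduce for one key: split the values by tag, then take the Cartesian product. */
    static List<String> reduce(List<String> values) {
        List<String> grandChildren = new ArrayList<>();
        List<String> grandParents = new ArrayList<>();
        for (String record : values) {
            String[] parts = record.split("\\+");
            if ("1".equals(parts[0])) {
                grandChildren.add(parts[1]);
            } else {
                grandParents.add(parts[2]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String gc : grandChildren) {
            for (String gp : grandParents) {
                out.add(gc + " " + gp);
            }
        }
        return out;
    }

    /** Runs the reduce step for every key and concatenates the results. */
    static List<String> join(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (List<String> values : mapAndShuffle(rows).values()) {
            out.addAll(reduce(values));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows
        List<String[]> rows = Arrays.asList(
                new String[] {"Tom", "Lucy"},
                new String[] {"Lucy", "Mary"},
                new String[] {"Lucy", "Ben"});
        join(rows).forEach(System.out::println);
    }
}
```

Only the key `Lucy` receives values from both sides (one left-table record and two right-table records), so it is the only key that produces output: `Tom Mary` and `Tom Ben`.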

Implementation

package com.anla.chapter3.innerjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.Iterator;

/**
 * @user anLA7856
 * @time 19-3-22 6:01 PM
 * @description Reduce-side self-join that derives grandChild/grandParent pairs from a child/parent table
 */
public class SimpleJoin {
    public static int time = 0;

    public static class Map extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Fields are separated by a single space: "<child> <parent>"
            String[] values = line.split(" ");
            if (values.length < 2) {
                return;   // skip blank or malformed lines
            }
            if (!"child".equals(values[0])) {
                // Not the header row ("child parent"), so emit the record twice
                String childName = values[0];
                String parentName = values[1];
                String relationType = "1";    // tag: left table, keyed by parent
                context.write(new Text(parentName), new Text(relationType + "+" + childName + "+" + parentName));
                relationType = "2";    // tag: right table, keyed by child
                context.write(new Text(childName), new Text(relationType + "+" + childName + "+" + parentName));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
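            // Write the header row once. The static counter lives in one JVM only,
            // so with several reduce tasks each output file may get its own header.
            if (time == 0) {
                context.write(new Text("grandChild"), new Text("grandParent"));
                time++;
            }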
            // Children from left-table records and parents from right-table records.
            // Lists replace the original fixed String[10] arrays, which overflow
            // as soon as one key has more than 10 matches on either side.
            java.util.List<String> grandChild = new java.util.ArrayList<>();
            java.util.List<String> grandParent = new java.util.ArrayList<>();
            Iterator<Text> iterator = values.iterator();
            while (iterator.hasNext()) {
                String record = iterator.next().toString();
                if (record.isEmpty()) {
                    continue;
                }
                // Record layout: "<tag>+<child>+<parent>"
                String[] fields = record.split("\\+");
                char relationType = record.charAt(0);
                if (relationType == '1') {
                    // Left table: the key is the parent, so the child is a grandchild candidate
                    grandChild.add(fields[1]);
                } else {
                    // Right table: the key is the child, so the parent is a grandparent candidate
                    grandParent.add(fields[2]);
                }
            }
            // Cartesian product of grandChild and grandParent
            for (String child : grandChild) {
                for (String parent : grandParent) {
                    context.write(new Text(child), new Text(parent));
                }
            }
        }

    }


    public static void main(String[] args) throws Exception{
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: SimpleJoin <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(configuration, "SimpleJoin");
        job.setJarByClass(SimpleJoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0:1);
    }
}
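The mapper and reducer above share an implicit record format, `<tag>+<child>+<parent>`. One possible way to keep that format in a single place is a small helper such as the sketch below. `TaggedRecord` and its methods are hypothetical and not part of the original program, and the sketch assumes that child names never contain a `+` character.

```java
public class TaggedRecord {
    static final char LEFT = '1';   // record emitted under the parent key
    static final char RIGHT = '2';  // record emitted under the child key

    final char tag;
    final String child;
    final String parent;

    TaggedRecord(char tag, String child, String parent) {
        this.tag = tag;
        this.child = child;
        this.parent = parent;
    }

    /** Builds the "<tag>+<child>+<parent>" string that the mapper writes as its value. */
    static String encode(char tag, String child, String parent) {
        return tag + "+" + child + "+" + parent;
    }

    /** Parses a value read by the reducer; rejects records that do not have three fields. */
    static TaggedRecord parse(String record) {
        // Limit 3 keeps any '+' inside the parent name; a '+' in the child name still breaks parsing.
        String[] parts = record.split("\\+", 3);
        if (parts.length != 3 || parts[0].length() != 1) {
            throw new IllegalArgumentException("Malformed record: " + record);
        }
        return new TaggedRecord(parts[0].charAt(0), parts[1], parts[2]);
    }

    public static void main(String[] args) {
        TaggedRecord r = parse(encode(LEFT, "Tom", "Lucy"));
        System.out.println(r.tag + " " + r.child + " " + r.parent);
    }
}
```

Splitting each record once and validating the field count also avoids an ArrayIndexOutOfBoundsException when a malformed value reaches the reducer.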

Run it the same way as in the earlier post: Learning Big Data with A-jun (2): Running Hadoop's WordCount Program Step by Step

This gives the result:

References:

Hadoop In Action
