Learning Big Data with A-jun (3): Sorting Data from Multiple Files with MapReduce

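First, a short digression.

Global Data in a MapReduce Job

How can global data be kept in a MapReduce job? Consider the following options:

Read and write an HDFS file, i.e. store the variable in one shared location.
Set a Job property, i.e. write the variable into the job's Configuration.
Use DistributedCache; note that DistributedCache is read-only.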

Sorting

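Recall the MapReduce flow first: Map takes the input records and emits intermediate key/value pairs; Reduce then processes those grouped pairs to compute the result.
Sorting is already built into the MapReduce process. By default, records are sorted by key: if the key is an IntWritable wrapping an int, keys are ordered numerically; if the key is a string, keys are ordered by ASCII code, that is, character by character.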
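The difference between the two default orderings is easy to see in plain Java. The sketch below is only an illustration and does not use Hadoop: `KeyOrderDemo` and its methods are hypothetical names, `Integer` stands in for IntWritable, and `String` stands in for a string key (for ASCII digits, Java's string comparison gives the same order as a byte-wise comparison).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class KeyOrderDemo {

    // Numeric ordering, as for int keys: compared by value.
    static List<Integer> sortAsInts(List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // String ordering: the same numbers compared character by character.
    static List<String> sortAsStrings(List<Integer> keys) {
        List<String> sorted = new ArrayList<>();
        for (int key : keys) {
            sorted.add(String.valueOf(key));
        }
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(9, 10, 2, 100);
        System.out.println(sortAsInts(keys));    // [2, 9, 10, 100]
        System.out.println(sortAsStrings(keys)); // [10, 100, 2, 9]
    }
}
```

This is why the numbers are parsed into integer keys before sorting: as strings, "10" would sort ahead of "2".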
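There is a catch, though: each Reducer only sorts the data sent to its own node. The default sort therefore does not guarantee a global order, because a partition step runs before sorting; order is guaranteed within each Reducer but not across Reducers.
To get ordering across nodes as well, we need a custom Partitioner class that guarantees that, after partitioning, the data on all Reducers is ordered as a whole.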
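To make the problem concrete, here is a small plain-Java simulation (no Hadoop involved) that routes keys to partitions, sorts each partition, and concatenates the results in partition order. `GlobalOrderDemo` and its methods are hypothetical names; `hashPartition` is modelled on hash-modulo partitioning, under the assumption that an int key hashes to its own value, and `rangePartition` assumes every key lies in `[0, maxValue]`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class GlobalOrderDemo {

    // Hash-style partitioning: hash code modulo the partition count.
    // Assumption: an int key's hash code is the value itself.
    static int hashPartition(int key, int numPartitions) {
        return (key & Integer.MAX_VALUE) % numPartitions;
    }

    // Range partitioning: partition i receives keys in [bound * i, bound * (i + 1)).
    // Assumes 0 <= key <= maxValue.
    static int rangePartition(int key, int maxValue, int numPartitions) {
        int bound = maxValue / numPartitions + 1;
        return key / bound;
    }

    // Routes keys to partitions, sorts each partition on its own, then
    // concatenates the partitions in index order.
    static List<Integer> simulate(List<Integer> keys, int numPartitions,
                                  int maxValue, boolean useRange) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            int p = useRange
                    ? rangePartition(key, maxValue, numPartitions)
                    : hashPartition(key, numPartitions);
            partitions.get(p).add(key);
        }
        List<Integer> output = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);
            output.addAll(partition);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(5, 1, 8, 2, 7, 4);
        System.out.println(simulate(keys, 2, 9, false)); // [2, 4, 8, 1, 5, 7]
        System.out.println(simulate(keys, 2, 9, true));  // [1, 2, 4, 5, 7, 8]
    }
}
```

With hash-style partitioning each partition is sorted internally but the concatenation is not; with range partitioning every key in partition 0 is smaller than every key in partition 1, so the concatenation is sorted too.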
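The code in this post follows this approach:

Convert each input value to an IntWritable and emit it as the key (the value can be anything).
Override the partitioner to guarantee overall order: divide the maximum input value a by the number of partitions b, and use the quotient c as the boundary step. The partition boundaries are then 1×c, 2×c, up to (b-1)×c, so the data is ordered as a whole once partitioning is done.
When Reduce receives <key, value-list>, it writes the input key as the output value once per element in the value-list (so duplicates are output multiple times). The output key is a global variable that tracks the rank of the current key (its position among all the numbers). Note that this counter lives inside a single Reducer task, so the rank is truly global only when one Reducer runs.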
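The boundary arithmetic and the rank counter from the steps above can be sketched in plain Java, without Hadoop. `RankDemo`, `boundaries`, and `assignRanks` are hypothetical names. The step is computed as `maxValue / numPartitions + 1`, matching the source listing rather than the bare quotient, and 65223 is the maximum value hard-coded in that listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RankDemo {

    // Partition boundaries 1*c, 2*c, ..., (n-1)*c for step c.
    static List<Integer> boundaries(int maxValue, int numPartitions) {
        int step = maxValue / numPartitions + 1;
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            result.add(step * i);
        }
        return result;
    }

    // Mimics a single reducer: keys arrive in sorted order, each with a
    // value-list whose size is the number of occurrences. Every occurrence
    // is emitted as a {rank, number} pair with a running rank.
    static List<int[]> assignRanks(List<Integer> numbers) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);
        }
        List<int[]> output = new ArrayList<>();
        int rank = 1;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                output.add(new int[] {rank++, entry.getKey()});
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(boundaries(65223, 3)); // [21742, 43484]
        for (int[] pair : assignRanks(Arrays.asList(5, 2, 2))) {
            System.out.println(pair[0] + "\t" + pair[1]); // 1 2, then 2 2, then 3 5
        }
    }
}
```

Duplicates keep distinct ranks here (2 appears at ranks 1 and 2), which mirrors writing the key once per element of the value-list.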

Sample input

file01:

file02:

file03:

Sample output:

Source code

Finally, here is the full source to make things easier to follow:

package com.anla.chapter2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * @user anLA7856
 * @time 19-3-22 4:38 PM
 * @description
 */
public class Sort {

    public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private static IntWritable data = new IntWritable();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
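            String line = value.toString().trim();    // one number per line
            if (line.isEmpty()) {
                return;    // skip blank lines instead of failing in parseInt
            }
            data.set(Integer.parseInt(line));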
            context.write(data, new IntWritable(1));
        }
    }

    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
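        // Running rank. This counter lives inside one reducer task, so ranks
        // restart at 1 in each output file when more than one reducer runs.
        private static IntWritable lineNum = new IntWritable(1);

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable val : values){
                context.write(lineNum, key);      // output key is the rank, output value is the number
                lineNum.set(lineNum.get() + 1);   // advance the rank
            }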
        }
    }

    public static class Partition extends Partitioner<IntWritable, IntWritable>{

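        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int maxNumber = 65223;                       // assumed upper bound of the input values
            int bound = maxNumber / numPartitions + 1;   // width of each partition's key range
            int keyNumber = key.get();
            for (int i = 0; i < numPartitions; i++) {
                // partition i receives keys in [bound * i, bound * (i + 1))
                if (keyNumber >= bound * i && keyNumber < bound * (i + 1)) {
                    return i;
                }
            }
            // keys outside [0, maxNumber]: clamp so a valid partition index is always returned
            return keyNumber < 0 ? 0 : numPartitions - 1;
        }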
    }

    public static void main(String[] args) throws Exception{
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.out.println("Usage: Sort <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(configuration, "sort");
        job.setJarByClass(Sort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setPartitionerClass(Partition.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0: 1);
    }
}

For how to run it, see the author's previous post:
Learning Big Data with A-jun (2): Running Hadoop's WordCount Program Step by Step

References:

Hadoop in Action
