Learning Big Data with A-jun (2): Running Hadoop's WordCount Program Step by Step

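The previous article covered installing Hadoop and its basic configuration. I installed it in pseudo-distributed mode, that is, on a single machine that acts as both the master and the cluster.
This article shows how to run the classic WordCount program.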

Source code

The example source code is as follows:

package com.anla.chapter1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;


/**
 * @user anLA7856
 * @time 19-3-21 10:30 PM
 * @description
 */
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /**
         * Reads one line of input and emits a <word, 1> pair for every token in it.
         * @param key
         * @param value
         * @param context
         * @throws IOException
         * @throws InterruptedException
         */
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());   // value holds one line of input; split it on whitespace
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }


    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        /**
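         * The framework collects the map output values that share the same key (the word)
         * and hands them to reduce() as <word, list of counts>. reduce() adds the counts
         * together, which gives the number of occurrences of the word. The resulting
         * <key, value> pair is then written to HDFS through TextOutputFormat.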
         * @param key
         * @param values
         * @param context
         * @throws IOException
         * @throws InterruptedException
         */
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
//        String file0 = "/input";
//        String file1 = "/output";
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // create the job
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);    // set the mapper class
        job.setCombinerClass(IntSumReducer.class);    // set the combiner (reuses the reducer for local aggregation)
        job.setReducerClass(IntSumReducer.class);     // set the reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path: first argument
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output path: second argument
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

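Running

How to run

First make sure the input directory exists in HDFS; you can check with hadoop fs -ls /
If it does not exist, create the HDFS directory and put the example files file01 and file02 into it: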

hadoop fs -mkdir /input
hadoop fs -put file0* /input

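Compile the WordCount class. Note that the class is declared in a package, so compile with -d . to have the class files written into the matching com/anla/chapter1 directory. (With JDK 8, this command requires HADOOP_CLASSPATH to include ${JAVA_HOME}/lib/tools.jar.)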

hadoop com.sun.tools.javac.Main -d . WordCount.java

Package the class files, including those of the nested classes, into a single jar:

jar cf wc.jar com/anla/chapter1/WordCount*.class

Finally, run the jar:

hadoop jar wc.jar com.anla.chapter1.WordCount /input /output

All done. HDFS now contains a new /output directory; run cat on the result file to see the output:

hadoop fs -cat /output/part-r-00000

Below is a brief walkthrough of this example program.

Analysis

InputFormat

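Before Map runs, the framework asks the job's InputFormat to divide the input into splits and to create a RecordReader for each split. The RecordReader then turns the split into the <key,value> pairs that Map can process, i.e. <k1,v1>.
In the older org.apache.hadoop.mapred API this is done through getRecordReader, createKey and createValue; in the org.apache.hadoop.mapreduce API used by this example, the corresponding methods are createRecordReader, getCurrentKey and getCurrentValue.
In short, InputFormat exists to produce the <key,value> pairs that Map consumes.
Hadoop ships with many predefined input formats that convert different kinds of input data into <key,value> pairs that Map can process.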

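Take TextInputFormat, the default input format and a subclass of FileInputFormat: each line becomes one record, and each record is expressed as a <key,value> pair.

The key is the byte offset at which the line starts in the file; its type is LongWritable.
The value is the content of the line; its type is Text.
So the data is passed to Map in the following form: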
file01
0 hello world bye world
file02
0 hello hadoop bye hadoop
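
To make the <offset, line> form concrete, here is a minimal sketch in plain Java with no Hadoop dependency. LineRecords and its read method are hypothetical names made up for this illustration, and the two-line input is an assumed file (file01 and file02 each hold a single line, so their only key is 0). It only mimics the shape of the records that reach map(); it is not how Hadoop's own record reader is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: produces <byte offset, line> records
// in the same shape that a line-oriented input format hands to map().
class LineRecords {

    // Splits the bytes on '\n' and pairs each line with the offset of its first byte.
    static List<Map.Entry<Long, String>> read(byte[] data) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // A line ends at a newline, or at end of data if bytes remain unterminated.
            boolean lineEnds = atEnd ? start < i : data[i] == '\n';
            if (lineEnds) {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                records.add(new SimpleImmutableEntry<>(Long.valueOf(start), line));
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed two-line file, used only to show a non-zero offset.
        byte[] file = "hello world bye world\nhello hadoop bye hadoop\n"
                .getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> record : read(file)) {
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

The first line starts at offset 0. It is 21 bytes long plus one newline byte, so the second line starts at offset 22.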

OutputFormat

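Every input format has a corresponding output format. The default output format is TextOutputFormat, which writes each record as one line of a text file. Its keys and values
can be of any type, because toString is called on them when they are written. The final output looks like this: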
bye 2
hadoop 2
hello 2
world 2
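
The way each record ends up as one text line can be sketched in a few lines of plain Java. TextLines and format are hypothetical names for this illustration only. The tab between key and value is TextOutputFormat's default separator; the rest is a simplification, not Hadoop's actual writer.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: formats records the way the text
// above describes, by calling toString() on the key and on the value.
class TextLines {

    // String concatenation calls toString() on both objects, so any type works.
    static String format(Object key, Object value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        // The final counts from this article, kept sorted by key.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("world", 2);
        counts.put("hello", 2);
        counts.put("hadoop", 2);
        counts.put("bye", 2);
        counts.forEach((word, count) -> System.out.println(format(word, count)));
    }
}
```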

Map

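Mapper is a generic class (in the org.apache.hadoop.mapreduce API it is a class, not an interface) that takes four type parameters. In order, they specify:

the input key type
the input value type
the output key type
the output value type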
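
The role of the four type parameters can be sketched without Hadoop. In the block below, SimpleMapper is a hypothetical stand-in for Mapper that keeps the same four type parameters, with Long, String and Integer playing the parts of LongWritable, Text and IntWritable. It is only an illustration of the shape of the API, under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.function.BiConsumer;

class MapperSketch {

    // Hypothetical stand-in for Hadoop's Mapper: input key, input value,
    // output key and output value types, in that order.
    interface SimpleMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> out);
    }

    // Mirrors TokenizerMapper: takes <offset, line> and emits <word, 1> per token.
    static final SimpleMapper<Long, String, String, Integer> TOKENIZER =
            (offset, line, out) -> {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    out.accept(itr.nextToken(), 1);
                }
            };

    // Runs the mapper on one line and collects what it emits, in order.
    static List<String> run(String line) {
        List<String> emitted = new ArrayList<>();
        TOKENIZER.map(0L, line, (word, count) -> emitted.add("<" + word + ", " + count + ">"));
        return emitted;
    }

    public static void main(String[] args) {
        // Prints [<hello, 1>, <world, 1>, <bye, 1>, <world, 1>]
        System.out.println(run("hello world bye world"));
    }
}
```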

Reduce

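Reducer likewise takes four generic type parameters.
The reduce() method takes the output of map() as its input, so the Reducer's input types are <Text, IntWritable>. The output of reduce() is a word and its count, so the output types are also <Text, IntWritable>.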
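
Because the input and output types of IntSumReducer are the same and addition can be applied to partial sums, the same class can also serve as the combiner. The sketch below shows this in plain Java; SumReduceSketch and sum are hypothetical names, and Integer stands in for IntWritable.

```java
import java.util.Arrays;
import java.util.List;

class SumReduceSketch {

    // Same logic as IntSumReducer.reduce: add up every value received for one key.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // "world" occurs twice in file01, so the mapper emits two 1s for it.
        List<Integer> rawValues = Arrays.asList(1, 1);
        // With a combiner, the two 1s are pre-summed on the map side.
        List<Integer> combinedValues = Arrays.asList(sum(rawValues));

        // Both paths give the reducer the same final count for "world".
        System.out.println(sum(rawValues));
        System.out.println(sum(combinedValues));
    }
}
```

Both calls print 2, which is why enabling the combiner changes how much data is shuffled but not the result.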

Code walkthrough

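The Mapper processes one line per call. With the file01 and file02 contents shown above, the first map (file01) emits:

< hello, 1>
< world, 1>
< bye, 1>
< world, 1>

The second map (file02) emits:

< hello, 1>
< hadoop, 1>
< bye, 1>
< hadoop, 1>

job.setCombinerClass(IntSumReducer.class); aggregates the output of each map locally. Since the combiner is IntSumReducer, the counts for each key are added up, and the output of each map is sorted by key, so the two map outputs become:

< bye, 1>
< hello, 1>
< world, 2>

and

< bye, 1>
< hadoop, 2>
< hello, 1>

Finally, the Reducer merges the output of the two maps by adding the counts for each key, so the final result of the job is:

< bye, 2>
< hadoop, 2>
< hello, 2>
< world, 2>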
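
The whole flow (map, combine, group by key, reduce) can be simulated locally as a rough sketch. The block below is plain Java with no Hadoop dependency, and WordCountLocal with its methods is a hypothetical name for this illustration. It uses the file01 and file02 contents listed in the InputFormat section, and a TreeMap stands in for the sort-by-key that the framework performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountLocal {

    // Map + combine for one input: tokenize the line and pre-sum the counts locally.
    static Map<String, Integer> mapAndCombine(String line) {
        Map<String, Integer> partial = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            partial.merge(itr.nextToken(), 1, Integer::sum);
        }
        return partial;
    }

    // Shuffle + reduce: group the partial counts by word, then sum each group.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) ->
                result.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    // Treats every argument as the single line of one input file.
    static Map<String, Integer> run(String... lines) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String line : lines) {
            partials.add(mapAndCombine(line));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        // Prints {bye=2, hadoop=2, hello=2, world=2}
        System.out.println(run("hello world bye world", "hello hadoop bye hadoop"));
    }
}
```

The printed counts agree with the part-r-00000 output shown in the OutputFormat section.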
