A MapReduce distributed computation consists of two phases: Map and Reduce. A complete MapReduce program involves three kinds of processes at run time:
- MrAppMaster: schedules the whole program and coordinates its state;
- MapTask: handles the entire data-processing flow of the Map phase;
- ReduceTask: handles the entire data-processing flow of the Reduce phase.
Within the two phases: the concurrent MapTask instances of the first phase run fully in parallel and independently of one another; the concurrent ReduceTask instances of the second phase are likewise independent of each other, but their input depends on the output of all the MapTask instances of the first phase. Because the Reduce phase only starts after the Map phase has finished, large amounts of intermediate data must be staged temporarily (buffered in memory and spilled to local disk) until the ReduceTasks fetch it.
Example: the following walks through a MapReduce program that counts how often each word occurs in a file.
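Before looking at the code, it may help to trace how the key/value pairs flow through the two phases for this word-count job (a conceptual sketch; the numeric keys are the byte offsets of each line in the input file):
```
(0,  "zhang xue you")   --map-->    ("zhang",1) ("xue",1) ("you",1)
(14, "xie ting feng")   --map-->    ("xie",1) ("ting",1) ("feng",1)
...
shuffle groups by key:              ("zhang",[1,1,1,1]), ("feng",[1,1,1]), ...
("zhang",[1,1,1,1])     --reduce--> ("zhang",4)
```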
1. Mapper phase
- A user-defined Mapper must extend the Mapper class provided by Hadoop;
- The Mapper's input comes as KV (key/value) pairs, and the KV types can be customized;
- The Mapper's processing logic is written in the map method;
- The Mapper's output is also KV pairs, and the KV types can be customized;
- The MapTask process calls map once for each KV pair. For example, with the input types <LongWritable, Text>, the LongWritable key is the byte offset of a line within the file and the Text value is that line's content, so MapTask invokes map once per line.
```java
package com.lzj.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * LongWritable - byte offset of the line being read
 * Text         - content of the line
 * Text         - output key
 * IntWritable  - output value (count for the key)
 * */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();                // output key
    IntWritable v = new IntWritable(1); // output value (always 1)

    /* the context receives the output */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /* 1. get the content of the current line */
        String line = value.toString();
        /* 2. split the line into words on spaces */
        String[] words = line.split(" ");
        /* 3. emit the mapper's output */
        for (String word : words) {
            /* set the key */
            k.set(word);
            /* write the processed key/value pair to the context */
            context.write(k, v);
        }
    }
}
```
2. Reducer phase
- A user-defined Reducer must extend the Reducer base class provided by Hadoop;
- The Reducer's input matches the Mapper's output, i.e. KV pairs;
- The Reducer's processing logic is written in the reduce method;
- The ReduceTask calls reduce once for each group of KV pairs that share the same key (in this job, for instance, the group ("zhang", [1, 1, 1, 1])).
```java
package com.lzj.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * Text        - input key (the key emitted by the Mapper)
 * IntWritable - input value (the count emitted by the Mapper)
 * Text        - output key
 * IntWritable - output value
 * */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /* text is the input key; values are the counts grouped under it */
    @Override
    protected void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        /* 1. add up the counts for this key */
        int sum = 0;
        for (IntWritable value : values) {
            sum = sum + value.get();
        }
        /* 2. emit the reducer's output */
        IntWritable v = new IntWritable();
        v.set(sum);
        context.write(text, v);
    }
}
```
3. Driver phase
The Driver is effectively the client of the YARN cluster: it submits the whole program to YARN in the form of a Job object that encapsulates the run-time parameters of the MapReduce program.
```java
package com.lzj.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        /* 1. get the job configuration */
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        /* 2. set the jar load path */
        job.setJarByClass(WordCountDriver.class);
        /* 3. set the Mapper and Reducer classes */
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        /* 4. set the map output types */
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        /* 5. set the final output key/value types */
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        /* 6. set the input and output paths */
        FileInputFormat.setInputPaths(job, new Path("D:/tmp/WordCountIn.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:/tmp/WordCountOut"));
        /* 7. submit the job */
        boolean flag = job.waitForCompletion(true);
        System.out.println("flag : " + flag);
    }
}
```
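One practical caveat: Hadoop refuses to start the job if the output directory already exists (it fails with a FileAlreadyExistsException). A minimal sketch of clearing the directory before submission, using the standard FileSystem API (this is our addition, not part of the original driver; it assumes the same conf object and would go before step 6):
```java
import org.apache.hadoop.fs.FileSystem;

// ...inside main(), before FileOutputFormat.setOutputPath(...):
FileSystem fs = FileSystem.get(conf);
Path out = new Path("D:/tmp/WordCountOut");
if (fs.exists(out)) {
    fs.delete(out, true); // true = delete recursively
}
```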
For a local test, the WordCountIn.txt file contains:
```
zhang xue you
xie ting feng
zhang xin zhe
yu cheng qing
xiong nai jin
bao jian feng
zhang jie
zhang san feng
```
After the program runs, open part-r-00000 in the WordCountOut directory; its content is:
```
bao 1
cheng 1
feng 3
jian 1
jie 1
jin 1
nai 1
qing 1
san 1
ting 1
xie 1
xin 1
xiong 1
xue 1
you 1
yu 1
zhang 4
zhe 1
```
4. Running on the cluster
Next, let's package the code above into a jar and run it on a Hadoop cluster. First, change the input and output paths in the WordCountDriver class into program arguments, which makes testing easier:
```java
/* 6. set the input and output paths */
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```
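Since the paths now come from the command line, it is sensible to guard against missing arguments. A small sketch (our addition, not in the original code) at the top of main:
```java
if (args.length < 2) {
    System.err.println("Usage: WordCountDriver <input path> <output path>");
    System.exit(2);
}
```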
Build the jar with Maven; the pom dependencies are as follows:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.lzj</groupId>
    <artifactId>hdfs</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.lzj.hadoop.wordcount.WordCountDriver</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
Run mvn install, rename the generated jar to wordcount.jar, and copy it to the cluster; then place the test input file WordCountIn.txt into the /user/lzj directory on HDFS.
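These steps might look roughly as follows (a sketch; the jar name assumes the assembly plugin's default <artifactId>-<version>-jar-with-dependencies naming):
```shell
mvn clean install
# rename the assembled jar
mv target/hdfs-0.0.1-SNAPSHOT-jar-with-dependencies.jar wordcount.jar
# upload the test input to HDFS
hdfs dfs -put WordCountIn.txt /user/lzj
```
With the jar built and the input uploaded, submit the job on the Hadoop cluster: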
```shell
hadoop jar wordcount.jar com.lzj.hadoop.wordcount.WordCountDriver /user/lzj/WordCountIn.txt /user/lzj/output
```
After a successful run, the results are written to the /user/lzj/output directory on HDFS.
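The result file can be printed straight from HDFS, for example:
```shell
hdfs dfs -cat /user/lzj/output/part-r-00000
```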