How the Hadoop MapReduce InputFormat Works

How Hadoop MapReduce Works

Six stages:

  • Input (file input)
  • Splitting
  • Mapping
  • Shuffling
  • Reducing
  • Final result

The mapper's input is in key-value (KV) pair form; map() is called once for each KV pair, and the output is also KV pairs.

The mapper obtains its input from the context and writes the processed result back to the context (context.write(text, iw);). The input types (LongWritable, Text) and output types (Text, IntWritable) are configured by the user.

The context obtains input data through a RecordReader and saves the mapper's output through a RecordWriter.
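
The drivers later in this article reference MyWordCountMapper and MyWordCountReducer without listing them. As a point of reference, a minimal sketch of such a mapper might look like the following (the class and field names are assumptions taken from how the drivers use them; the matching reducer would simply sum the counts, like the KvReducer shown further below):

package com.blu.mywordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private Text text = new Text();
	private IntWritable iw = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//split the line on whitespace and emit (word, 1) for every word
		for (String word : value.toString().split("\\s+")) {
			if (!word.isEmpty()) {
				text.set(word);
				context.write(text, iw);
			}
		}
	}
}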


InputFormat is responsible for handling MapReduce input

InputFormat is an abstract class with the following subclasses:

  • ComposableInputFormat
  • CompositeInputFormat
  • DBInputFormat
  • DelegatingInputFormat
  • FileInputFormat

InputFormat has three methods:

  • InputFormat(): constructor
  • createRecordReader(): supplies a RecordReader implementation that reads records from a split and feeds them to the Mapper
  • getSplits(): splits the input files into InputSplits
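
For reference, the two abstract methods have roughly these signatures in org.apache.hadoop.mapreduce.InputFormat (simplified sketch; method bodies and Javadoc omitted):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

	//logically split the job's input files into InputSplits
	public abstract List<InputSplit> getSplits(JobContext context)
			throws IOException, InterruptedException;

	//create the RecordReader that turns one split into (key, value) records for the Mapper
	public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException;
}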

FileInputFormat, a subclass of InputFormat, is itself abstract and has the following subclasses:

  • CombineFileInputFormat
  • FixedLengthInputFormat
  • KeyValueTextInputFormat
  • NLineInputFormat
  • SequenceFileInputFormat
  • TextInputFormat


TextInputFormat

TextInputFormat is the default InputFormat in MapReduce; it reads records line by line.
Key (LongWritable): the byte offset at which the line starts within the file.
Value (Text): the content of the line.
TextInputFormat reuses the getSplits() method of its parent class (FileInputFormat) to split the input.
Splitting: each file is split individually, and the default split size is 128 MB (the HDFS block size).
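
For example, for a small file containing the two lines below, TextInputFormat produces one record per line, with the key being the starting byte offset of the line. Assuming one-byte \n line endings, "good morning" plus its newline is 13 bytes, so the second line starts at offset 13:

good morning     ->  key = 0,  value = "good morning"
good afternoon   ->  key = 13, value = "good afternoon"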



NLineInputFormat

Splitting: every N lines of a file form one split; by default N is 1 (one line per split).
Key type: LongWritable
Value type: Text

Example: with 12 lines of input and 3 lines per split, the input is divided into 4 splits.

Modify the Hadoop_WordCount word-count project:

  1. Modify MyWordCount.java:
package com.blu.mywordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			//number of lines per split
			NLineInputFormat.setNumLinesPerSplit(job, 3);
			//specify the InputFormat class
			job.setInputFormatClass(NLineInputFormat.class);
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
			
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
  2. Write the following 12 lines of data into testdata.txt under D:\data:
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
  3. Run the main method of MyWordCount with the following arguments:
D:\data\testdata.txt D:\data\output
  4. Output:
afternoon	2
evening	2
female	2
good	6
lisi	2
male	4
morning	2
wangwu	2
zhangsan	2
  5. The console log shows 4 splits:
[INFO ] 2020-04-26 17:12:41,643 method:org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:204)
number of splits:4
  6. The key code changes:
//number of lines per split
NLineInputFormat.setNumLinesPerSplit(job, 3);
//specify the InputFormat class
job.setInputFormatClass(NLineInputFormat.class);


KeyValueTextInputFormat
Key type: Text (the data before the separator becomes the key)
Value type: Text (the data after the separator becomes the value)
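
The separator defaults to a Tab character and can be changed through the job configuration, as the driver below does. A minimal sketch (the space character used here is only an illustration):

//use a single space instead of the default Tab as the key/value separator
conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");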

Example: use KeyValueTextInputFormat to count how many times each name appears in the following txt file.

D:\data\money.txt (note: the name at the start of each line is separated from the rest of the data by a Tab)

zhangsan	500 450 jan
lisi	200 150 jan
lilei	150 160 jan
zhangsan	500 500 feb
lisi	200 150 feb
lilei	150 160 feb
  1. Create the Kvmapper class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Output format:
 * zhangsan 1
 * lisi 1
 * zhangsan 1
 * 
 * @author BLU
 *
 */
public class Kvmapper extends Mapper<Text, Text, Text, IntWritable>{
	
	/**
	 * Input format:
	 * zhangsan 500 450 jan
	 * key:zhangsan
	 * value:500 450 jan
	 */

	private IntWritable iw = new IntWritable(1);
	
	@Override
	protected void map(Text key, Text value, Mapper<Text, Text, Text, IntWritable>.Context context)
			throws IOException, InterruptedException {
		
		context.write(key, iw);
	}
}
  2. The KvReducer class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KvReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	
	IntWritable iw = new IntWritable();
	
	@Override
	protected void reduce(Text key, Iterable<IntWritable> value,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		
		int sum = 0;
		for(IntWritable iw : value) {
			sum += iw.get();
		}
		iw.set(sum);
		context.write(key, iw);
	}

}
  3. The KeyValueDemo class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class KeyValueDemo {

	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		//use Tab as the key/value separator (set on the conf before the Job is created from it)
		conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
		Job job = Job.getInstance(conf);
		job.setInputFormatClass(KeyValueTextInputFormat.class);
		job.setJarByClass(KeyValueDemo.class);
		job.setMapperClass(com.blu.kvdemo.Kvmapper.class);
		job.setReducerClass(com.blu.kvdemo.KvReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean flag = job.waitForCompletion(true);
		System.exit(flag?0:1);
	}
}
  4. Run the main method of KeyValueDemo with the following arguments:
D:\data\money.txt D:\data\output
  5. Output:
lilei	2
lisi	2
zhangsan	2


CombineTextInputFormat

TextInputFormat creates splits per file, so a large number of small files produces a large number of MapTasks and very low processing efficiency. CombineTextInputFormat can merge small files into a single split for processing.
CombineTextInputFormat's splitting mechanism:

  1. Virtual storage phase

Suppose there are the following four files:

 a.txt	1.7M
 b.txt	5.1M
 c.txt	3.4M
 d.txt	6.8M

Suppose setMaxInputSplitSize is set to 4 MB.
Each file is compared in turn against the setMaxInputSplitSize value of 4 MB. If a file is smaller than 4 MB, it logically forms one block. If it is larger than 4 MB but smaller than 8 MB, it is split evenly into two blocks. If it is larger than 8 MB, a 4 MB block is cut off first and the remainder is compared again.
The resulting blocks are:

Block 1:	1.7M
Block 2:	2.55M
Block 3:	2.55M
Block 4:	3.4M
Block 5:	3.4M
Block 6:	3.4M
  2. Splitting phase
    Check whether each virtual block is at least as large as setMaxInputSplitSize (4 MB). If it is, it becomes a split on its own; if it is smaller, it is merged with the following block into one split (a small sketch of this logic follows the table below).
    This finally yields 3 splits:
Split 1:	1.7M + 2.55M
Split 2:	2.55M + 3.4M
Split 3:	3.4M + 3.4M
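
To make the two phases concrete, the following is a small, self-contained sketch (not Hadoop's actual implementation) that reproduces the virtual-storage and splitting decisions described above for the four example files:

import java.util.ArrayList;
import java.util.List;

public class CombineSplitSketch {

	public static void main(String[] args) {
		double maxSplitSize = 4.0;                  //setMaxInputSplitSize, in MB
		double[] files = {1.7, 5.1, 3.4, 6.8};      //a.txt, b.txt, c.txt, d.txt

		//phase 1: virtual storage - cut each file into blocks of at most maxSplitSize
		List<Double> blocks = new ArrayList<>();
		for (double size : files) {
			double remaining = size;
			while (remaining > 2 * maxSplitSize) {  //larger than 8 MB: cut off a 4 MB block and compare the rest again
				blocks.add(maxSplitSize);
				remaining -= maxSplitSize;
			}
			if (remaining > maxSplitSize) {         //between 4 MB and 8 MB: split evenly into two blocks
				blocks.add(remaining / 2);
				blocks.add(remaining / 2);
			} else {                                //4 MB or less: one block
				blocks.add(remaining);
			}
		}
		System.out.println("blocks = " + blocks);   //the six blocks: 1.7, 2.55, 2.55, 3.4, 3.4, 3.4

		//phase 2: splitting - accumulate blocks until at least maxSplitSize is reached, then emit a split
		List<Double> splits = new ArrayList<>();
		double current = 0;
		for (double block : blocks) {
			current += block;
			if (current >= maxSplitSize) {
				splits.add(current);
				current = 0;
			}
		}
		if (current > 0) {
			splits.add(current);                    //any leftover blocks form the last split
		}
		System.out.println(splits.size() + " splits: " + splits);  //3 splits: 1.7+2.55, 2.55+3.4, 3.4+3.4
	}
}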

CombineTextInputFormat demo:

  1. Create 6 small files under D:\data.
  2. Modify MyWordCount in the Hadoop_WordCount word-count project:
package com.blu.mywordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
  3. Run the main method with the following arguments:
D:\data\ D:\data\output
  4. The console shows 6 input files and 6 splits (the default TextInputFormat splitting behavior):
Total input files to process : 6
number of splits:6
  5. Modify the MyWordCount class again:
package com.blu.mywordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			//use CombineTextInputFormat as the input format class
			job.setInputFormatClass(CombineTextInputFormat.class);
			//set the maximum virtual split size to 1 MB
			CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
			
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
}

Key code:

//use CombineTextInputFormat as the input format class
job.setInputFormatClass(CombineTextInputFormat.class);
//set the maximum virtual split size to 1 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 1024*1024);
  6. Running again, there are 6 input files but only 1 split:
Total input files to process : 6
number of splits:1


Custom InputFormat

Steps:
  • Define a class that extends FileInputFormat
  • Provide a custom RecordReader implementation

Example: filter out a specified word so that it is not counted.

Modify the Hadoop_WordCount word-count project:

  1. The MyInputFormat class:
package com.blu.mywordcount.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyInputFormat extends FileInputFormat<LongWritable, Text>{

	@Override
	public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
		return new myRecordReader(context.getConfiguration());
	}
	
}
  2. The myRecordReader class:
package com.blu.mywordcount.inputformat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class myRecordReader extends RecordReader<LongWritable, Text> {
	
	public static String CUSTOM_KEYWORD="mapreduce.input.myRecordReader.line.keyword";
	private LineRecordReader lineRecordReader;
	//the word to filter out
	private String keyword;
	private LongWritable key;
	private Text value;
	
	public myRecordReader() {
		super();
	}

	public myRecordReader(Configuration conf) {
		lineRecordReader = new LineRecordReader();
		keyword = conf.get(CUSTOM_KEYWORD);
	}
	
	/**
	 * Initialization method
	 */
	@Override
	public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
		lineRecordReader.initialize(split, context);
	}
	
	/**
	 * Core logic.
	 * Returning true means keep reading subsequent records;
	 * returning false means stop reading.
	 */

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		//check whether there is more data; stop reading if there is none
		if(!lineRecordReader.nextKeyValue()) {
			return false;
		}
		//read the current line
		Text currentValue = lineRecordReader.getCurrentValue();
		//check whether this line contains the word to be filtered out; if so, remove it
		String val = currentValue.toString();
		if(keyword != null) {
			if(val.contains(keyword)) {
				val = val.replace(keyword+" ", "");
				currentValue.set(val);
			}
		}
		key = lineRecordReader.getCurrentKey();
		value = currentValue;
		return true;
	}

	/**
	 * Returns the key of the current record
	 */
	@Override
	public LongWritable getCurrentKey() throws IOException, InterruptedException {
		return key;
	}

	/**
	 * Returns the value of the current record
	 */
	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return value;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		//delegate progress reporting to the wrapped LineRecordReader
		return lineRecordReader.getProgress();
	}

	@Override
	public void close() throws IOException {
		//close the wrapped LineRecordReader
		lineRecordReader.close();
	}
}
  3. Modify the MyWordCount class:
package com.blu.mywordcount;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.blu.mywordcount.inputformat.MyInputFormat;
import com.blu.mywordcount.inputformat.myRecordReader;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			//set the word to filter out
			conf.set(myRecordReader.CUSTOM_KEYWORD, "zhangsan");
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			//use the custom InputFormat class
			job.setInputFormatClass(MyInputFormat.class);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
			
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

  4. Contents of D:\data\testdata.txt:
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
  5. Run with the following arguments:
D:\data\testdata.txt D:\data\output
  6. Output (the filtered word zhangsan no longer appears):
afternoon	2
evening	2
female	2
good	6
lisi	2
male	4
morning	2
wangwu	2