10. Hadoop Split Mechanism


When a very large file is stored on HDFS, it is stored as multiple blocks spread across different nodes. For example, with the default block size of 128M, a 512M file is stored as 4 blocks on 4 nodes in the cluster.
How many MapTasks does Hadoop use in the map phase to process this 512M file? MapTask parallelism is determined by data splits: the input file is sliced logically, and Hadoop launches one MapTask per split to process the file in parallel.
Block vs. Split: a block is how HDFS physically divides the data into chunks, while a split only divides the input logically and does not physically cut the file on disk. For the 512M file above, the default 128M block size means the file needs 4 blocks for physical storage; if it is sliced with a 100M split size, it is logically divided into 6 splits (five 100M splits plus a final 12M split), so 6 MapTasks are needed to process it.
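As a quick check of these numbers, the block count and the logical split count can be computed directly. A tiny illustrative sketch (sizes in MB; SPLIT_SLOP is the 1.1 slack factor used by FileInputFormat, explained in the source walkthrough below):

public class SplitCountSketch {
	public static void main(String[] args) {
		final double SPLIT_SLOP = 1.1;                           // same slack factor as FileInputFormat
		long fileSize = 512, blockSize = 128, splitSize = 100;   // MB

		long blocks = (fileSize + blockSize - 1) / blockSize;    // 4 physical blocks
		int splits = 0;
		long remaining = fileSize;
		while ((double) remaining / splitSize > SPLIT_SLOP) {    // full-size splits
			splits++;
			remaining -= splitSize;
		}
		if (remaining != 0) splits++;                            // the final, smaller split
		System.out.println(blocks + " blocks, " + splits + " splits"); // prints: 4 blocks, 6 splits
	}
}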

I. Data Split Source Code Walkthrough

  /** 
   * Generate the list of files and make them into FileSplits.
   * @param job the job context
   * @throws IOException
   */
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    /*
     * 1. minSize defaults to 1;
     *    maxSize defaults to Long.MAX_VALUE (9,223,372,036,854,775,807)
     */
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    /*
     * 2. List all input files to be processed
     */
    List<FileStatus> files = listStatus(job);
    for (FileStatus file: files) {
      Path path = file.getPath();
      /*
       * 3. Get the file length
       */
      long length = file.getLen();
      if (length != 0) {
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          /*
           * 4. Get the file's block locations. For example, a 500M file with the
           *    default 128M block size is stored as 4 blocks spread across DataNodes.
           */
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          /*
           * 5. With the default HDFS file system, listStatus returns LocatedFileStatus,
           *    so normally only the if branch above is taken.
           */
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(job, path)) {
          /*
           * 6. Get the block size, 128M by default
           */
          long blockSize = file.getBlockSize();
          /*
           * 7. Compute the split size: take the smaller of blockSize and maxSize
           *    (i.e. blockSize), then the larger of that and minSize, which by
           *    default is still blockSize. So the default split size is 128M.
           */
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          long bytesRemaining = length;
          /*
           * 8. Slice the file logically into splits and add them to the list.
           *    The loop continues while the remaining bytes exceed
           *    SPLIT_SLOP (1.1) times the split size.
           */
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            bytesRemaining -= splitSize;
          }

          /*
           * 9. Whatever remains (at most 1.1 x splitSize) becomes the last split.
           */
          if (bytesRemaining != 0) {
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          /*
           * 10. Non-splittable files (e.g. certain compressed formats) become a single split.
           */
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else { 
        //Create empty hosts array for zero length file
        /*
         * 11. A zero-length file gets an empty split.
         */
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }
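The split size therefore reduces to computeSplitSize, which clamps the block size between the configured minimum and maximum split sizes. A minimal standalone sketch of that calculation, using the standard Hadoop property names read by getMinSplitSize/getMaxSplitSize:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeSketch {
	public static void main(String[] args) {
		Configuration conf = new Configuration();
		// The two job-level knobs read by getMinSplitSize/getMaxSplitSize
		long minSize = Math.max(1L,
				conf.getLong("mapreduce.input.fileinputformat.split.minsize", 1L));
		long maxSize =
				conf.getLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE);
		long blockSize = 128L * 1024 * 1024;   // dfs.blocksize, 128M by default

		// computeSplitSize(blockSize, minSize, maxSize) in FileInputFormat:
		long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
		System.out.println(splitSize);         // 134217728 (128M) with the defaults
	}
}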

II. Data Split Mechanisms

1. TextInputFormat Split Mechanism

Splitting: TextInputFormat is the default input format and plans splits file by file. For example, with a 128M split size, a 200M file produces two splits, one of 128M and one of 72M, and two MapTasks are started to process them. A file of only 1M still gets its own MapTask, so 10 such small files start 10 MapTasks.
Reading: TextInputFormat reads the file one record per line. The key is the byte offset of the line within the file, of type LongWritable; the value is the content of the line excluding any line terminators (newline/carriage return), of type Text, roughly the counterpart of Java's String.
For example, given the following input:

Birds of a feather flock together
Barking dogs seldom bite
Bad news has wings

When TextInputFormat reads this file line by line, the corresponding key/value pairs are:

(0,Birds of a feather flock together)
(34,Barking dogs seldom bite)
(59,Bad news has wings)

Demo: the test below uses word count as the example, processing the 4 files under the D:\tmp\word\in directory.
Create the Mapper class WordCountMapper:

package com.lzj.hadoop.input;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * LongWritable - the byte offset of the line being read
 * Text         - the content of that line
 * Text         - the output key (a word)
 * IntWritable  - the output value (the count for that key)
 * */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
	
	Text k = new Text();
	IntWritable v = new IntWritable(1);
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//1. Read one line of content
		String line = value.toString();
		if(line.isEmpty()) {
			return;
		}
		//2. Split the line into words on spaces
		String[] words = line.split(" ");
		//3. Emit the mapper output
		for(String word : words) {
			/*set the key*/
			k.set(word); 
			/*write the mapper key/value pair to the context*/
			context.write(k, v);
		}
		
	}
	
}

Create the corresponding Reducer class:

package com.lzj.hadoop.input;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * Text         - input key (the key output by the Mapper)
 * IntWritable  - input value (the count output by the Mapper)
 * Text         - output key
 * IntWritable  - output value
 * */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	@Override
	protected void reduce(Text text, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		//1. Sum the counts for this key
		int sum = 0;
		for(IntWritable value : values) {
			sum = sum + value.get();
		}
		//2. Write the reducer output
		IntWritable v = new IntWritable(sum);
		context.write(text, v);
	}
}

Create the driver class WordCountDriver:

/*Test TextInputFormat*/
public void testTextInputFormat() throws IOException, ClassNotFoundException, InterruptedException{
	//1. Get the job configuration
	Configuration conf = new Configuration();
	Job job = Job.getInstance(conf);
	//2. Set the jar load path
	job.setJarByClass(WordCountDriver.class);
	//3. Set the Mapper and Reducer classes
	job.setMapperClass(WordCountMapper.class);
	job.setReducerClass(WordCountReducer.class);
	//4. Set the map output types
	job.setMapOutputKeyClass(Text.class);
	job.setMapOutputValueClass(IntWritable.class);
	//5. Set the final output key/value types
	job.setOutputKeyClass(Text.class);
	job.setOutputValueClass(IntWritable.class);
	//6. Set the input and output paths
	FileInputFormat.setInputPaths(job, new Path("D:/tmp/word/in"));
	FileOutputFormat.setOutputPath(job, new Path("D:/tmp/word/out"));
	//7. Submit the job
	boolean flag = job.waitForCompletion(true);
	System.out.println("flag: " + flag);
}

Run the test; the log output will contain a line like:

number of splits:4

The 4 input files 1.txt, 2.txt, 3.txt and 4.txt are each smaller than 128M, so each file becomes one split.
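If you need more (or fewer) MapTasks for a splittable file, the minimum/maximum split sizes that feed into computeSplitSize can be tuned per job through FileInputFormat. A minimal sketch added to the driver above before submitting the job (the 64M value is only an illustrative choice):

	//optional: shrink the maximum split size so that a large file yields more splits
	FileInputFormat.setMinInputSplitSize(job, 1);                  // lower bound, in bytes
	FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64M upper bound, in bytes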

2. CombineTextInputFormat Split Mechanism

If the job input contains many small files, the default TextInputFormat mechanism starts one MapTask per file, which wastes resources. CombineTextInputFormat is designed for scenarios with many small files: it can logically place multiple small files into a single split. CombineTextInputFormat forms splits in two phases: a virtual storage phase and a slicing phase.

(1) Virtual storage phase
Each input file's size is compared, in order, against the configured setMaxInputSplitSize value. If the file is not larger than the maximum, it becomes one virtual block. If the file is larger than the maximum and more than twice the maximum, a block of the maximum size is carved off; once the remaining data is still larger than the maximum but not more than twice it, the remainder is split evenly into 2 virtual blocks (to avoid producing splits that are too small).
For example, with setMaxInputSplitSize set to 4M and an 8.02M input file, a 4M block is carved off first. The remaining 4.02M would leave a tiny 0.02M virtual block if another 4M were carved off, so the 4.02M remainder is instead split into two 2.01M blocks.
(2) Slicing phase
Each virtual block is checked against setMaxInputSplitSize: if it is not smaller than that value, it forms a split on its own;
otherwise it is merged with the following virtual block(s) to form a split together.

Take the files 1.txt (576K), 2.txt (1151K), 3.txt (2302K) and 4.txt (4604K) under D:\tmp\word\in as an example, with setMaxInputSplitSize set to 2M (2048K). 1.txt (576K) is smaller than 2M, so it forms one virtual block; 2.txt (1151K) is also smaller than 2M and forms one virtual block; 3.txt (2302K) is larger than 2M but smaller than 4M, so it is split evenly into two 1151K blocks; 4.txt (4604K) is larger than 4M, so a 2M (2048K) block is carved off first, leaving 4604 - 2048 = 2556K, which is larger than 2M but smaller than 4M and is therefore split evenly into two 1278K blocks. The virtual storage phase thus produces 7 blocks:

576K, 1151K, (1151K, 1151K), (2048K, 1278K, 1278K)

In the slicing phase, the first 3 blocks add up to 576K + 1151K + 1151K = 2878K > 2M and form one split;
the 4th and 5th blocks add up to 1151K + 2048K = 3199K > 2M and form a second split; the last two blocks add up to 1278K + 1278K = 2556K > 2M and form a third split. In the end the 4 files yield 3 splits, so 3 MapTasks are started to process them.
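To make the two phases concrete, the following minimal sketch (plain Java, not Hadoop source) simulates the virtual-storage and slicing rules described above for these four file sizes; it prints the 7 virtual blocks and the 3 resulting split sizes:

import java.util.ArrayList;
import java.util.List;

public class CombineSplitSimulation {
	public static void main(String[] args) {
		long maxSize = 2048;                        // setMaxInputSplitSize, in KB (2M)
		long[] fileSizes = {576, 1151, 2302, 4604}; // 1.txt .. 4.txt, in KB

		// Phase 1: virtual storage - break each file into virtual blocks
		List<Long> blocks = new ArrayList<>();
		for (long remaining : fileSizes) {
			while (remaining > 2 * maxSize) {       // more than 2x max: carve off maxSize
				blocks.add(maxSize);
				remaining -= maxSize;
			}
			if (remaining > maxSize) {              // between 1x and 2x max: split in half
				blocks.add(remaining / 2);
				blocks.add(remaining - remaining / 2);
			} else if (remaining > 0) {             // not larger than max: one block
				blocks.add(remaining);
			}
		}
		System.out.println("virtual blocks: " + blocks); // [576, 1151, 1151, 1151, 2048, 1278, 1278]

		// Phase 2: slicing - merge consecutive blocks until the sum reaches maxSize
		List<Long> splits = new ArrayList<>();
		long acc = 0;
		for (long b : blocks) {
			acc += b;
			if (acc >= maxSize) {
				splits.add(acc);
				acc = 0;
			}
		}
		if (acc > 0) {                              // leftover smaller than maxSize
			splits.add(acc);
		}
		System.out.println("splits: " + splits);    // [2878, 3199, 2556]
	}
}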

Demo: the files under the same D:\tmp\word\in directory are used for this test.
WordCountMapper and WordCountReducer are the same as in the previous example; the driver is:

/*Test CombineTextInputFormat*/
public void testCombineTextInputFormat() throws IOException, ClassNotFoundException, InterruptedException {
	//1. Get the job configuration
	Configuration conf = new Configuration();
	Job job = Job.getInstance(conf);
	//2. Set the jar load path
	job.setJarByClass(WordCountDriver.class);
	//3. Set the Mapper and Reducer classes
	job.setMapperClass(WordCountMapper.class);
	job.setReducerClass(WordCountReducer.class);
	//4. Set the map output types
	job.setMapOutputKeyClass(Text.class);
	job.setMapOutputValueClass(IntWritable.class);
	//5. Set the final output key/value types
	job.setOutputKeyClass(Text.class);
	job.setOutputValueClass(IntWritable.class);
	//6. Set the input and output paths
	FileInputFormat.setInputPaths(job, new Path("D:\\tmp\\word\\in"));
	FileOutputFormat.setOutputPath(job, new Path("D:\\tmp\\word\\out"));
	//7. Set the input format (split strategy)
	job.setInputFormatClass(CombineTextInputFormat.class);
	CombineTextInputFormat.setMaxInputSplitSize(job, 2097152); //2M
	//8. Submit the job
	boolean flag = job.waitForCompletion(true);
	System.out.println("flag: " + flag);
}

Run the test; the log output will contain:

number of splits:3

3. KeyValueTextInputFormat Split Mechanism

KeyValueTextInputFormat is similar to TextInputFormat: it reads records line by line and forms splits the same way. The difference is that after reading a line it divides the line into a key/value pair at a configurable separator (the text before the first separator becomes the key; the separator defaults to the tab character). For example, given the following input, with "-" as the separator:

A-this is a
B-this is b
C-this is c
C-this is c

KeyValueTextInputFormat splits each line at the separator, so the mapper receives the key/value pairs:

(A,this is a)
(B,this is b)
(C,this is c)
(C,this is c)

The example below counts how many lines start with the same key.
The Mapper class:

package com.lzj.hadoop.input;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Text         - the input key (the part of the line before the separator)
 * Text         - the input value (the rest of the line)
 * Text         - the output key
 * LongWritable - the output value (the count for that key)
 * */
public class WordCountMapper extends Mapper<Text, Text, Text, LongWritable>{
	
	LongWritable v = new LongWritable(1);
	
	@Override
	protected void map(Text key, Text value, Context context)
			throws IOException, InterruptedException {
		//1. Skip lines whose value part is empty
		String line = value.toString();
		if(line.isEmpty()) {
			return;
		}
		//2. Emit the key (the text before the separator) with a count of 1
		context.write(key, v);
		
	}
	
}

The Reducer class:


package com.lzj.hadoop.input;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * Text         - input key (the key output by the Mapper)
 * LongWritable - input value (the count output by the Mapper)
 * Text         - output key
 * LongWritable - output value
 * */
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
	@Override
	protected void reduce(Text text, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
		//1. Sum the counts for this key
		long sum = 0;
		for(LongWritable value : values) {
			sum = sum + value.get();
		}
		//2. Write the reducer output
		LongWritable v = new LongWritable(sum);
		context.write(text, v);
	}
}

The driver class:

/*Test KeyValueTextInputFormat*/
public static void testKeyValueTextInputFormat() throws IOException, ClassNotFoundException, InterruptedException {
	//1. Get the job configuration
	Configuration conf = new Configuration();
	//Set the key/value separator before the Job copies the configuration
	conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "-");
	Job job = Job.getInstance(conf);
	//2. Set the jar load path
	job.setJarByClass(WordCountDriver.class);
	//3. Set the Mapper and Reducer classes
	job.setMapperClass(WordCountMapper.class);
	job.setReducerClass(WordCountReducer.class);
	//4. Set the map output types
	job.setMapOutputKeyClass(Text.class);
	job.setMapOutputValueClass(LongWritable.class);
	//5. Set the final output key/value types
	job.setOutputKeyClass(Text.class);
	job.setOutputValueClass(LongWritable.class);
	//6. Set the input and output paths
	FileInputFormat.setInputPaths(job, new Path("D:\\tmp\\word\\in1/1.txt"));
	FileOutputFormat.setOutputPath(job, new Path("D:\\tmp\\word\\out6"));
	//7. Set the input format (split strategy)
	job.setInputFormatClass(KeyValueTextInputFormat.class);
	//8. Submit the job
	boolean flag = job.waitForCompletion(true);
	System.out.println("flag: " + flag);
}

Run the test; the log reports 1 split.

4. NLineInputFormat Split Mechanism

NLineInputFormat splits the input by a fixed number of lines. If a file has n lines in total and the split size is N lines, then the number of splits is n/N when n is evenly divisible by N, and n/N + 1 (integer division) otherwise. Take the following test file as an example:

There is no royal road to learning
It is never too old to learn
A man becomes learned by asking questions
Absence makes the heart grow fonder
When the cat is away, the mice will play
No cross, no crown
Ill news travels fast
He that climbs high falls heavily
From saving comes having
Experience is the mother of wisdom
East or west, home is best
Don't teach your grandmother to suck eggs
Don't trouble trouble until trouble troubles you
Doing is better than saying 
Birds of a feather flock together
Barking dogs seldom bite
Bad news has wings
As the tree, so the fruit
An idle youth, a needy age

The file has 19 lines. With the split size set to 5 lines, every 5 lines form a split, giving 19/5 + 1 = 4 splits. The mapper reads the file the same way as with TextInputFormat: one record per line, with the key being the byte offset of the line in the file and the value being the line content. For example:

(0,There is no royal road to learning)
(35,It is never too old to learn)
(64,A man becomes learned by asking questions)
	……

The word-count example on this test file is as follows.
Create the Mapper class:

package com.lzj.hadoop.input.nline;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NLineInputFormatMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	Text k = new Text();
	LongWritable v = new LongWritable(1);
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//1. Get one line of content
		String line = value.toString();
		//2. Split the line on spaces
		String[] words = line.split(" ");
		//3. Emit each word
		for(String word : words) {
			k.set(word);
			context.write(k, v);
		}
		
	}
}

Create the Reducer class:

package com.lzj.hadoop.input.nline;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NLineInputFormatReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	LongWritable v = new LongWritable();
	
	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
		long sum = 0;
		for(LongWritable value : values) {
			sum = sum + value.get();
		}
		v.set(sum);
		context.write(key, v);
	}
}

Create the Driver class:

package com.lzj.hadoop.input.nline;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineInputFormatDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		//1. Get the job configuration
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		//2. Set the jar load path
		job.setJarByClass(NLineInputFormatDriver.class);
		//3. Set the Mapper and Reducer classes
		job.setMapperClass(NLineInputFormatMapper.class);
		job.setReducerClass(NLineInputFormatReducer.class);
		//4. Set the map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);
		//5. Set the final output key/value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		//6. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path("D:\\tmp\\word\\in2"));
		FileOutputFormat.setOutputPath(job, new Path("D:\\tmp\\word\\out7"));
		//7. Set the input format and the number of lines per split
		job.setInputFormatClass(NLineInputFormat.class);
		NLineInputFormat.setNumLinesPerSplit(job, 5);
		//8. Submit the job
		boolean flag = job.waitForCompletion(true);
		System.out.println("flag: " + flag);
	}

}

Run the driver; the log reports the number of splits:

number of splits:4

5. Custom InputFormat Split Mechanism

In addition to the built-in mechanisms above, you can implement a custom split mechanism for specialized needs. This requires a custom RecordReader that reads the file and a custom InputFormat that defines how the input is split; the rest of the development is the same as before: create the Mapper, Reducer and driver classes.
The example below merges 3 small files into one large file.
First, the custom RecordReader class:

package com.lzj.hadoop.input.custom;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class CustomRecordReader extends RecordReader<Text, BytesWritable>{

	private FileSplit split;
	private Configuration conf;
	private Text key = new Text();
	private BytesWritable value = new BytesWritable();
	private Boolean isProgress = true;
	
	@Override
	public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
		this.split = (FileSplit) split;
		conf = context.getConfiguration();
	}

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		if(isProgress) {
			FSDataInputStream inputStream = null;
			try {
				/*1. Get the file system*/
				Path path = split.getPath();
				FileSystem fs = path.getFileSystem(conf);
				/*2. Open an input stream for the file*/
				inputStream = fs.open(path);
				/*3. Read the entire file content*/
				byte[] buf = new byte[(int) split.getLength()];
				IOUtils.readFully(inputStream, buf, 0, buf.length);
				/*4. Set the file content as the output value*/
				value.set(buf, 0, buf.length);
				/*5. Set the file name as the output key*/
				String fileName = split.getPath().toString();
				key.set(fileName);
			} catch (Exception e) {
				e.printStackTrace();
			}finally {
				/*6. Close the input stream*/
				IOUtils.closeStream(inputStream);
			}
			isProgress = false;
			return true;
		}
		return false;
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return key;
	}

	@Override
	public BytesWritable getCurrentValue() throws IOException, InterruptedException {
		return value;
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		return 0;
	}

	@Override
	public void close() throws IOException {
		// nothing to close here; the stream is already closed in nextKeyValue()
		
	}

}

Next, the custom FileInputFormat:

package com.lzj.hadoop.input.custom;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CustomFileInputFormat extends FileInputFormat<Text, BytesWritable>{

	@Override
	protected boolean isSplitable(JobContext context, Path filename) {
		return false;
	}
	
	@Override
	public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
		CustomRecordReader recorder = new CustomRecordReader();
		recorder.initialize(split, context);
		return recorder;
	}

}

Then create the Mapper class:

package com.lzj.hadoop.input.custom;

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CstomMapper extends Mapper<Text, BytesWritable, Text, BytesWritable>{
	@Override
	protected void map(Text key, BytesWritable value, Mapper<Text, BytesWritable, Text, BytesWritable>.Context context)
			throws IOException, InterruptedException {
		//use the file name as the key and the file content as the value
		context.write(key, value);
	}
}

Next, create the Reducer class:

package com.lzj.hadoop.input.custom;

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomReducer extends Reducer<Text, BytesWritable, Text, BytesWritable>{

	@Override
	protected void reduce(Text key, Iterable<BytesWritable> values,
			Reducer<Text, BytesWritable, Text, BytesWritable>.Context context) throws IOException, InterruptedException {
		/*write the key (the file name) and the value (the file content) into one output file*/
		context.write(key, values.iterator().next());
	}
}

Finally, create the driver class:

package com.lzj.hadoop.input.custom;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CustomDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		//1. Get the job configuration
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		//2. Set the jar load path
		job.setJarByClass(CustomDriver.class);
		//3. Set the Mapper and Reducer classes
		job.setMapperClass(CstomMapper.class);
		job.setReducerClass(CustomReducer.class);
		//4. Set the map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(BytesWritable.class);
		//5. Set the final output key/value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(BytesWritable.class);
		//6. Set the input format
		job.setInputFormatClass(CustomFileInputFormat.class);
		//7. Set the output format
		job.setOutputFormatClass(SequenceFileOutputFormat.class);
		//8. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path("D:/tmp/word/in3"));
		FileOutputFormat.setOutputPath(job, new Path("D:/tmp/word/out7"));
		//9. Submit the job
		boolean flag = job.waitForCompletion(true);
		System.out.println("flag: " + flag);
	}
}

Run the driver; a part-r-00000 file is generated under the out7 directory. Opening it shows that the names and contents of 1.txt, 2.txt and 3.txt under the in3 directory have all been written into this single file; from then on you can read that one file and look up each small file's content by its key (the file name).
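To read one of the merged small files back out later, you can iterate over the SequenceFile's key/value pairs. A minimal sketch, assuming the part-r-00000 output file produced above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadSketch {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Path path = new Path("D:/tmp/word/out7/part-r-00000");
		try (SequenceFile.Reader reader =
				new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
			Text key = new Text();                     // original file name
			BytesWritable value = new BytesWritable(); // original file content
			while (reader.next(key, value)) {
				System.out.println(key + " -> " + value.getLength() + " bytes");
			}
		}
	}
}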
