1.InputFormat
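Common InputFormat subclasses include:
TextInputFormat (plain text files; the default implementation the MR framework uses to read input)
KeyValueTextInputFormat (reads one line of text at a time and splits it on a configurable separator into a key/value pair)
NLineInputFormat (creates input splits by number of lines)
CombineTextInputFormat (combines small files into fewer splits, to avoid launching too many MapTasks)
Custom InputFormat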
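To make the KeyValueTextInputFormat entry concrete, here is a minimal plain-Java sketch of dividing a line at the first occurrence of the separator. It does not use Hadoop: the class `KeyValueLineSketch` and the method `splitLine` are hypothetical names chosen for illustration. It assumes a tab separator (Hadoop's default) and that a line without a separator becomes the key with an empty value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyValueLineSketch {

    /**
     * Splits a line at the FIRST occurrence of the separator.
     * Everything before it is the key, everything after it is the value.
     * If the separator is absent, the whole line is the key and the value is empty.
     */
    public static Map.Entry<String, String> splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // Hypothetical sample line; later tabs stay inside the value.
        Map.Entry<String, String> kv = splitLine("user1\tclicked\thome", '\t');
        System.out.println("key=" + kv.getKey());
        System.out.println("value=" + kv.getValue());
    }
}
```

Only the first separator matters in this sketch, so any further separators remain part of the value.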
2.CombineTextInputFormat
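// If no InputFormat is set, the job defaults to TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum virtual-storage split size to 4 MB (4 * 1024 * 1024 = 4194304 bytes)
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
When splits are formed, the size of each virtual-storage block is compared with the setMaxInputSplitSize value: a block that is greater than or equal to that value forms a split on its own.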
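The rule above can be sketched in plain Java, without Hadoop. This follows the simplified two-phase description commonly given for CombineTextInputFormat and should be read as an assumption, not as Hadoop's actual code, which also groups blocks by node and rack. The class `CombineSplitSketch` and its methods are hypothetical names. Phase 1 (assumed) cuts each file into virtual-storage blocks; phase 2 applies the rule from the notes and, as a further assumption, merges smaller blocks with the ones that follow until the threshold is reached.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSplitSketch {

    /** Phase 1 (assumed model): cut each file into virtual-storage blocks. */
    public static List<Long> virtualBlocks(long[] fileSizes, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        for (long size : fileSizes) {
            long remaining = size;
            // More than twice the maximum: cut off one block of maxSize at a time.
            while (remaining > 2 * maxSize) {
                blocks.add(maxSize);
                remaining -= maxSize;
            }
            if (remaining > maxSize) {
                // Between maxSize and 2 * maxSize: split into two roughly equal halves.
                long half = remaining / 2;
                blocks.add(half);
                blocks.add(remaining - half);
            } else if (remaining > 0) {
                blocks.add(remaining);
            }
        }
        return blocks;
    }

    /** Phase 2: accumulate blocks; emit a split once the size reaches maxSize. */
    public static List<Long> splits(List<Long> blocks, long maxSize) {
        List<Long> result = new ArrayList<>();
        long current = 0;
        for (long block : blocks) {
            current += block;
            if (current >= maxSize) {
                result.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            result.add(current); // leftover blocks form the last split
        }
        return result;
    }

    public static void main(String[] args) {
        long maxSize = 4194304L; // 4 MB, as in the setting above
        long[] files = {1_000_000L, 5_000_000L, 3_000_000L}; // hypothetical file sizes in bytes
        List<Long> blocks = virtualBlocks(files, maxSize);
        System.out.println("virtual blocks: " + blocks);
        System.out.println("splits: " + splits(blocks, maxSize));
    }
}
```

In this model the number of MapTasks equals the number of splits, so raising the maximum split size yields fewer and larger splits.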
3.Custom InputFormat and RecordReader
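import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CustomFileInputformat extends FileInputFormat<Text, BytesWritable> {
    /**
     * Files are not splittable: each file is read whole, as a single split.
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    /**
     * Creates the RecordReader object that reads the contents of a split.
     *
     * @param inputSplit         the split to read
     * @param taskAttemptContext the task context
     * @return a CustomRecordReader for the split
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        CustomRecordReader recordReader = new CustomRecordReader();
        recordReader.initialize(inputSplit, taskAttemptContext);
        return recordReader;
    }
}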
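import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class CustomRecordReader extends RecordReader<Text, BytesWritable> {
    private Configuration configuration;
    /**
     * The split being read
     */
    private FileSplit split;
    /**
     * The output key and value
     */
    private Text k = new Text();
    private BytesWritable value = new BytesWritable();
    /**
     * Flag: true while the file has not been read yet
     */
    private boolean flag = true;

    /**
     * Initialization: stores the split and the configuration in fields
     * so that the other methods can use them.
     *
     * @param inputSplit         the split to read
     * @param taskAttemptContext the task context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Keep the file split and the configuration object
        this.split = (FileSplit) inputSplit;
        configuration = taskAttemptContext.getConfiguration();
    }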
    /**
     * Reads the data. The whole file is returned as a single key/value pair.
     *
     * @return true on the first call (one record was read), false afterwards
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (flag) {
            // 1 Allocate a buffer for the data that is read
            //   (the int cast limits this reader to files smaller than 2 GB)
            byte[] contents = new byte[(int) split.getLength()];
            // 2 Get the file system
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(configuration);
            // 3 Open the file; try-with-resources closes the stream even if the read fails
            try (FSDataInputStream fis = fs.open(path)) {
                // 4 Read the file contents
                IOUtils.readFully(fis, contents, 0, contents.length);
            }
            // 5 Set the output value to the file contents
            value.set(contents, 0, contents.length);
            // 6 Get the file path and name
            String name = path.toString();
            // 7 Set the output key
            k.set(name);
            flag = false;
            return true;
        }
        return false;
    }
    /**
     * Returns the current key.
     *
     * @return the file path and name
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return k;
    }

    /**
     * Returns the current value.
     *
     * @return the file contents
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }
    /**
     * Returns the progress.
     *
     * @return 0 before the file has been read, 1 afterwards
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return flag ? 0f : 1f;
    }
    /**
     * Closes the reader. Nothing to do here: the stream is closed in nextKeyValue().
     *
     * @throws IOException
     */
    @Override
    public void close() throws IOException {
    }
}
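The read-once pattern used by the reader above (a flag that lets nextKeyValue() succeed exactly once per file) can be tried out without a cluster. The following is a minimal sketch using only the JDK: `WholeFileReaderSketch` is a hypothetical stand-in that mirrors the method names of RecordReader but does not extend it, and it reads a local file through java.nio instead of a Hadoop FileSystem.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileReaderSketch {
    private final Path path;
    private boolean unread = true; // plays the role of "flag" in the reader above
    private String key;
    private byte[] value;

    public WholeFileReaderSketch(Path path) {
        this.path = path;
    }

    /** Returns true exactly once: the whole file becomes one key/value pair. */
    public boolean nextKeyValue() throws IOException {
        if (!unread) {
            return false;
        }
        value = Files.readAllBytes(path); // value = file contents
        key = path.toString();            // key = file path and name
        unread = false;
        return true;
    }

    public String getCurrentKey() {
        return key;
    }

    public byte[] getCurrentValue() {
        return value;
    }

    public float getProgress() {
        return unread ? 0f : 1f;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temporary file, created only for this demonstration.
        Path tmp = Files.createTempFile("whole-file", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
        // The loop body runs once, the way a framework would drive a reader.
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " holds "
                    + reader.getCurrentValue().length + " bytes");
        }
        Files.delete(tmp);
    }
}
```

Because each file yields a single record, a mapper fed by such a reader sees one call per file, which is what makes the pattern useful for packing many small files into one container file.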
4.Custom OutputFormat and RecordWriter
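import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Get the file system object
        FileSystem fs = FileSystem.get(taskAttemptContext.getConfiguration());
        // Files that receive the output (hard-coded local paths in this example)
        Path lagouPath = new Path("e:/lagou.log");
        Path otherLog = new Path("e:/other.log");
        // Open the output streams
        final FSDataOutputStream lagouOut = fs.create(lagouPath);
        final FSDataOutputStream otherOut = fs.create(otherLog);
        return new CustomWriter(lagouOut, otherOut);
    }
}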
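import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomWriter extends RecordWriter<Text, NullWritable> {
    private FSDataOutputStream lagouOut;
    private FSDataOutputStream otherOut;

    public CustomWriter(FSDataOutputStream lagouOut, FSDataOutputStream otherOut) {
        this.lagouOut = lagouOut;
        this.otherOut = otherOut;
    }

    @Override
    public void write(Text key, NullWritable nullWritable) throws IOException, InterruptedException {
        // Route the record to a different file depending on whether it contains "lagou".
        // An explicit charset avoids depending on the platform default encoding.
        if (key.toString().contains("lagou")) {
            lagouOut.write(key.toString().getBytes(StandardCharsets.UTF_8));
            lagouOut.write("\r\n".getBytes(StandardCharsets.UTF_8));
        } else {
            otherOut.write(key.toString().getBytes(StandardCharsets.UTF_8));
            otherOut.write("\r\n".getBytes(StandardCharsets.UTF_8));
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(lagouOut);
        IOUtils.closeStream(otherOut);
    }
}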
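The routing logic of the writer can be checked in isolation before running a job. The sketch below is a plain-Java approximation, not the Hadoop class: `RoutingWriterSketch` is a hypothetical name, it takes java.io.OutputStream instead of FSDataOutputStream, and it accepts a String where the real writer receives a Text key.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class RoutingWriterSketch {
    private final OutputStream lagouOut;
    private final OutputStream otherOut;

    public RoutingWriterSketch(OutputStream lagouOut, OutputStream otherOut) {
        this.lagouOut = lagouOut;
        this.otherOut = otherOut;
    }

    /** Writes the line, followed by CRLF, to the stream chosen by its content. */
    public void write(String line) throws IOException {
        OutputStream target = line.contains("lagou") ? lagouOut : otherOut;
        target.write(line.getBytes(StandardCharsets.UTF_8));
        target.write("\r\n".getBytes(StandardCharsets.UTF_8));
    }

    public void close() throws IOException {
        lagouOut.close();
        otherOut.close();
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the two output files.
        ByteArrayOutputStream lagou = new ByteArrayOutputStream();
        ByteArrayOutputStream other = new ByteArrayOutputStream();
        RoutingWriterSketch writer = new RoutingWriterSketch(lagou, other);
        // Hypothetical sample records.
        writer.write("http://www.lagou.com");
        writer.write("http://www.example.com");
        writer.close();
        System.out.print("lagou stream: " + new String(lagou.toByteArray(), StandardCharsets.UTF_8));
        System.out.print("other stream: " + new String(other.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Keeping the routing decision in one small method makes it easy to test with in-memory streams and to swap the condition later without touching the stream handling.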