I. Overview:
Data compression is an optimization strategy for MapReduce: the mapper output and/or the reducer output is compressed with a compression codec to reduce disk and network IO and speed up the MR job, at the cost of extra CPU work for compressing and decompressing.
II. Basic principles:
For CPU-bound jobs, use less compression.
For IO-bound jobs, use more compression.
Notes:
1. MapReduce can compress either the map output (the intermediate data) or the final reduce output, reducing shuffle network IO and the size of the data written out.
2. Used well, compression improves performance; used poorly, it can just as easily hurt it.
III. Compression codecs supported by MR
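Hadoop ships with codec implementations for the common formats (classes live in org.apache.hadoop.io.compress unless noted otherwise):
DEFLATE: DefaultCodec, not splittable
gzip: GzipCodec, not splittable
bzip2: BZip2Codec, splittable
Snappy: SnappyCodec, not splittable
LZO: LzopCodec from the external hadoop-lzo library, splittable once an index has been built
Whether a codec is splittable decides whether a large compressed input file can be processed by more than one mapper, which matters again when reading compressed input (see 3 below).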
IV. Configuring data compression in MR
1. Reducer output compression
a. In the configuration file (the values below are the defaults; set compress=true to turn compression on):
mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type=RECORD
(compress.type only applies to SequenceFile output and can be NONE, RECORD, or BLOCK; BLOCK usually compresses better.)
b. In code:
Job job = Job.getInstance(conf);
FileOutputFormat.setCompressOutput(job, true);
// pick the codec class explicitly, e.g. gzip; any CompressionCodec implementation works
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
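If the job writes SequenceFile output, the granularity behind compress.type above can also be set in code. A minimal sketch, assuming the job object from the snippet above and the standard org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat and org.apache.hadoop.io.SequenceFile classes:

job.setOutputFormatClass(SequenceFileOutputFormat.class);
// compress whole blocks of records rather than one record at a time
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);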
2. Mapper output compression
a. In the configuration file (the values below are the defaults; set compress=true to turn compression on):
mapreduce.map.output.compress=false
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
b. In code:
// set on the Configuration before calling Job.getInstance(conf), since the Job takes a copy of it
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);
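Putting the two together, a minimal end-to-end driver sketch; the class name, paths, and codec choices here are illustrative, not from the original post. It compresses the intermediate map output with DefaultCodec and gzips the final output of an identity map/reduce pass:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDemoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // compress the intermediate map output (DefaultCodec avoids depending on native Snappy libs)
        conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
        conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, DefaultCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");
        job.setJarByClass(CompressionDemoDriver.class);
        job.setMapperClass(Mapper.class);     // identity mapper, just for illustration
        job.setReducerClass(Reducer.class);   // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // gzip the final reducer output
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compressing the intermediate output mainly saves shuffle network traffic, while compressing the final output saves storage, so the two settings are independent and can be mixed freely.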
3. Reading compressed input
Hadoop's built-in InputFormat classes read compressed files transparently. For example, TextInputFormat hands out records through a LineRecordReader, whose initialize method detects and handles compression:
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // open the file and seek to the start of the split
  final FileSystem fs = file.getFileSystem(job);
  fileIn = fs.open(file);

  // look up the codec for this file based on its filename suffix
  CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
  if (null != codec) {
    isCompressedInput = true;
    decompressor = CodecPool.getDecompressor(codec);
    // check whether the codec supports splitting
    if (codec instanceof SplittableCompressionCodec) {
      final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec) codec).createInputStream(
              fileIn, decompressor, start, end,
              SplittableCompressionCodec.READ_MODE.BYBLOCK);
      // splittable codec: read this split's compressed range through a CompressedSplitLineReader
      in = new CompressedSplitLineReader(cIn, job, this.recordDelimiterBytes);
      start = cIn.getAdjustedStart();
      end = cIn.getAdjustedEnd();
      filePosition = cIn;
    } else {
      // non-splittable codec: decompress the whole file stream and read it with a plain SplitLineReader
      in = new SplitLineReader(codec.createInputStream(fileIn, decompressor),
          job, this.recordDelimiterBytes);
      filePosition = fileIn;
    }
  } else {
    fileIn.seek(start);
    // not compressed: seek to the split start and read with a plain SplitLineReader
    in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
    filePosition = fileIn;
  }
  // ... remainder of initialize omitted (non-initial splits skip their first partial line, then pos is set)
}
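The practical consequence of this branching is what matters when laying out input data: a gzip file is readable but handled by a single mapper, while a bzip2 file can be split across mappers. A small sketch (CodecProbe is a made-up helper name) that applies the same CompressionCodecFactory lookup used in initialize and reports which path a given file name would take:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecProbe {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String name : args) {
            // same suffix-based lookup that LineRecordReader.initialize performs
            CompressionCodec codec = factory.getCodec(new Path(name));
            if (codec == null) {
                System.out.println(name + ": not compressed, read with a plain SplitLineReader");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(name + ": " + codec.getClass().getSimpleName()
                        + ", splittable (CompressedSplitLineReader path)");
            } else {
                System.out.println(name + ": " + codec.getClass().getSimpleName()
                        + ", not splittable, one mapper reads the whole file");
            }
        }
    }
}

Run over file.txt, file.gz, and file.bz2, it would report the plain, non-splittable gzip, and splittable bzip2 paths respectively.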