Below is an example that compresses a file with the gzip codec. It compresses the file /user/hadoop/aa.txt and writes the result to /user/hadoop/text.gz.
package com.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecTest {

    // Compress /user/hadoop/aa.txt into /user/hadoop/text.gz using the given codec class.
    public static void compress(String codecClassName) throws Exception {
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        FSDataInputStream in = null;
        CompressionOutputStream out = null;
        try {
            // Path of the file to be compressed
            in = fs.open(new Path("/user/hadoop/aa.txt"));
            // Path of the compressed file, wrapped in a compressing output stream
            out = codec.createOutputStream(fs.create(new Path("/user/hadoop/text.gz")));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            // Close the streams even if the copy fails
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }

    // Decompress the gzip file at fileName and print its contents to the console.
    public static void uncompress(String fileName) throws Exception {
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        InputStream in = null;
        try {
            in = codec.createInputStream(fs.open(new Path(fileName)));
            // Pass close=false so that System.out stays open after the copy
            IOUtils.copyBytes(in, System.out, conf, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }

    // Infer the codec from the file extension and decompress the file next to the original.
    public static void uncompress1(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("no codec found for " + uri);
            System.exit(1);
        }
        // The output path is the input path without the codec's extension
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }

    public static void main(String[] args) throws Exception {
        //compress("org.apache.hadoop.io.compress.GzipCodec");
        //uncompress("/user/hadoop/text.gz");
        uncompress1("hdfs://master:9000/user/hadoop/text.gz");
    }
}
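First run the compress(...) call in main (commented out in the listing above) to compress the file, then run the uncompress(...) call to decompress it. That method decompresses to standard output, so running it prints the contents of /user/hadoop/aa.txt on the console. Running the uncompress1(...) call instead decompresses the file to /user/hadoop/text: it chooses the codec from the extension of /user/hadoop/text.gz, and the output path is simply the input path with that extension removed.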
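To make the extension-based lookup more concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is not Hadoop's implementation: the class ExtensionLookupSketch, its two-entry table, and both helper methods are hypothetical stand-ins that only mimic the idea behind getCodec and removeSuffix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for extension-based codec lookup; not Hadoop's own code.
final class ExtensionLookupSketch {
    // Assumed extension-to-codec table, for illustration only.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    // Returns the codec class name whose extension ends the path, or null if none matches.
    static String codecFor(String path) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (path.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Strips the suffix when the name ends with it; otherwise returns the name unchanged.
    static String removeSuffix(String name, String suffix) {
        if (name.endsWith(suffix)) {
            return name.substring(0, name.length() - suffix.length());
        }
        return name;
    }

    public static void main(String[] args) {
        String uri = "hdfs://master:9000/user/hadoop/text.gz";
        System.out.println("codec:  " + codecFor(uri));
        System.out.println("output: " + removeSuffix(uri, ".gz"));
    }
}
```

A path with an unknown extension, such as /user/hadoop/aa.txt, yields null here, which corresponds to the "no codec found" branch in uncompress1.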
After compressing the file, run the command ./hadoop fs -ls /user/hadoop/ to view the file information:
1 [hadoop@master bin]$ ./hadoop fs -ls /user/hadoop/
2 Found 7 items
3 -rw-r--r--   3 hadoop supergroup   76805248 2013-06-17 23:55 /user/hadoop/aa.mp4
4 -rw-r--r--   3 hadoop supergroup        520 2013-06-17 22:29 /user/hadoop/aa.txt
5 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 17:19 /user/hadoop/input
6 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 19:32 /user/hadoop/output
7 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 17:08 /user/hadoop/test
8 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 19:45 /user/hadoop/test1
9 -rw-r--r--   3 hadoop supergroup         46 2013-06-19 20:09 /user/hadoop/text.gz
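Line 4 is the file before compression (/user/hadoop/aa.txt), which is 520 bytes. Line 9 is the compressed file (/user/hadoop/text.gz), which is 46 bytes, less than a tenth of the original size. This illustrates the two main benefits of compression described above.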
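The size difference can also be tried locally without a cluster. The sketch below swaps Hadoop's GzipCodec for the JDK's java.util.zip gzip streams, so it only illustrates the effect and is not the code path used above. The class GzipSizeDemo and its 520-byte sample text are made up for this example; the real aa.txt has different contents, so the compressed size will not necessarily match the listing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Local gzip round trip using the JDK only (an assumption; the post itself uses Hadoop's GzipCodec).
final class GzipSizeDemo {
    // Builds made-up, highly repetitive sample data: 40 copies of a 13-byte line (520 bytes).
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            sb.append("hello hadoop\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the bytes into gzip format in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip trailer is written when the stream closes, so read the buffer afterwards.
        return buffer.toByteArray();
    }

    // Decompresses gzip bytes back into the original data.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        byte[] packed = gzip(data);
        System.out.println("original bytes:   " + data.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Repetitive text like this compresses very well; data that is already compressed, such as the aa.mp4 file in the listing, would typically shrink far less.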