Decompressing Snappy, LZO, bzip2, gzip, and deflate files


Snappy, LZO, bzip2, gzip, and deflate are all compression formats commonly used with Hive, each with its own strengths. Here we focus only on decompressing the actual files.

1. First, the code:

package compress;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

	public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();
		// Register extra codecs (the LZO classes come from the hadoop-lzo jar);
		// the built-in Hadoop codecs are also discovered automatically on 2.x.
		String name = "io.compression.codecs";
		String value = "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec";
		conf.set(name, value);
		CompressionCodecFactory factory = new CompressionCodecFactory(conf);
		for (int i = 0; i < args.length; ++i) {
			// The codec is chosen from the file name extension (.gz, .bz2, .lzo, .snappy, .deflate).
			CompressionCodec codec = factory.getCodec(new Path(args[i]));
			if (codec == null) {
				System.out.println("Codec for " + args[i] + " not found.");
			} else {
				CompressionInputStream in = null;
				try {
					// Wrap the raw file stream in the codec's decompressing input stream.
					in = codec.createInputStream(new java.io.FileInputStream(
							args[i]));
					byte[] buffer = new byte[100];
					int len = in.read(buffer);
					while (len > 0) {
						// Write the decompressed bytes to stdout.
						System.out.write(buffer, 0, len);
						len = in.read(buffer);
					}
				} finally {
					if (in != null) {
						in.close();
					}
				}
			}
		}
	}
}
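
As a side note, the manual read/write loop can be replaced by Hadoop's IOUtils.copyBytes, which does the same buffered copy internally. A minimal variant is sketched below under the assumption of the same classpath; the class name DecompressWithIOUtils is made up for illustration:

package compress;

import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class DecompressWithIOUtils {

	public static void main(String[] args) throws Exception {
		CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
		for (String arg : args) {
			// As before, the codec is resolved from the file extension.
			CompressionCodec codec = factory.getCodec(new Path(arg));
			if (codec == null) {
				System.out.println("Codec for " + arg + " not found.");
				continue;
			}
			CompressionInputStream in = codec.createInputStream(new FileInputStream(arg));
			try {
				// Stream the decompressed bytes to stdout with a 4 KB buffer; 'false' keeps stdout open.
				IOUtils.copyBytes(in, System.out, 4096, false);
			} finally {
				in.close();
			}
		}
	}
}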

2. Preparation

1. Prepare the dependencies

Briefly, the core codec classes for these compression formats are:

org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec

First we need the jars that provide these classes; I put all the dependencies required for decompression under /home/apache/test/lib/.

We also need the native libraries used by these codecs. Find a machine with Hadoop installed and copy its $HADOOP_HOME/lib/native directory over; I placed it under /tmp/decompress.
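
To sanity-check that both the native libraries and the codec jars are actually visible, a small helper like the one below can be run with the same -Djava.library.path and -classpath options used later in section 3 (NativeCheck is a made-up class name, not part of the original code):

package compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {

	public static void main(String[] args) {
		// True only when libhadoop was found via -Djava.library.path.
		System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
		// Lists every codec class the factory can resolve from the current classpath.
		System.out.println("codecs: " + CompressionCodecFactory.getCodecClasses(new Configuration()));
	}
}

If the first line prints false, the Snappy and LZO codecs in particular will not be usable, since they rely on native code.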

2. Prepare the compressed files

2.1 The Snappy file

Since I don't have the Snappy library installed locally, I use Hive to create the Snappy-compressed file.

Only two parameters are needed:

hive.exec.compress.output, set to true, declares that the result files should be compressed.

mapred.output.compression.codec selects the concrete codec used for the result files.

Check these two parameters in the Hive shell, set the codec to the Snappy format we want, and then run any SQL that writes its result to a local directory:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
This gives us the result file /tmp/snappy/000000_0.snappy.

2.2 The LZO file

Same as above, but this time we point the codec at LZO (LzopCodec):

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This produces the result file /tmp/lzo/000000_0.lzo.

2.3 Creating the bz2 and gz files

Create the bz2 file. The test data is simply a copy of /etc/resolv.conf (the bzip2 step itself was not captured; judging by the tar header that appears in the decompressed output in section 3, the file was created as a tar archive compressed with bzip2, e.g. tar jcf resolv.conf.bz2 resolv.conf):
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

Create the gz file:
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf

Note that tar zcf produces a gzip-compressed tar archive rather than a bare .gz file, which is why the decompressed output in section 3 starts with the tar header (resolv.conf0000644...ustar...) before the actual file content.
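
As an aside, if neither Hive nor tar is convenient, the same Hadoop codec API can also create a bare compressed file programmatically. A minimal sketch with GzipCodec follows; the class name MakeGz and the output path are made up for illustration:

package compress;

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class MakeGz {

	public static void main(String[] args) throws Exception {
		// Instantiate the codec through ReflectionUtils so it receives a Configuration.
		GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
		FileInputStream in = new FileInputStream("/etc/resolv.conf");
		// Unlike 'tar zcf', this writes a bare .gz file with no tar header in front of the data.
		CompressionOutputStream out = codec.createOutputStream(new FileOutputStream("/tmp/bz2/resolv.conf.plain.gz"));
		try {
			IOUtils.copyBytes(in, out, 4096, false);
			out.finish();
		} finally {
			out.close();
			in.close();
		}
	}
}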

2.4 Creating the deflate file

Switch the codec to DefaultCodec, which writes .deflate files:

hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%,  reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006

Task with the most failures(4): 
-----
Task ID:
  task_1385947742139_0006_r_000000

URL:
  http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
	... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
	... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
	at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
	... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
	... 18 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 4.85 sec   HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
Hive apparently does not pick up Hadoop's classpath here, so the fix was to copy the dependency into Hive's own classpath, restart Hive, and re-run the query:

cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib

Everything is now in place; compile the class above and we can start.

3. Decompression

1. The snappy file

Note that the program's arguments are the names of the files to decompress, and the matching decompressor is chosen from the compressed file's extension, so the extension must not be changed arbitrarily. Below we decompress the snappy file obtained earlier:

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. (file contents omitted) ................................
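
As noted above, the codec is resolved purely from the file name extension. That lookup can be watched in isolation with a small sketch (WhichCodec is a made-up class name; run it on the same classpath — the .lzo entry only resolves when the LZO codec jar is present):

package compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class WhichCodec {

	public static void main(String[] args) {
		CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
		// The lookup keys on the suffix only; the files do not need to exist.
		for (String name : new String[] { "a.gz", "a.bz2", "a.snappy", "a.deflate", "a.lzo" }) {
			CompressionCodec codec = factory.getCodec(new Path(name));
			System.out.println(name + " -> " + (codec == null ? "no codec found" : codec.getClass().getName()));
		}
	}
}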

2. The lzo file

Since I have the lzo library installed, the file can also be decompressed directly with lzop:

[apache@indigo lzo]$ lzop -d 000000_0.lzo 
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec  2 13:12 000000_0
-rwxr-xr-x. 1 apache apache  848 Dec  2 13:12 000000_0.lzo
Or simply pass the lzo file name to compress.Decompress:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


3. The bzip2 file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

4. The gzip file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

5. The deflate file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. (file contents omitted) ................................


