Snappy, LZO, bzip2, gzip, and deflate are all compression formats commonly used with Hive, each with its own strengths. Here we only look at how to decompress the resulting files.
1. The code
package compress;

import java.io.FileInputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

    public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the codecs the factory may choose from.
        String name = "io.compression.codecs";
        String value = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec,"
                + "com.hadoop.compression.lzo.LzoCodec,"
                + "com.hadoop.compression.lzo.LzopCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec";
        conf.set(name, value);

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (int i = 0; i < args.length; ++i) {
            // The factory picks the codec from the file's extension.
            CompressionCodec codec = factory.getCodec(new Path(args[i]));
            if (codec == null) {
                System.out.println("Codec for " + args[i] + " not found.");
            } else {
                CompressionInputStream in = null;
                try {
                    in = codec.createInputStream(new FileInputStream(args[i]));
                    // Copy the decompressed bytes to stdout.
                    byte[] buffer = new byte[100];
                    int len = in.read(buffer);
                    while (len > 0) {
                        System.out.write(buffer, 0, len);
                        len = in.read(buffer);
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }
    }
}
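As an aside (not part of the original post): for formats the JDK supports natively, such as gzip, the same read-and-copy loop works against java.util.zip streams, so data can be round-tripped without any Hadoop jars at all. A minimal sketch, with the class and method names being my own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class JdkGzipDemo {

    // Drain a decompressed stream into a byte array -- the same
    // read loop used in Decompress.main above.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[100];
        int len;
        while ((len = in.read(buffer)) > 0) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "nameserver 192.168.0.1\n".getBytes("UTF-8");

        // Compress in memory ...
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(compressed);
        gz.write(original);
        gz.close();

        // ... then decompress with the same loop.
        byte[] restored = readAll(new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray())));
        System.out.print(new String(restored, "UTF-8"));
    }
}
```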
2. Preparation
1) Dependencies
Briefly, the core classes for these compression formats are:
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec
First we need these dependencies; I put all the jars required for decompression under /home/apache/test/lib/.
We also need the native libraries the codecs rely on: from a machine that has Hadoop installed, copy the $HADOOP_HOME/lib/native directory over. I put it under /tmp/decompress.
2) Creating the compressed files
2.1 A Snappy file
Since I don't have the Snappy library installed locally, I use Hive to create the Snappy-compressed file.
Only two parameters are needed:
hive.exec.compress.output, set to true, declares that the result files should be compressed;
mapred.output.compression.codec sets the concrete compression format for the result files.
Check both parameters in the hive shell, set them to the Snappy format we want, then run any SQL that writes its result to a local file:
hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
This gives us the result file /tmp/snappy/000000_0.snappy.
2.2 An LZO file
Same as above, but with the compression format set to LZO:
hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This gives us the result file /tmp/lzo/000000_0.lzo.
2.3 bz2 and gz files
Creating the bz2 file:
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
Creating the gz file (note that tar zcf produces a gzipped tar archive rather than a plain gzip file, which is why a tar header shows up in the decompressed output later):
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf
2.4 A deflate file
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%, reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006
Task with the most failures(4):
-----
Task ID:
task_1385947742139_0006_r_000000
URL:
http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
... 18 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 4.85 sec HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
Surprisingly, Hive did not pick up Hadoop's classpath here, so the only option is to put the dependency on Hive's own classpath, restart Hive, and run the query again:
cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib
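Since the failure above boils down to a ClassNotFoundException, a small check (my own sketch, not from the original session) can confirm whether a codec class is visible on the current classpath before launching a job:

```java
public class CodecCheck {

    // Returns true if the named class can be loaded from the classpath.
    static boolean onClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The codec class the failed job was looking for.
        String codec = "org.apache.hadoop.io.compress.DefaultCodec";
        System.out.println(codec + " -> " + onClasspath(codec));
    }
}
```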
With everything in place, compile the class above and let's begin.
3. Decompression
1) The Snappy file
Note that the argument is the name of the file to decompress, and the matching decompression stream is created based on the compressed file's extension, so the extension must not be changed arbitrarily. Decompress the Snappy file we obtained earlier:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ..................
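The extension-based lookup can be pictured with a small sketch. The table below is a hypothetical mirror of what CompressionCodecFactory derives from each registered codec's getDefaultExtension(); it is not the factory's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class CodecByExtension {

    // Hypothetical extension-to-codec table mirroring the io.compression.codecs
    // value set in Decompress.main.
    static final Map<String, String> CODECS = new HashMap<String, String>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    // Returns the codec class name for a file, or null -- the same
    // "Codec for ... not found." case handled in Decompress.main.
    static String codecFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return CODECS.get(fileName.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(codecFor("/tmp/snappy/000000_0.snappy"));
        System.out.println(codecFor("/tmp/snappy/000000_0.renamed"));
    }
}
```

This is why renaming the extension breaks decompression: the lookup returns null and no codec is chosen.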
2) The LZO file
Since I have the lzo library installed, the file can also be decompressed directly with lzop:
[apache@indigo lzo]$ lzop -d 000000_0.lzo
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec 2 13:12 000000_0
-rwxr-xr-x. 1 apache apache 848 Dec 2 13:12 000000_0.lzo
With compress.Decompress, just pass the lzo file name:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
3) The bzip2 file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
4) The gzip file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
5) The deflate file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ..................
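One more observation, with the caveat that this is my own note rather than the post's: DefaultCodec's .deflate output is, as far as I can tell, zlib-wrapped DEFLATE data, so the JDK's InflaterInputStream can read such a file without Hadoop. A minimal round-trip sketch (the class and helper names are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class JdkDeflateDemo {

    // Inflate zlib-wrapped DEFLATE data -- the format a .deflate file
    // is assumed to hold.
    static byte[] inflate(byte[] compressed) throws IOException {
        InflaterInputStream in = new InflaterInputStream(
                new ByteArrayInputStream(compressed));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[100];
        int len;
        while ((len = in.read(buffer)) > 0) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "domain dhcp\nsearch dhcp server\n".getBytes("UTF-8");

        // Produce zlib-wrapped data in memory for the demonstration.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        DeflaterOutputStream def = new DeflaterOutputStream(compressed);
        def.write(original);
        def.close();

        System.out.print(new String(inflate(compressed.toByteArray()), "UTF-8"));
    }
}
```

To try it on a real result file, read /tmp/deflate/000000_0.deflate into a byte array and pass it to inflate.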