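According to the official Shark website, Shark is 90x faster than Hive when the data is in RAM. That figure looks impressive, but results differ across test environments, tuning settings, and use cases, so I decided to measure the performance of Shark 0.9.1 running on Spark 1.0.0 and amplab Hive 0.11 myself.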
I. Cluster environment
3 slave nodes
II. Software environment
Spark 1.0.0 with Hadoop 0.20.2-cdh3u5
Shark 0.9.1 + amplab Hive 0.11
compared against (VS.)
Apache Hive 0.11
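III. Test subject
A table was created from a 21 GB text file, and a range of queries was benchmarked against it. The main comparison is between performance with all the data cached in memory and performance with the data on disk.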
[hadoop@wh-8-210 shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r-- 3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile
create table log (
c1 string,
c2 string,
c3 string,
c4 string,
c5 string,
c6 string,
c7 string,
c8 string,
c9 string,
c10 string,
c11 string,
c12 string,
c13 string
) row format delimited fields terminated by '\t' stored as textfile;
load data inpath '/user/hive/warehouse/21gfile' into table log;
Sample data:
[10.1.8.210:7100] shark> select * from log_cached limit 10;
2014-05-15 101289 13836998753 2 2010-08-23 22:36:50 0 0 2010-06-02 16:55:25 2010-06-02 16:55:25 None 0
2014-05-15 104497 15936529112 2 2011-01-11 09:58:47 0 0 2011-01-11 09:58:50 2011-01-19 09:58:50 61.172.242.36 2011-01-19 08:59:47 0
2014-05-15 105470 15000037972 0 2013-07-21 11:35:26 0 0 2013-07-21 11:29:08 2013-07-21 11:29:08 2013-07-21 11:35:26 0
2014-05-15 111864 13967013250 2 2010-11-28 21:06:56 0 0 2010-11-28 21:06:57 2010-12-06 21:06:57 61.172.242.36 2010-12-06 20:08:11 0
2014-05-15 112922 13766378550 2 2010-08-23 22:36:50 0 0 2010-03-29 00:08:17 2010-03-29 00:08:17 None 0
2014-05-15 113685 15882981310 2 2011-04-28 18:24:57 0 0 2011-04-28 17:38:37 2011-04-28 17:38:37 127.0.0.1 None 0
2014-05-15 116368 15957685805 2 2011-06-27 17:05:55 0 0 2011-06-27 17:06:01 2011-07-05 17:06:01 10.129.20.108 2011-07-05 16:11:05 0
2014-05-15 136020 13504661323 2 2012-02-11 18:51:17 0 0 2012-02-11 18:51:19 2012-02-19 18:51:19 10.129.20.109 2012-03-03 14:37:05 0
2014-05-15 137597 15993791204 2 2011-12-07 00:45:03 0 0 2011-12-07 00:44:59 2011-12-15 00:44:59 10.129.20.98 2011-12-14 23:45:40 0
2014-05-15 155020 13760211160 2 2011-05-25 14:27:24 0 0 2011-05-25 14:02:54 2011-05-25 14:02:54 127.0.0.1 2011-07-28 16:42:21 0
Time taken (including network latency): 0.33 seconds
Caching all 21 GB of data in memory
Cache the RDD
sql : CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * from log;
Time taken (including network latency): 282.006 seconds
[10.1.8.210:7100] shark> select * from log_cached limit 1;
2014-05-15 101289 13836998753 2 2010-08-23 22:36:50 0 0 2010-06-02 16:55:25 2010-06-02 16:55:25 None 0
After caching, there are 336 partitions, as shown in the figure:
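The figure of 336 partitions can be sanity-checked against the file size in the `hadoop dfs -ls` listing. The sketch below assumes an HDFS block size of 64 MiB, which the post does not state, and assumes that each block becomes one partition of the cached table. Under those assumptions the arithmetic matches the reported count, but it is a plausibility check rather than a confirmation of how Shark chose the partitioning.

```python
import math

# Size of /user/hive/warehouse/log/21gfile, copied from the `hadoop dfs -ls` output.
FILE_SIZE_BYTES = 22_499_035_249

# Assumption: 64 MiB HDFS blocks. The block size is not given in the post.
ASSUMED_BLOCK_SIZE = 64 * 1024 ** 2


def expected_partitions(file_size: int, block_size: int) -> int:
    """Number of blocks needed to cover a file of the given size."""
    return math.ceil(file_size / block_size)


partitions = expected_partitions(FILE_SIZE_BYTES, ASSUMED_BLOCK_SIZE)
size_gib = round(FILE_SIZE_BYTES / 1024 ** 3, 2)

print(partitions)  # 336
print(size_gib)    # 20.95
```

The second value also shows that the "21 GB" file is about 20.95 GiB.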
IV. Test cases and results
Neither Hive nor Shark was tuned in any way for this test, and both were tested in the same environment.
Test cases:
1. count
2. sum
3. avg
4. group by
5. join
6. select
7. sort
8. A moderately complex SQL statement
Test results:
Tests: count, sum, avg, group by

| Unit: seconds | shark(memory) | shark(disk) | apache hive 0.11 |
|---|---|---|---|
| count | 5.053 | 26.223 | 41.255 |
| sum | 13.207 | 33.401 | 48.204 |
| avg | 13.88 | 33.519 | 48.159 |
| group by | 8.14 | 29.852 | 54.705 |
Tests: join, select, sort

| Unit: seconds | shark(memory) | shark(disk) | apache hive 0.11 |
|---|---|---|---|
| join | 194.98 | 272.36 | 236.203 |
| select | 178.53 | 161.01 | 172.762 |
| sort | 134.23 | 161.07 | 161.789 |
Test: complex SQL
The moderately complex SQL statement used for this test:
set mapred.reduce.tasks=200;
create table complex as
select a.c4, a.time from
(select c4, max(c10) time from log_cached group by c4 ) a
join
(select c10 time, c11 from log_cached group by c10,c11 )b
on a.time = b.time
When this SQL is executed, Hive reports a total of 5 MapReduce jobs:
Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks not specified. Defaulting to jobconf value of: 200
In order to change the average load for a reducer (in by…
Ended Job = job_201406131753_0033
Moving data to: hdfs://10.1.8.210:9000/user/hive/warehouse/complex
Table default.complex stats: [num_partitions: 0, num_files: 200, num_rows: 0, total_size: 7769834233, raw_data_size: 0]
242791443 Rows loaded to hdfs://10.1.8.210:9000/tmp/hive-hadoop/hive_2014-06-20_17-05-17_199_1498492429758626519/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 84 Reduce: 200 Cumulative CPU: 3709.42 sec HDFS Read: 22542324372 HDFS Write: 8287517376 SUCCESS
Job 1: Map: 84 Reduce: 200 Cumulative CPU: 3015.82 sec HDFS Read: 22542324372 HDFS Write: 3528775299 SUCCESS
Job 2: Map: 43 Reduce: 200 Cumulative CPU: 4442.77 sec HDFS Read: 11816414510 HDFS Write: 7769834233 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 6 minutes 8 seconds 10 msec
OK
Time taken: 964.809 seconds
| Unit: seconds | shark(memory) | shark(disk) | apache hive 0.11 |
|---|---|---|---|
| complex | 349.858 | 388.844 | 964.809 |
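As we know, Shark's execution engine is Spark. When Spark runs a job, its DAGScheduler schedules the work as a DAG of stages rather than as a chain of separate MapReduce jobs, which cuts out redundant I/O and greatly improves execution efficiency.
So when running a moderately complex SQL statement, rather than the simple queries tested earlier, Shark's other advantages show up as well.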
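The I/O cost on the Hive side can be read from the job counters in the log above. The sketch below only adds up the HDFS read and write figures for the three listed jobs. That Job 2 reads back what Job 0 and Job 1 wrote is an inference from the near-identical byte counts, not something the log states outright.

```python
# (hdfs_read_bytes, hdfs_write_bytes) per job, copied from the Hive log.
jobs = {
    "Job 0": (22_542_324_372, 8_287_517_376),
    "Job 1": (22_542_324_372, 3_528_775_299),
    "Job 2": (11_816_414_510, 7_769_834_233),
}

# Output of the two sub-queries, written to HDFS before the join runs.
intermediate_written = jobs["Job 0"][1] + jobs["Job 1"][1]
job2_read = jobs["Job 2"][0]

total_read = sum(read for read, _ in jobs.values())
total_written = sum(written for _, written in jobs.values())

print(intermediate_written)              # 11816292675
print(job2_read - intermediate_written)  # 121835
print(total_read)                        # 56901063254
print(total_written)                     # 19586126908
```

Job 0 and Job 1 each scan the full source table, and the roughly 11.8 GB they write is almost exactly what Job 2 then reads. This suggests the intermediate results make a round trip through HDFS, which is the kind of extra I/O referred to above.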
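Summary:
1. For simple SQL queries using common operations such as count, sum, avg, and group by, Shark with the data cached in memory is roughly 3.5x to 8x faster than Hive. Reading from disk, it is still about 1.4x to 1.8x faster.
2. For simple join, select, and sort queries, Shark and Hive perform about the same.
3. For complex SQL, Shark's advantage shows: it is nearly 3x as fast as Hive (about 2.8x from memory and 2.5x from disk). The more complex the SQL, the more performance Hive loses, whereas Shark has optimizations for complex SQL.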
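The ratios behind these conclusions can be recomputed from the result tables. The sketch below copies the timings as reported and divides the Hive time by each Shark time. The `speedups` dictionary and its layout are only an illustration.

```python
# Timings in seconds from the result tables: (shark memory, shark disk, hive).
timings = {
    "count":    (5.053, 26.223, 41.255),
    "sum":      (13.207, 33.401, 48.204),
    "avg":      (13.88, 33.519, 48.159),
    "group by": (8.14, 29.852, 54.705),
    "join":     (194.98, 272.36, 236.203),
    "select":   (178.53, 161.01, 172.762),
    "sort":     (134.23, 161.07, 161.789),
    "complex":  (349.858, 388.844, 964.809),
}

# Speedup over Hive: a value above 1 means Shark was faster.
speedups = {
    name: (hive / memory, hive / disk)
    for name, (memory, disk, hive) in timings.items()
}

for name, (vs_memory, vs_disk) in speedups.items():
    print(f"{name:<9} memory {vs_memory:5.2f}x   disk {vs_disk:5.2f}x")
```

A value below 1 in the output marks a case where Hive finished first, as with the join test against on-disk data.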