Shark Performance Test

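According to the official Shark website, Shark is 90 times faster than Hive when the data is in RAM. That figure looks impressive, but results differ with the test environment, the tuning applied, and the use case. I therefore decided to measure the performance of Shark 0.9.1 running on Spark 1.0.0 and AMPLab Hive 0.11.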

I. Cluster Environment

An earlier post explains how to set up the cluster: see "Shark Cluster Setup and Configuration".

1 Master (the Master acts only as master, not as a slave)
3 Slaves

II. Software Environment

Spark 1.0.0 with hadoop0.20.2-cdh3u5
Shark 0.9.1 + AMPLab Hive 0.11
compared against
Apache Hive 0.11

III. Test Subject

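A table is built on a 21 GB text file, and a range of queries against that table is benchmarked.
The main comparison is between performance with all the data cached in memory and performance with the data on disk.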

[hadoop@wh-8-210 shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r--   3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile

create table log
(
c1 string,
c2 string,
c3 string,
c4 string,
c5 string,
c6 string,
c7 string,
c8 string,
c9 string,
c10 string,
c11 string,
c12 string,
c13 string
) row format delimited fields terminated by '\t' stored as textfile;

load data inpath '/user/hive/warehouse/21gfile' into table log;
Sample data:

[10.1.8.210:7100] shark> select * from log_cached limit 10;
        2014-05-15      101289  13836998753     2       2010-08-23 22:36:50     0       0       2010-06-02 16:55:25     2010-06-02 16:55:25             None    0
        2014-05-15      104497  15936529112     2       2011-01-11 09:58:47     0       0       2011-01-11 09:58:50     2011-01-19 09:58:50     61.172.242.36   2011-01-19 08:59:47      0
        2014-05-15      105470  15000037972     0       2013-07-21 11:35:26     0       0       2013-07-21 11:29:08     2013-07-21 11:29:08             2013-07-21 11:35:26     0
        2014-05-15      111864  13967013250     2       2010-11-28 21:06:56     0       0       2010-11-28 21:06:57     2010-12-06 21:06:57     61.172.242.36   2010-12-06 20:08:11      0
        2014-05-15      112922  13766378550     2       2010-08-23 22:36:50     0       0       2010-03-29 00:08:17     2010-03-29 00:08:17             None    0
        2014-05-15      113685  15882981310     2       2011-04-28 18:24:57     0       0       2011-04-28 17:38:37     2011-04-28 17:38:37     127.0.0.1       None    0
        2014-05-15      116368  15957685805     2       2011-06-27 17:05:55     0       0       2011-06-27 17:06:01     2011-07-05 17:06:01     10.129.20.108   2011-07-05 16:11:05      0
        2014-05-15      136020  13504661323     2       2012-02-11 18:51:17     0       0       2012-02-11 18:51:19     2012-02-19 18:51:19     10.129.20.109   2012-03-03 14:37:05      0
        2014-05-15      137597  15993791204     2       2011-12-07 00:45:03     0       0       2011-12-07 00:44:59     2011-12-15 00:44:59     10.129.20.98    2011-12-14 23:45:40      0
        2014-05-15      155020  13760211160     2       2011-05-25 14:27:24     0       0       2011-05-25 14:02:54     2011-05-25 14:02:54     127.0.0.1       2011-07-28 16:42:21      0
Time taken (including network latency): 0.33 seconds

Cache the full 21 GB of data in memory
(cache the RDD):

sql : CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * from log;
Time taken (including network latency): 282.006 seconds

[10.1.8.210:7100] shark> select * from log_cached limit 1; 
        2014-05-15      101289  13836998753     2       2010-08-23 22:36:50     0       0       2010-06-02 16:55:25     2010-06-02 16:55:25             None    0

After caching, the table is held in 336 partitions (see figure).
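The partition count can be cross-checked against the file size listed earlier. The sketch below assumes one partition per HDFS block and a 64 MiB block size (the default in Hadoop 0.20.x); the post does not state the block size, so `ASSUMED_BLOCK_BYTES` is an assumption, not a measured value.

```python
import math

# Size of /user/hive/warehouse/log/21gfile as reported by `hadoop dfs -ls`.
FILE_SIZE_BYTES = 22_499_035_249

# Assumption: 64 MiB HDFS blocks; the post does not state the block size.
ASSUMED_BLOCK_BYTES = 64 * 1024 * 1024


def size_in_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def expected_partitions(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold the file, counting the partial last block."""
    return math.ceil(file_bytes / block_bytes)


print(f"file size: {size_in_gib(FILE_SIZE_BYTES):.2f} GiB")
print(f"expected partitions: {expected_partitions(FILE_SIZE_BYTES, ASSUMED_BLOCK_BYTES)}")
```

Under that assumption the file spans 335 full blocks plus one partial block, which agrees with the 336 partitions reported after caching.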

IV. Test Cases and Results

Neither Hive nor Shark was tuned for this test, and both ran in the same environment. The results follow.

Test cases:

1. count
2. sum
3. avg
4. group by
5. join
6. select
7. sort
8. a moderately complex SQL statement

Test results:

count, sum, avg, and group by

Unit: seconds

test       shark (memory)   shark (disk)   Apache Hive 0.11
count      5.053            26.223         41.255
sum        13.207           33.401         48.204
avg        13.88            33.519         48.159
group by   8.14             29.852         54.705
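To make the gap concrete, the timings can be turned into speedup factors (Hive time divided by Shark time). This is a minimal sketch; the numbers are copied from the table, and the names `AGGREGATE_TIMINGS` and `speedup` are illustrative only.

```python
# Timings in seconds: (shark in memory, shark on disk, Apache Hive 0.11).
AGGREGATE_TIMINGS = {
    "count": (5.053, 26.223, 41.255),
    "sum": (13.207, 33.401, 48.204),
    "avg": (13.88, 33.519, 48.159),
    "group by": (8.14, 29.852, 54.705),
}


def speedup(hive_seconds: float, shark_seconds: float) -> float:
    """How many times faster Shark is than Hive (values above 1 favor Shark)."""
    return hive_seconds / shark_seconds


for test, (memory, disk, hive) in AGGREGATE_TIMINGS.items():
    print(f"{test:9s} memory: {speedup(hive, memory):.2f}x  disk: {speedup(hive, disk):.2f}x")
```

With the data cached in memory the factors range from about 3.5x (avg) to about 8.2x (count); reading from disk they stay between about 1.4x and 1.8x.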

join, select, and sort

Unit: seconds

test     shark (memory)   shark (disk)   Apache Hive 0.11
join     194.98           272.36         236.203
select   178.53           161.01         172.762
sort     134.23           161.07         161.789
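For these three tests the differences are small enough that a signed percentage relative to Hive is easier to read than a speedup factor. A minimal sketch follows, with the numbers copied from the table; `relative_to_hive` is an illustrative helper, not part of Shark or Hive.

```python
# Timings in seconds: (shark in memory, shark on disk, Apache Hive 0.11).
HEAVY_TIMINGS = {
    "join": (194.98, 272.36, 236.203),
    "select": (178.53, 161.01, 172.762),
    "sort": (134.23, 161.07, 161.789),
}


def relative_to_hive(shark_seconds: float, hive_seconds: float) -> float:
    """Signed difference as a fraction of Hive's time; negative means Shark was faster."""
    return (shark_seconds - hive_seconds) / hive_seconds


for test, (memory, disk, hive) in HEAVY_TIMINGS.items():
    print(
        f"{test:7s} memory: {relative_to_hive(memory, hive):+.1%}  "
        f"disk: {relative_to_hive(disk, hive):+.1%}"
    )
```

Every value falls within roughly 18% of Hive in either direction. Shark on disk is slower than Hive on the join, and the cached select is slightly slower than both Hive and Shark on disk.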

Complex SQL

A moderately complex SQL statement:

set mapred.reduce.tasks=200;
create table complex as 
select a.c4, a.time from 
(select c4, max(c10) time from log_cached group by c4 ) a
join 
(select c10 time, c11 from log_cached group by c10,c11 )b
on a.time = b.time;

When this SQL is executed, Hive launches 5 jobs:

Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks not specified. Defaulting to jobconf value of: 200
In order to change the average load for a reducer (in by…
Ended Job = job_201406131753_0033
Moving data to: hdfs://10.1.8.210:9000/user/hive/warehouse/complex
Table default.complex stats: [num_partitions: 0, num_files: 200, num_rows: 0, total_size: 7769834233, raw_data_size: 0]
242791443 Rows loaded to hdfs://10.1.8.210:9000/tmp/hive-hadoop/hive_2014-06-20_17-05-17_199_1498492429758626519/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 84  Reduce: 200  Cumulative CPU: 3709.42 sec   HDFS Read: 22542324372 HDFS Write: 8287517376 SUCCESS
Job 1: Map: 84  Reduce: 200  Cumulative CPU: 3015.82 sec   HDFS Read: 22542324372 HDFS Write: 3528775299 SUCCESS
Job 2: Map: 43  Reduce: 200  Cumulative CPU: 4442.77 sec   HDFS Read: 11816414510 HDFS Write: 7769834233 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 6 minutes 8 seconds 10 msec
OK
Time taken: 964.809 seconds
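The totals in this log can be checked with a little arithmetic. The sketch below sums the per-job CPU figures and compares them with the wall-clock time; the ratio is only a rough indicator of how many cores were busy on average, not something Hive itself reports.

```python
# Cumulative CPU seconds for Job 0, Job 1 and Job 2, copied from the log.
JOB_CPU_SECONDS = [3709.42, 3015.82, 4442.77]

# "Time taken" reported by Hive for the whole statement.
WALL_CLOCK_SECONDS = 964.809


def split_hms(total_seconds: float) -> tuple:
    """Break a duration into whole hours, minutes and seconds."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


total_cpu = sum(JOB_CPU_SECONDS)
average_busy_cores = total_cpu / WALL_CLOCK_SECONDS

print("total CPU (h, m, s):", split_hms(total_cpu))
print(f"CPU seconds per wall-clock second: {average_busy_cores:.1f}")
```

The three jobs add up to 3 hours 6 minutes 8 seconds of CPU time, matching the total printed by Hive, and that works out to a little under 12 CPU seconds per second of wall-clock time.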

Unit: seconds

test      shark (memory)   shark (disk)   Apache Hive 0.11
complex   349.858          388.844        964.809

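Shark's execution engine is Spark. When Spark runs a job, its DAGScheduler plans the whole query as a single graph of stages, so intermediate results do not have to be written to HDFS and read back between separate MapReduce jobs, as they are in Hive. Cutting this redundant I/O greatly improves execution efficiency.
This is why, on a moderately complex SQL statement, Shark's other advantages become visible in a way they did not in the simple queries tested earlier.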
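The Hive job summary for the complex query gives a sense of how much intermediate I/O is involved. The sketch below assumes that Job 2 consumes the output of Job 0 and Job 1; that reading is an inference from the nearly matching byte counts, not something the log states. No equivalent I/O figures were recorded for Shark, so this illustrates only the Hive side.

```python
# HDFS byte counts per job, copied from the "MapReduce Jobs Launched" summary.
HIVE_JOBS = [
    {"name": "Job 0", "hdfs_read": 22_542_324_372, "hdfs_write": 8_287_517_376},
    {"name": "Job 1", "hdfs_read": 22_542_324_372, "hdfs_write": 3_528_775_299},
    {"name": "Job 2", "hdfs_read": 11_816_414_510, "hdfs_write": 7_769_834_233},
]


def intermediate_bytes(jobs: list) -> int:
    """Bytes written to HDFS by every job except the last (assumed intermediate output)."""
    return sum(job["hdfs_write"] for job in jobs[:-1])


written = intermediate_bytes(HIVE_JOBS)
read_back = HIVE_JOBS[-1]["hdfs_read"]

print(f"intermediate data written: {written / 2**30:.2f} GiB")
print(f"read by the final job:     {read_back / 2**30:.2f} GiB")
```

Roughly 11 GiB is written to HDFS by the first two jobs and almost exactly the same amount is read by the third, on top of the source table being scanned twice.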

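Summary:

1. For simple SQL queries using common operations such as count, avg, sum, and group by, Shark with the data cached in memory is roughly 3.5x to 8x faster than Hive; reading from disk it is about 1.4x to 1.8x faster.

2. For join, select, and sort in simple SQL queries, Shark and Hive perform about the same.

3. For complex SQL queries Shark's advantage shows: it is nearly 3x as fast as Hive (about 2.8x from memory and 2.5x from disk). The more complex the SQL, the more performance Hive loses, whereas Shark is optimized for complex SQL.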

Original article. If you repost it, please credit the source: http://blog.csdn.net/oopsoom/article/details/34438963

-EOF-
