Spark SQL's predecessor was Shark. Shark's heavy dependence on Hive constrained Spark's development, and Spark SQL was created to replace it.
As long as Hive support is enabled at compile time, Spark SQL can access Hive tables and use UDFs, SerDes, and HiveQL (HQL).
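For reference, a hedged sketch of how Hive support was typically enabled when building Spark 1.6 from source: the `-Phive` and `-Phive-thriftserver` Maven profiles. The Hadoop profile and version below are examples; adjust them for your cluster.

```shell
# Build Spark 1.6.x with Hive support (profiles per the "Building Spark" docs).
# -Phadoop-2.6 / hadoop.version are examples; match your own Hadoop install.
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 \
    -Phive -Phive-thriftserver \
    -DskipTests clean package
```

Pre-built Spark binary distributions of this era already included Hive support, so this only matters when compiling yourself.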
Starting spark-sql:
$>spark-sql
16/05/15 21:20:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Because nothing has been configured yet, it uses the built-in Derby store, which lives in whatever directory you launched from, producing a metastore_db folder and a derby.log file there.
Like hive, spark-sql can execute a prepared SQL file directly with source:
spark-sql> source /home/guo/1.sql
but it runs much faster than hive.
Contents of 1.sql:
drop table if exists cainiao;
create external table cainiao(dater bigint, item_id bigint, store_code bigint, qty_alipay_njhs bigint)
row format delimited fields terminated by ','
location '/cainiao';
create table predict as select item_id, store_code, sum(qty_alipay_njhs) as target
from cainiao where dater>=20141228 and dater<=20150110 group by item_id, store_code;
drop table if exists cainiaoq;
create external table cainiaoq(dater bigint, item_id bigint, qty_alipay_njhs bigint)
row format delimited fields terminated by ','
location '/cainiaoq';
create table predictq as select item_id, "all" as store_code, sum(qty_alipay_njhs) as target
from cainiaoq where dater>=20141228 and dater<=20150110 group by item_id;
The false after each table name means the table is not a temporary table:
spark-sql> show tables;
cainiao false
cainiaoq false
predict false
predictq false
Most Hive syntax works in Spark SQL, for example the external-table creation above, but exporting a table to a local directory does not:
spark-sql> insert overwrite local directory '/home/guo/cainiaodiqu'
> row format delimited
> fields terminated by ','
> select * from predict;
Error in query:
Unsupported language features in query: insert overwrite local directory '/home/guo/cainiaodiqu'
row format delimited
fields terminated by ','
select * from predict
How to export table data to the file system
Official documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html
Method one
If Hive is left unconfigured, it also uses its built-in Derby store, again in whatever directory it is started from. So by starting Hive in the same directory, the tables created in spark-sql show up in Hive as well, and you can export them with the statement spark-sql does not support. (In practice, Spark SQL and Hive are usually configured to share a single metastore database instead.)
$> hive
hive> show tables;
OK
cainiao
cainiaoq
ijcai
ijcaitest
ijpredict
predict
predictq
Time taken: 2.136 seconds, Fetched: 7 row(s)
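With the tables visible in Hive, the export statement that spark-sql rejected above runs as ordinary HiveQL — it is the same statement, just issued from the hive shell:

```sql
-- Works in Hive even though spark-sql 1.6 rejects it:
insert overwrite local directory '/home/guo/cainiaodiqu'
row format delimited
fields terminated by ','
select * from predict;
```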
Method two
Starting spark-shell:
guo@drguo:/opt/spark-1.6.1-bin-hadoop2.6/bin$ spark-shell
16/05/15 20:30:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.1
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_73)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
Because I started spark-shell in the same directory as spark-sql, the tables created earlier are of course still there. (Once a metastore database is configured, usually MySQL, there is no need to start them from the same directory.)
scala> sqlContext.sql("show tables").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| cainiao| false|
| cainiaoq| false|
| predict| false|
| predictq| false|
+---------+-----------+
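To make the shared-metastore setup mentioned above concrete, here is a minimal hive-site.xml sketch pointing the metastore at MySQL. The host, database name, and credentials are placeholders; the property names are the standard Hive metastore connection settings.

```xml
<!-- Placeholder values; put this file in the conf/ directory of both
     Hive and Spark so they share one metastore database. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

With this in place, every spark-sql, spark-shell, and hive session sees the same tables regardless of the directory it was started from.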
scala> sqlContext.sql("select * from predict limit 10").show
+-------+----------+------+
|item_id|store_code|target|
+-------+----------+------+
| 33| 2| 1|
| 33| 3| 0|
| 33| 4| 4|
| 33| 5| 1|
| 132| 1| 0|
| 132| 2| 1|
| 132| 3| 1|
| 330| 5| 1|
| 549| 1| 3|
| 549| 2| 2|
+-------+----------+------+
Exporting the table:
scala> sqlContext.sql("select * from predict ").write.format("json").save("predictj")
scala> sqlContext.sql("select * from predict ").write.format("parquet").save("predictp")
scala> sqlContext.sql("select * from predict ").write.format("orc").save("predicto")
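As a sanity check, the saved directories can be read back in the same spark-shell session using the Spark 1.6 DataFrameReader API; the paths below match the save calls above.

```scala
// Read back the JSON export and inspect a few rows.
val dfj = sqlContext.read.format("json").load("predictj")
dfj.show(5)

// Parquet round-trips the same way (read.parquet is shorthand
// for read.format("parquet").load).
val dfp = sqlContext.read.parquet("predictp")
dfp.printSchema()
```

Note that save writes a directory of part files (plus _SUCCESS), not a single file, because each partition is written in parallel.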