A Quick Walkthrough of Spark SQL (Including Exporting Tables)

Spark SQL's predecessor was Shark. Shark's heavy dependence on Hive constrained Spark's development, and Spark SQL was created to replace it.
As long as Hive support is enabled at build time, Spark SQL can access Hive tables and use UDFs, SerDes, and HiveQL/HQL.

Starting spark-sql

$>spark-sql   
16/05/15 21:20:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  

Since nothing has been configured, it uses the built-in Derby store, created in whatever directory you start from: a metastore_db folder and a derby.log file appear there.
Like hive, spark-sql can execute a prepared SQL file directly with source:

spark-sql> source /home/guo/1.sql  

but it runs much faster than Hive.
Contents of 1.sql:

drop table if exists cainiao;  
create external table cainiao(dater bigint, item_id bigint, store_code bigint, qty_alipay_njhs bigint)   
row format delimited fields terminated by ','   
location '/cainiao';  

create table predict as select item_id, store_code, sum(qty_alipay_njhs) as target   
from cainiao where dater>=20141228 and dater<=20150110 group by item_id, store_code;  

drop table if exists cainiaoq;  
create external table cainiaoq(dater bigint, item_id bigint, qty_alipay_njhs bigint)   
row format delimited fields terminated by ','   
location '/cainiaoq';  

create table predictq as select item_id, "all" as store_code, sum(qty_alipay_njhs) as  target   
from cainiaoq where dater>=20141228 and dater<=20150110 group by item_id;  

The false after each table name means the table is not a temporary table:

spark-sql> show tables;  
cainiao false  
cainiaoq    false  
predict false  
predictq    false 

Most Hive syntax also works in Spark SQL, such as the external-table creation above, but exporting a table to a local directory does not:

spark-sql> insert overwrite local directory '/home/guo/cainiaodiqu'   
         > row format delimited   
         > fields terminated by ','   
         > select * from predict;  
Error in query:   
Unsupported language features in query: insert overwrite local directory '/home/guo/cainiaodiqu'   
row format delimited   
fields terminated by ','   
select * from predict  

So how do you get the table data out to the file system?
Official documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html

Method 1

If nothing is configured, Hive likewise uses its built-in Derby store, created in whatever directory it is started from. So as long as Hive is started from the same directory, the tables created in spark-sql are visible in Hive as well, and you can export them with the statement that spark-sql does not support. (In practice, Spark SQL and Hive are often configured to share the same metastore database.)
$> hive  
hive> show tables;  
OK  
cainiao  
cainiaoq  
ijcai  
ijcaitest  
ijpredict  
predict  
predictq  
Time taken: 2.136 seconds, Fetched: 7 row(s) 
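With the tables visible in Hive, the export that spark-sql rejected can be run there instead; a sketch using the same directory and table as above (requires a running Hive session):

```sql
hive> insert overwrite local directory '/home/guo/cainiaodiqu'
    > row format delimited
    > fields terminated by ','
    > select * from predict;
```

This writes delimited text files under /home/guo/cainiaodiqu on the local file system of the machine running Hive.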

Method 2
Start spark-shell:

guo@drguo:/opt/spark-1.6.1-bin-hadoop2.6/bin$ spark-shell   
16/05/15 20:30:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  
Welcome to  
      ____              __  
     / __/__  ___ _____/ /__  
    _\ \/ _ \/ _ `/ __/  '_/  
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1  
      /_/  

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_73)  
Type in expressions to have them evaluated.  
Type :help for more information.  
Spark context available as sc.  
SQL context available as sqlContext.

Since I started spark-shell in the same directory as spark-sql, the tables created earlier are of course still there. (Once a shared metastore database is configured, typically MySQL, you no longer need to start from the same directory.)

scala> sqlContext.sql("show tables").show  
+---------+-----------+  
|tableName|isTemporary|  
+---------+-----------+  
|  cainiao|      false|  
| cainiaoq|      false|  
|  predict|      false|  
| predictq|      false|  
+---------+-----------+  


scala> sqlContext.sql("select * from predict limit 10").show  
+-------+----------+------+  
|item_id|store_code|target|  
+-------+----------+------+  
|     33|         2|     1|  
|     33|         3|     0|  
|     33|         4|     4|  
|     33|         5|     1|  
|    132|         1|     0|  
|    132|         2|     1|  
|    132|         3|     1|  
|    330|         5|     1|  
|    549|         1|     3|  
|    549|         2|     2|  
+-------+----------+------+  

Exporting the table:

scala> sqlContext.sql("select * from predict ").write.format("json").save("predictj")  
scala> sqlContext.sql("select * from predict ").write.format("parquet").save("predictp")  
scala> sqlContext.sql("select * from predict ").write.format("orc").save("predicto")  
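To verify the round trip, the saved output can be read back in the same spark-shell session; a sketch using the Spark 1.6 read API and the relative paths from the save calls above:

```scala
// Read the exported data back as DataFrames.
val dfJson = sqlContext.read.format("json").load("predictj")
val dfParquet = sqlContext.read.parquet("predictp")

// The schema and rows should match the predict table.
dfParquet.printSchema()
dfParquet.show(10)
```

Parquet and ORC preserve the column types; the JSON export stores values as JSON numbers, so the inferred schema on read-back may differ slightly (e.g. bigint becoming long).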