Summary of Spark SQL Statements

1. IN does not support subqueries

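In the Spark SQL version these notes were written against, a subquery inside IN is rejected, e.g. select * from src where key in (select key from test);
A literal list of values is supported, e.g. select * from src where key in (1,2,3,4,5);
IN with 40,000 values took 25.766 seconds
IN with 80,000 values took 78.827 seconds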
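The two timings can be compared with a little arithmetic. The sketch below only reuses the two measurements reported here; it assumes nothing about why the cost grows, and two data points are not enough to fit a curve.

```python
# Timings reported in these notes for IN lists of literal values (seconds).
timings = {40_000: 25.766, 80_000: 78.827}

# How much the list grew versus how much the elapsed time grew.
size_ratio = 80_000 / 40_000
time_ratio = timings[80_000] / timings[40_000]

# Average cost per listed value at each list size, in milliseconds.
per_value_ms = {n: 1000 * t / n for n, t in timings.items()}

print(f"list size grew {size_ratio:.1f}x, elapsed time grew {time_ratio:.2f}x")
for n, ms in per_value_ms.items():
    print(f"{n} values: {ms:.3f} ms per value")
```

Doubling the list roughly tripled the elapsed time, so the per-value cost rises with list size rather than staying constant.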

2. UNION ALL / UNION

A top-level union all is not supported, e.g. select key from src UNION ALL select key from test;
Wrapping it in a subquery is supported: select * from (select key from src union all select key from test) aa;
union is not supported.
The equivalent is supported: select distinct key from (select key from src union all select key from test) aa;
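As a quick check that select distinct over a wrapped union all returns the same rows as union, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption: SQLite is not Spark SQL, it only illustrates the result sets), and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the src and test tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table src (key int);
    create table test (key int);
    insert into src values (1), (2), (2), (3);
    insert into test values (3), (4);
""")

# union all wrapped in a subquery: keeps every row, duplicates included.
wrapped_union_all = conn.execute(
    "select * from (select key from src union all select key from test) aa"
).fetchall()

# distinct over the wrapped union all: the replacement for union.
distinct_over_union_all = conn.execute(
    "select distinct key from "
    "(select key from src union all select key from test) aa"
).fetchall()

# Native union, which SQLite does support, for comparison.
native_union = conn.execute(
    "select key from src union select key from test"
).fetchall()

print(len(wrapped_union_all))
print(sorted(distinct_over_union_all) == sorted(native_union))
```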

3. INTERSECT is not supported

4. MINUS is not supported

5. EXCEPT is not supported

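6. INNER JOIN / JOIN / LEFT OUTER JOIN / RIGHT OUTER JOIN / FULL OUTER JOIN / LEFT SEMI JOIN are all supported

For left outer join / right outer join / full outer join, the outer keyword is required.

join is the simplest join operation: only rows whose keys match on both sides are returned (the intersection).

left outer join is driven by the left table: every left row is kept, and where a key has no match in the right table the right-side columns are set to null.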

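full outer join keeps every row from both tables: rows are matched on the join condition, and unmatched rows from either side are kept with the other side's columns set to null. It is not a Cartesian product.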
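The three join types can be illustrated on two tiny keyed tables. This is a plain-Python sketch, not how any engine implements joins; it assumes each key appears at most once per side, and the values are placeholders.

```python
# Two small "tables" keyed by the join key (keys assumed unique per side).
left = {1: "l1", 2: "l2"}
right = {2: "r2", 3: "r3"}

# join: only keys present on both sides.
inner = [(k, left[k], right[k]) for k in sorted(left.keys() & right.keys())]

# left outer join: every left key; missing right values become None (null).
left_outer = [(k, left[k], right.get(k)) for k in sorted(left)]

# full outer join: every key from either side; the missing side becomes None.
full_outer = [
    (k, left.get(k), right.get(k))
    for k in sorted(left.keys() | right.keys())
]

print(inner)
print(left_outer)
print(full_outer)
```

With two rows on each side, the full outer join yields three rows, whereas a Cartesian product would yield four.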

Hive does not support subqueries in the where clause, so the in and exists subquery clauses common in SQL are not available in Hive.

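They can be replaced in either of the following two ways:
select * from src aa left outer join test bb on aa.key=bb.key where bb.key is not null;
select * from src aa left semi join test bb on aa.key=bb.key;
In most cases join ... on and left semi join ... on are equivalent.
They differ when tables A and B are joined and B contains duplicate rows:
with join ... on, a row in A is joined to both duplicate rows in B, because both satisfy the on condition;
with left semi join, a row in A is returned as soon as one matching row is found in B and the rest of B is not searched, so duplicates in B do not produce duplicate output rows.
left outer join supports a subquery, e.g. select aa.* from src aa left outer join (select * from test111) bb on aa.key=bb.a;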
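The effect of duplicate rows in B can be seen in a small experiment. The sketch below uses Python's sqlite3 module as a stand-in; SQLite has no left semi join, so an exists subquery is used here to play its role (an assumption made for illustration only). The value column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table src (key int, value text);
    create table test (key int);
    insert into src values (1, 'a'), (2, 'b'), (3, 'c');
    insert into test values (1), (1), (3);   -- key 1 is duplicated in test
""")

# Plain join: the src row with key 1 matches both duplicate rows in test.
inner_join = conn.execute(
    "select aa.key, aa.value from src aa join test bb on aa.key = bb.key "
    "order by aa.key"
).fetchall()

# Stand-in for: select * from src aa left semi join test bb on aa.key=bb.key
semi_join = conn.execute(
    "select aa.key, aa.value from src aa "
    "where exists (select 1 from test bb where bb.key = aa.key) "
    "order by aa.key"
).fetchall()

# left outer join filtered on a non-null right key behaves like the plain join.
left_outer_filtered = conn.execute(
    "select aa.key, aa.value from src aa left outer join test bb "
    "on aa.key = bb.key where bb.key is not null order by aa.key"
).fetchall()

print(inner_join)
print(semi_join)
print(left_outer_filtered == inner_join)
```

Note that the left outer join replacement also repeats rows when B has duplicates, so only the semi join form reproduces in/exists semantics exactly.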

7. Four ways to import data into Hive

1) Import data from the local file system into a Hive table

create table wyp(id int,name string,age int,tel string) ROW FORMAT delimited fields terminated by '\t' STORED AS TEXTFILE;
load data local inpath 'wyp.txt' into table wyp;

2) Import data from HDFS into a Hive table

$>bin/hadoop fs -cat /home/wyp/add.txt
hive> load data inpath '/home/wyp/add.txt' into table wyp;

3) Query data from another table and import it into a Hive table

hive> create table test(
> id int, name string
> ,tel string)
> partitioned by
> (age int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;

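Note: the test table uses age as its partition column. Partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory.
For example, if the wyp table has two partition columns, dt and city, then the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ,
and all data belonging to this partition is stored in that directory.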
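The mapping from partition values to a directory can be sketched as a small helper. The function name partition_dir is hypothetical, and the layout warehouse/table/column=value/... assumes a managed table in the default database under the default warehouse path.

```python
def partition_dir(warehouse, table, partitions):
    """Build the directory for one partition.

    partitions is an ordered list of (column, value) pairs, in the order
    the partition columns were declared in the table definition.
    """
    parts = [f"{name}={value}" for name, value in partitions]
    return "/".join([warehouse.rstrip("/"), table, *parts])


path = partition_dir(
    "/user/hive/warehouse", "wyp", [("dt", "20131218"), ("city", "BJ")]
)
print(path)
```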

hive> insert into table test
> partition (age='25')
> select id, name, tel
> from wyp;

The partition can also be specified dynamically, using a partition value taken from the select statement:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into table test
> partition (age)
> select id, name,
> tel, age
> from wyp;
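The difference between the static form, partition (age='25'), and the dynamic form, partition (age), can be sketched in plain Python. This only models where rows end up; the function names and sample rows are hypothetical. In the dynamic form, the partition value is read from the trailing column of the select list.

```python
from collections import defaultdict

# Hypothetical rows of wyp as (id, name, tel, age).
rows = [
    (1, "alice", "000-0001", 25),
    (2, "bob", "000-0002", 30),
    (3, "carol", "000-0003", 25),
]


def insert_static(rows, age):
    # partition (age='25'): every selected row lands in the named partition,
    # whatever age the source row carries.
    return {f"age={age}": [(i, n, t) for i, n, t, _ in rows]}


def insert_dynamic(rows):
    # partition (age): the last selected column decides the partition per row.
    parts = defaultdict(list)
    for i, n, t, age in rows:
        parts[f"age={age}"].append((i, n, t))
    return dict(parts)


static_result = insert_static(rows, 25)
dynamic_result = insert_dynamic(rows)
print(sorted(static_result))
print(sorted(dynamic_result))
```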

Hive also supports inserting data with insert overwrite, which replaces the existing data in the target partitions

hive> insert overwrite table test
> PARTITION (age)
> select id, name, tel, age
> from wyp;

Hive also supports multi-table inserts

hive> from wyp
> insert into table test
> partition(age)
> select id, name, tel, age
> insert into table test3
> select id, name
> where age>25;

4) Create a table and populate it in one step with the rows returned by a query on another table
hive> create table test4 as select id, name, tel from wyp;

8. View the create table statement

hive> show create table test3;

9. Rename a table

hive> ALTER TABLE events RENAME TO 3koobecaf;

10. Add a column to a table

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);

11. Add a column with a column comment

hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');

12. Drop a table

hive> DROP TABLE pokes;

13. Top N

hive> select * from test order by key limit 10;

14. Change a column's name and type

Create Database baseball;
alter table yangsy CHANGE product_no phone_no string;

Enabling FAIR mode in the Thrift Server

To enable FAIR scheduling in the Spark SQL Thrift Server:
1. Edit $SPARK_HOME/conf/spark-defaults.conf and add these two properties:
   spark.scheduler.mode FAIR
   spark.scheduler.allocation.file /Users/tianyi/github/community/apache-spark/conf/fair-scheduler.xml
2. Edit $SPARK_HOME/conf/fair-scheduler.xml (or create the file) with content in the following format:

<?xml version="1.0"?>
<allocations>
<pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <!-- weight is the share of resources a pool may use relative to the others when their minShare is the same -->
    <weight>1</weight>
    <!-- minShare is the amount of resources guaranteed to the pool first -->
    <minShare>2</minShare>
</pool>
<pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
</pool>
</allocations>

Restart the Thrift Server.
Before executing SQL, run:
set spark.sql.thriftserver.scheduler.pool=<pool name>;
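Since the pool name passed to set must match a pool defined in fair-scheduler.xml, it can help to read the file and check the name first. The sketch below is a hypothetical helper, not part of Spark: load_pools and set_pool_statement are made-up names, and the XML is the example from this section embedded as a string.

```python
import xml.etree.ElementTree as ET

# The allocation file from this section, embedded for a self-contained example.
FAIR_SCHEDULER_XML = """<?xml version="1.0"?>
<allocations>
<pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
</pool>
<pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
</pool>
</allocations>
"""


def load_pools(xml_text):
    """Return {pool name: settings} parsed from an allocation file."""
    root = ET.fromstring(xml_text)
    pools = {}
    for pool in root.findall("pool"):
        pools[pool.get("name")] = {
            "schedulingMode": pool.findtext("schedulingMode"),
            "weight": int(pool.findtext("weight")),
            "minShare": int(pool.findtext("minShare")),
        }
    return pools


def set_pool_statement(pools, name):
    """Build the set statement, refusing pool names the file does not define."""
    if name not in pools:
        raise ValueError(f"unknown pool: {name}")
    return f"set spark.sql.thriftserver.scheduler.pool={name};"


pools = load_pools(FAIR_SCHEDULER_XML)
print(set_pool_statement(pools, "production"))
```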

Once those operations have finished, run create table yangsy555 like CI_CUSER_YYMMDDHHMISSTTTTTT; and then insert into yangsy555 select * from CI_CUSER_YYMMDDHHMISSTTTTTT;

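Create a table with an auto-incrementing sequence column: use row_number() over() to add a sequence number to each row so the table can be queried page by page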

create table yagnsytest2 as SELECT ROW_NUMBER() OVER() as id,* from yangsytest;
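Once the id column exists, a page is just a range of ids. The sketch below runs the same statement in Python's sqlite3 module as a stand-in engine (an assumption: it needs SQLite 3.25 or later for window functions, and it is not Spark SQL). The name column, the sample rows, and the fetch_page helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table yangsytest (name text);
    insert into yangsytest values ('a'), ('b'), ('c'), ('d'), ('e');
    -- Same shape as the statement in these notes: number every row.
    create table yagnsytest2 as
        select row_number() over () as id, * from yangsytest;
""")


def fetch_page(conn, page, page_size):
    """Return one page of rows; pages are numbered from 1."""
    start = (page - 1) * page_size
    return conn.execute(
        "select id, name from yagnsytest2 "
        "where id > ? and id <= ? order by id",
        (start, start + page_size),
    ).fetchall()


second_page = fetch_page(conn, 2, 2)
print([row[0] for row in second_page])
```

Because over() has no order by, the numbering follows whatever order the engine reads the rows in; add an order by inside over() if pages need a stable, meaningful order.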

[Figure: execution flow of Spark SQL parsing compared with HiveQL parsing]
