Background
An Introduction to Higher Order Functions in Spark SQL
Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.
While this feature is certainly useful, it can be quite cumbersome to manipulate the data inside complex objects, because SQL (and Spark) lack primitives for working with such data; doing so is time-consuming, non-performant, and non-trivial. In this talk we will discuss some commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows users to manipulate complex data such as arrays.
transform
Applies the same operation to every element of an array.
- Example: the initial data, registered as a temp view named `data` via `createTempView(data, "data")`:
id | sum | reduce |
---|---|---|
1 | 2 | 1 |
2 | 5 | 3 |
- Merge the two columns into a single array column:
result <- sql("select *, array(sum, reduce) as merge from data")
createTempView(result, "result")
id | sum | reduce | merge |
---|---|---|---|
1 | 2 | 1 | 2,1 |
2 | 5 | 3 | 5,3 |
- Use the higher-order function `transform` to add 1 to every element of the `merge` column (array type) in `result`:
result <- sql("select *, transform(merge, merge -> merge + 1) as final from result")
createTempView(result, "final")
id | sum | reduce | merge | final |
---|---|---|---|---|
1 | 2 | 1 | 2,1 | 3,2 |
2 | 5 | 3 | 5,3 | 6,4 |
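Outside Spark, the effect of `transform(merge, merge -> merge + 1)` can be sketched in plain Python as a per-element map over each row's array (an illustration of the semantics only, not Spark code):

```python
# Rows of the "result" view above; each row carries an array column "merge".
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

# transform maps the lambda over every array element, producing a new
# array column without exploding the rows.
for row in rows:
    row["final"] = [m + 1 for m in row["merge"]]

print([row["final"] for row in rows])  # [[3, 2], [6, 4]]
```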
Higher-order functions can deliver a significant performance improvement. With the old approach you must first `explode` the array, then `group by` a unique key and re-assemble it with a `collect_list` aggregation; the `group by` triggers a shuffle, which is expensive:
SELECT id,
collect_list(val + 1) AS vals
FROM (SELECT id,
explode(vals) AS val
FROM input_tbl) x
GROUP BY id
A UDF also works, but it serializes and deserializes the data, which is likewise an expensive operation:
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])
SELECT id, plusOneInt(vals) as vals FROM input_tbl
Nested transform
Use this when an array contains nested arrays:
SELECT key,
nested_values,
TRANSFORM(nested_values,
values -> TRANSFORM(values,
value -> value + key + SIZE(values))) AS new_nested_values
FROM nested_data
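The nested TRANSFORM above maps over the outer array, and for each inner array maps again, adding the row's `key` plus the inner array's size to every element. The semantics can be sketched in plain Python (the sample row is a hypothetical illustration):

```python
# Hypothetical row of nested_data: key = 1, nested_values = [[1, 2], [3, 4, 5]]
key = 1
nested_values = [[1, 2], [3, 4, 5]]

# The outer TRANSFORM iterates over the inner arrays; the inner TRANSFORM
# adds key + SIZE(values) to each element of that inner array.
new_nested_values = [
    [value + key + len(values) for value in values]
    for values in nested_values
]

print(new_nested_values)  # [[4, 5], [7, 8, 9]]
```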
exists
Checks whether any element of an array satisfies a predicate.
We reuse the result from above:
createTempView(result, "result")
id | sum | reduce | merge |
---|---|---|---|
1 | 2 | 1 | 2,1 |
2 | 5 | 3 | 5,3 |
Check whether the value 1 exists among the elements of `merge`:
sql("select *, exists(merge, merge_value -> merge_value == 1) as exists from result")
id | sum | reduce | merge | exists |
---|---|---|---|---|
1 | 2 | 1 | 2,1 | TRUE |
2 | 5 | 3 | 5,3 | FALSE |
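`exists` returns true as soon as one element satisfies the predicate; in plain Python its behavior corresponds to `any` (illustration only, not Spark code):

```python
# Rows of the "result" view; each row carries an array column "merge".
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

# exists(merge, v -> v == 1) is true iff some element equals 1.
for row in rows:
    row["exists"] = any(v == 1 for v in row["merge"])

print([row["exists"] for row in rows])  # [True, False]
```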
aggregate
Let's go straight to an advanced aggregation.
The second argument of `aggregate` can be a scalar or a struct, and it initializes the accumulator (buffer); the function then works like an ordinary reduce/fold: a merge lambda combines the buffer with each element, and an optional finish lambda transforms the final buffer into the result.
SELECT key,
values,
AGGREGATE(values,
(1.0 AS product, 0 AS N),
(buffer, value) -> (value * buffer.product, buffer.N + 1),
buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM nested_data
The query above computes the geometric mean (geomean) of each array of values.
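The AGGREGATE call folds the array with a `(product, N)` buffer and then applies a finish function; the same logic can be sketched in plain Python with `functools.reduce` (a sketch of the semantics, not Spark code):

```python
from functools import reduce

def geomean(values):
    # Initial buffer: (product = 1.0, N = 0), matching (1.0 AS product, 0 AS N).
    # Merge step: multiply the running product and count the elements.
    product, n = reduce(
        lambda buf, v: (buf[0] * v, buf[1] + 1),
        values,
        (1.0, 0),
    )
    # Finish step: Power(product, 1.0 / N).
    return product ** (1.0 / n)

print(geomean([2.0, 8.0]))  # 4.0
```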