Background
An Introduction to Higher Order Functions in Spark SQL
Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.
While this feature is certainly useful, it can be quite cumbersome to manipulate the data inside complex objects, because SQL (and Spark) lack primitives for working with such data; doing so is time-consuming, non-performant, and non-trivial. In this talk we will discuss some commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows users to manipulate complex data such as arrays.
transform
Applies the same operation to every element of an array.
- Example: the initial data, registered as a temp view named `data` via `createTempView(data, "data")`:
id | sum | reduce |
---|---|---|
1 | 2 | 1 |
2 | 5 | 3 |
- Merge the two columns into a single array column:
result <- sql("select *, array(sum, reduce) as merge from data")
createTempView(result, "result")
id | sum | reduce | merge |
---|---|---|---|
1 | 2 | 1 | 2,1 |
2 | 5 | 3 | 5,3 |
- Use the higher-order function `transform` to add 1 to every element of the `merge` column (array type) in `result`:
result <- sql("select *, transform(merge, merge -> merge + 1) as final from result")
createTempView(result, "final")
id | sum | reduce | merge | final |
---|---|---|---|---|
1 | 2 | 1 | 2,1 | 3,2 |
2 | 5 | 3 | 5,3 | 6,4 |
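Outside Spark, the effect of `transform(merge, merge -> merge + 1)` can be sketched in plain Python as a per-element map over each row's array (an illustration of the semantics only, not Spark code):

```python
# Rows of the "result" view above; each row carries an array column "merge".
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

# transform maps the lambda over every array element, producing a new
# array column without exploding the rows.
for row in rows:
    row["final"] = [m + 1 for m in row["merge"]]

print([row["final"] for row in rows])  # [[3, 2], [6, 4]]
```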
Higher-order functions can deliver a significant performance improvement. With the old approach you must first `explode` the array, then `group by` a unique key and re-assemble it with a `collect_list` aggregation; the `group by` triggers a shuffle, which is expensive:
SELECT id,
collect_list(val + 1) AS vals
FROM (SELECT id,
explode(vals) AS val
FROM input_tbl) x
GROUP BY id
A UDF also works, but it serializes and deserializes the data, which is likewise an expensive operation:
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])
SELECT id, plusOneInt(vals) as vals FROM input_tbl
Nested transform
Use this when an array contains nested arrays:
SELECT key,
nested_values,
TRANSFORM(nested_values,
values -> TRANSFORM(values,
value -> value + key + SIZE(values))) AS new_nested_values
FROM nested_data
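The nested TRANSFORM above maps over the outer array, and for each inner array maps again, adding the row's `key` plus the inner array's size to every element. The semantics can be sketched in plain Python (the sample row is a hypothetical illustration):

```python
# Hypothetical row of nested_data: key = 1, nested_values = [[1, 2], [3, 4, 5]]
key = 1
nested_values = [[1, 2], [3, 4, 5]]

# The outer TRANSFORM iterates over the inner arrays; the inner TRANSFORM
# adds key + SIZE(values) to each element of that inner array.
new_nested_values = [
    [value + key + len(values) for value in values]
    for values in nested_values
]

print(new_nested_values)  # [[4, 5], [7, 8, 9]]
```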
exists
Checks whether any element of an array satisfies a predicate.
We reuse the result from above:
createTempView(result, "result")
id | sum | reduce | merge |
---|---|---|---|
1 | 2 | 1 | 2,1 |
2 | 5 | 3 | 5,3 |
Check whether the value 1 exists among the elements of `merge`:
sql("select *, exists(merge, merge_value -> merge_value == 1) as exists from result")
id | sum | reduce | merge | exists |
---|---|---|---|---|
1 | 2 | 1 | 2,1 | TRUE |
2 | 5 | 3 | 5,3 | FALSE |
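`exists` returns true as soon as one element satisfies the predicate; in plain Python its behavior corresponds to `any` (illustration only, not Spark code):

```python
# Rows of the "result" view; each row carries an array column "merge".
rows = [
    {"id": 1, "merge": [2, 1]},
    {"id": 2, "merge": [5, 3]},
]

# exists(merge, v -> v == 1) is true iff some element equals 1.
for row in rows:
    row["exists"] = any(v == 1 for v in row["merge"])

print([row["exists"] for row in rows])  # [True, False]
```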
aggregate
Let's go straight to an advanced aggregation.
The second argument of `aggregate` can be a scalar or a struct, and it initializes the accumulator (buffer); the function then works like an ordinary reduce/fold: a merge lambda combines the buffer with each element, and an optional finish lambda transforms the final buffer into the result.
SELECT key,
values,
AGGREGATE(values,
(1.0 AS product, 0 AS N),
(buffer, value) -> (value * buffer.product, buffer.N + 1),
buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM nested_data
The query above computes the geometric mean (geomean) of each array of values.
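The AGGREGATE call folds the array with a `(product, N)` buffer and then applies a finish function; the same logic can be sketched in plain Python with `functools.reduce` (a sketch of the semantics, not Spark code):

```python
from functools import reduce

def geomean(values):
    # Initial buffer: (product = 1.0, N = 0), matching (1.0 AS product, 0 AS N).
    # Merge step: multiply the running product and count the elements.
    product, n = reduce(
        lambda buf, v: (buf[0] * v, buf[1] + 1),
        values,
        (1.0, 0),
    )
    # Finish step: Power(product, 1.0 / N).
    return product ** (1.0 / n)

print(geomean([2.0, 8.0]))  # 4.0
```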