Spark SQL Higher-Order Functions

Background

An Introduction to Higher Order Functions in Spark SQL

Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.

While this feature is certainly useful, it can be quite cumbersome to manipulate data inside complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, doing so is often time-consuming, non-performant, and non-trivial. During this talk we will discuss some of the commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows a user to manipulate complex data such as arrays.

Video link

transform

Applies the same operation to every element of an array.

1. Example: the initial data, registered as a table named data

createTempView(data, "data")

id  sum  reduce
1   2    1
2   5    3
2. Combine the two columns into an array

result <- sql("select *, array(sum, reduce) as merge from data")
createTempView(result, "result")

id  sum  reduce  merge
1   2    1       2,1
2   5    3       5,3
3. Use the higher-order function transform to add 1 to every element of the merge column (an array type)

result <- sql("select *, transform(merge, merge -> merge + 1) as final from result")
createTempView(result, "final")

id  sum  reduce  merge  final
1   2    1       2,1    3,2
2   5    3       5,3    6,4
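For intuition, transform is just a per-element map over the array column. A minimal plain-Python sketch of the same pipeline, using the example rows from the tables above:

```python
# Rows of the example table: (id, sum, reduce)
rows = [(1, 2, 1), (2, 5, 3)]

# array(sum, reduce) builds the merge column from the two scalar columns
merged = [(i, s, r, [s, r]) for (i, s, r) in rows]

# transform(merge, x -> x + 1) maps the lambda over each array element
final = [(i, s, r, m, [x + 1 for x in m]) for (i, s, r, m) in merged]
```

This reproduces the final table: row 1 yields merge [2, 1] and final [3, 2], row 2 yields [5, 3] and [6, 4].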

Higher-order functions can improve performance significantly. With the old approach you first had to explode the array into one row per element, apply the operation, and then group by a unique key and reassemble the array with collect_list. The group by triggers a shuffle, so this is expensive:

SELECT id,
       collect_list(val + 1) AS vals
FROM (SELECT id,
             explode(vals) AS val
      FROM input_tbl) x
GROUP BY id

A UDF can also be used, but it forces Spark to serialize the data out of its internal format and back, which is also an expensive operation:

def addOne(values: Seq[Int]): Seq[Int] = {
  values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_: Seq[Int]): Seq[Int])

SELECT id, plusOneInt(vals) as vals FROM input_tbl

Nested transform

Used when an array is nested inside another array.

SELECT key,
       nested_values,
       TRANSFORM(nested_values,
                 values -> TRANSFORM(values,
                                     value -> value + key + SIZE(values))) AS new_nested_values
FROM nested_data
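The nested TRANSFORM is a map inside a map: the outer lambda receives each inner array, and the inner lambda receives each element. A plain-Python sketch of the same logic (the key and nested_values values are assumed example data, not from the original post):

```python
# One assumed row of nested_data: a key plus an array of arrays
key = 1
nested_values = [[1, 2], [3, 4, 5]]

# TRANSFORM(nested_values,
#           values -> TRANSFORM(values, value -> value + key + SIZE(values)))
new_nested_values = [[value + key + len(values) for value in values]
                     for values in nested_values]
```

Each element is shifted by the row's key plus the length of its own inner array, so [1, 2] (size 2) becomes [4, 5] and [3, 4, 5] (size 3) becomes [7, 8, 9].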

exists

Tests whether any element of an array satisfies a predicate.

We reuse the result table from above:

createTempView(result, "result")

id  sum  reduce  merge
1   2    1       2,1
2   5    3       5,3

Check whether the element 1 exists in merge:

sql("select *, exists(merge, merge_value -> merge_value == 1) as exists from result")

id  sum  reduce  merge  exists
1   2    1       2,1    TRUE
2   5    3       5,3    FALSE
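Semantically, exists is the short-circuiting "any" over an array. A plain-Python sketch, keyed by the id column from the table above:

```python
# merge column per id, from the result table above
merged = {1: [2, 1], 2: [5, 3]}

# exists(merge, merge_value -> merge_value == 1)
exists = {i: any(v == 1 for v in arr) for i, arr in merged.items()}
```

This matches the table: id 1 contains a 1 (TRUE), id 2 does not (FALSE).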

aggregate

Let's jump straight to an advanced aggregation.

The second argument of aggregate initializes the accumulator; it can be a plain value or a tuple (struct). The function then behaves like an ordinary reduce: a merge function folds each array element into the buffer, and an optional finish function transforms the final buffer.

SELECT key,
       values,
       AGGREGATE(values,
                 (1.0 AS product, 0 AS N),
                 (buffer, value) -> (value * buffer.product, buffer.N + 1),
                 buffer -> Power(buffer.product, 1.0 / buffer.N)) AS geomean
FROM nested_data

The query above computes the geometric mean (geomean) of each values array.
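The fold can be replayed in plain Python: the buffer is a (product, N) pair, the merge step multiplies in each value and counts it, and the finish step takes the N-th root. The contents of values are an assumed example:

```python
from functools import reduce

values = [2.0, 8.0]  # assumed contents of one `values` array

# (1.0 AS product, 0 AS N): the initial buffer
# (buffer, value) -> (value * buffer.product, buffer.N + 1): the merge step
buf = reduce(lambda b, v: (v * b[0], b[1] + 1), values, (1.0, 0))

# buffer -> Power(buffer.product, 1.0 / buffer.N): the finish step
geomean = buf[0] ** (1.0 / buf[1])
```

For [2.0, 8.0] the buffer ends as (16.0, 2), and the finish step yields 16.0 ** 0.5 = 4.0.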

Databricks series

spark-2.4 notebook
