concat_ws in Spark (Hive) SQL has two error-prone behaviors:
- the output is unordered
- NULL values are silently dropped
Making concat_ws output ordered
The examples below build each user's behavior sequence, ordered by event time.
Method 1: window partition + row_number
SELECT *
FROM (
    SELECT
        user_id
        , concat_ws(' ', collect_list(event) OVER (PARTITION BY user_id ORDER BY event_time)) AS event_list
        , row_number() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM
        table
) b
WHERE rn = 1
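To see why selecting rn = 1 works, here is a minimal pure-Python sketch (illustrative data, Spark is not invoked): with ORDER BY in the window, collect_list uses a cumulative frame, so the chronologically last row of each user carries the full ordered list, and row_number() ... DESC = 1 selects exactly that row.

```python
rows = [  # (user_id, event_time, event) -- hypothetical sample data
    ('u1', '2019-11-21 22:08:30.001', 'buy'),
    ('u1', '2019-11-21 22:06:55.564', 'click'),
    ('u1', '2019-11-21 22:07:01.120', 'view'),
]

rows_sorted = sorted(rows, key=lambda r: r[1])   # ORDER BY event_time
cumulative = []
per_row = []
for _, _, event in rows_sorted:
    cumulative.append(event)                     # cumulative collect_list frame
    per_row.append(' '.join(cumulative))         # concat_ws(' ', ...)

# per_row holds event_list for every row; the rn = 1 row under
# ORDER BY event_time DESC is the last one, which has the full list
event_list = per_row[-1]
print(event_list)
```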
Method 2: group by + sort_array
CREATE TABLE event_table AS
SELECT
    user_id
    , sort_array(collect_list(concat_ws('_', event_time, event))) AS time_event_list
FROM
    table
GROUP BY
    user_id;
Then a UDF strips the event_time prefix and keeps only the event:
# length of the fixed-width 'event_time_' prefix on each element
time_length = len('2019-11-21 22:06:55.564_')

def event_list(time_event_text):
    time_event_list = time_event_text.split('~')
    # drop the leading timestamp, keep only the event name
    events = [x[time_length:] for x in time_event_list]
    return ' '.join(events)

spark.udf.register("event_list", lambda x: event_list(x))
spark.sql("SELECT user_id, event_list(concat_ws('~', time_event_list)) AS event_list FROM event_table")
Method 1 is very costly: the cumulative event_list is materialized for every row, and only the rn = 1 row is kept.
Method 2 collapses many rows into one with a group by, so it runs much faster.
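The whole of method 2 can be sketched in pure Python (assumed sample data; Spark itself is not invoked). Note why it is correct: sort_array orders the 'time_event' strings lexicographically, which matches chronological order because the timestamp prefix is fixed-width and zero-padded.

```python
# length of the fixed-width 'event_time_' prefix on each element
time_length = len('2019-11-21 22:06:55.564_')

collected = [  # what collect_list might return for one user (unordered)
    '2019-11-21 22:08:30.001_buy',
    '2019-11-21 22:06:55.564_click',
    '2019-11-21 22:07:01.120_view',
]

time_event_list = sorted(collected)   # sort_array: lexicographic == chronological here
joined = '~'.join(time_event_list)    # concat_ws('~', time_event_list)
# the UDF: split, strip the timestamp prefix, rejoin with spaces
events = ' '.join(x[time_length:] for x in joined.split('~'))
print(events)
```

This relies on '~' never appearing inside an event name; pick a separator that cannot occur in the data.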
concat_ws ignores NULL
SELECT CONCAT_WS(',', 'First name', NULL, 'Last Name');
-- returns First name,Last Name
Use nvl to code defensively:
SELECT CONCAT_WS(',', 'First name', NVL(NULL, ''), 'Last Name');
-- returns First name,,Last Name
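The two behaviors above can be mimicked in a few lines of Python (None plays the role of SQL NULL; these helpers are illustrative stand-ins, not the real Spark functions):

```python
def concat_ws(sep, *args):
    # like Spark's concat_ws: NULL (None) arguments are silently skipped
    return sep.join(a for a in args if a is not None)

def nvl(value, default):
    # like Hive's nvl: substitute a default when the value is NULL (None)
    return default if value is None else value

print(concat_ws(',', 'First name', None, 'Last Name'))           # NULL silently dropped
print(concat_ws(',', 'First name', nvl(None, ''), 'Last Name'))  # empty placeholder kept
```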