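This post continues from the previous article, "Possibly the most comprehensive Flink SQL feature demo ever, part 1".
Five ways to join tables in Flink SQL
Regular join on static tables
A regular join on static tables joins one static table with another static table.
Example: show the customers and their orders for a specific date, grouped by region and priority.
-- Orders table dev_orders (a static table backed by S3) joined with MySQL tables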
SET execution.type=batch;
USE CATALOG hive;
SELECT
r_name AS `region`,
o_orderpriority AS `priority`,
COUNT(DISTINCT c_custkey) AS `number_of_customers`,
COUNT(o_orderkey) AS `number_of_orders`
FROM dev_orders
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
WHERE
FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
AND NOT o_orderpriority = '4-NOT SPECIFIED'
GROUP BY r_name, o_orderpriority
ORDER BY r_name, o_orderpriority;
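Regular join on a dynamic table
A regular join on a dynamic table joins a dynamic table with static tables.
Example: replace the static orders table from the previous example with a dynamic table and run the same business logic.
-- Replace the static orders table dev_orders with the dynamic orders table prod_orders and remove the ORDER BY clause (not supported by the stream processing engine)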
SET execution.type=streaming;
USE CATALOG hive;
SELECT
r_name AS `region`,
o_orderpriority AS `priority`,
COUNT(DISTINCT c_custkey) AS `number_of_customers`,
COUNT(o_orderkey) AS `number_of_orders`
FROM default_catalog.default_database.prod_orders
JOIN prod_customer ON o_custkey = c_custkey
JOIN prod_nation ON c_nationkey = n_nationkey
JOIN prod_region ON n_regionkey = r_regionkey
WHERE
FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
AND NOT o_orderpriority = '4-NOT SPECIFIED'
GROUP BY r_name, o_orderpriority;
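Notes:
- Static tables are loaded only once, when the job starts; later updates to their data are not reflected in a job that is already running.
- Flink writes the data of all input tables into state.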
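To make the second note concrete, here is a minimal Python sketch of a symmetric hash join. It is a toy model, not Flink's implementation; the class `RegularJoin` and its method names are made up for illustration. It only shows why a regular streaming join has to remember every row from both inputs: any row that arrives later might still match.

```python
from collections import defaultdict


class RegularJoin:
    """Toy symmetric hash join: every row from either input stays in state."""

    def __init__(self):
        self.left_state = defaultdict(list)   # join key -> left rows seen so far
        self.right_state = defaultdict(list)  # join key -> right rows seen so far

    def on_left(self, key, row):
        # Remember the row, then emit matches against all right rows seen so far.
        self.left_state[key].append(row)
        return [(row, r) for r in self.right_state[key]]

    def on_right(self, key, row):
        # Same logic, mirrored for the other input.
        self.right_state[key].append(row)
        return [(l, row) for l in self.left_state[key]]

    def state_size(self):
        left = sum(len(rows) for rows in self.left_state.values())
        right = sum(len(rows) for rows in self.right_state.values())
        return left + right


join = RegularJoin()
# The "static" customer row is fed once, like a table loaded at job start.
join.on_right(1, {"c_custkey": 1, "c_nationkey": 7})
# Orders keep arriving; each one is joined and also kept in state.
out = join.on_left(1, {"o_orderkey": 100, "o_custkey": 1})
```

In this model nothing is ever evicted, so `state_size()` only grows as orders arrive.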
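Interval join
An interval join is typically used to join events from two (or more) dynamic tables that are related to each other in a temporal context, for example events that happened at around the same time. Flink SQL applies special optimizations to this kind of join.
Example: join the line-item (sub-order) table with the orders table to find unpaid line items that belong to urgent orders.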
USE CATALOG default_catalog;
SELECT
o_ordertime AS `ordertime`,
o_orderkey AS `order`,
l_linenumber AS `linenumber`,
l_partkey AS `part`,
l_suppkey AS `supplier`,
l_quantity AS `quantity`
FROM prod_lineitem
JOIN prod_orders ON o_orderkey = l_orderkey
WHERE
l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime AND
l_linestatus = 'O' AND
o_orderpriority = '1-URGENT';
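Notes:
- The WHERE clause must contain a condition that relates the left and right tables on an event-time or processing-time attribute. In this example it is:
l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime
- Because this example requires l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime, Flink only needs to keep about the last 5 minutes of rows in state instead of the full history, which lowers the memory requirement.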
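The state bound described in these notes can be sketched in Python. This is a simplified model under stated assumptions, not Flink's operator: the class `IntervalJoin` and its methods are hypothetical, and a watermark `wm` is taken to mean "no future row on either input has a timestamp at or before `wm`".

```python
from datetime import datetime, timedelta

BOUND = timedelta(minutes=5)  # mirrors INTERVAL '5' MINUTE in the query


def in_interval(l_ordertime, o_ordertime):
    # l_ordertime BETWEEN o_ordertime - INTERVAL '5' MINUTE AND o_ordertime
    return o_ordertime - BOUND <= l_ordertime <= o_ordertime


class IntervalJoin:
    def __init__(self):
        self.lineitems = []  # buffered (l_ordertime, orderkey, row)
        self.orders = []     # buffered (o_ordertime, orderkey, row)

    def on_lineitem(self, l_time, key, row):
        self.lineitems.append((l_time, key, row))
        return [(row, o_row) for o_time, o_key, o_row in self.orders
                if o_key == key and in_interval(l_time, o_time)]

    def on_order(self, o_time, key, row):
        self.orders.append((o_time, key, row))
        return [(l_row, row) for l_time, l_key, l_row in self.lineitems
                if l_key == key and in_interval(l_time, o_time)]

    def on_watermark(self, wm):
        # A line item at time l can only match orders in [l, l + 5 min],
        # so it is useless once the watermark reaches l + 5 min.
        self.lineitems = [x for x in self.lineitems if x[0] + BOUND > wm]
        # An order at time o can only match line items in [o - 5 min, o],
        # so it is useless once the watermark reaches o.
        self.orders = [x for x in self.orders if x[0] > wm]


t0 = datetime(2020, 4, 1, 12, 0, 0)
join = IntervalJoin()
join.on_lineitem(t0, 1, "item-a")
matches = join.on_order(t0 + timedelta(minutes=3), 1, "order-1")
too_late = join.on_order(t0 + timedelta(minutes=10), 1, "order-1b")
join.on_watermark(t0 + timedelta(minutes=10))
```

Here `matches` holds the one pair inside the 5-minute interval, `too_late` is empty, and after the watermark both buffers are empty again: state is bounded by the interval rather than by the lifetime of the job.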
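Temporal table join (Enrichment Join with Lookup Table in MySQL)
This is the Temporal Table Join. It fits the case where an insert-only dynamic table is joined with a static table (one that is never updated, or updated infrequently).
Example: join the line-item table prod_lineitem (a dynamic table) with the live exchange-rate table prod_rates to compute the amount of orders placed in CNY.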
USE CATALOG default_catalog;
SELECT
l_proctime AS `querytime`,
l_orderkey AS `order`,
l_linenumber AS `linenumber`,
l_currency AS `currency`,
rs_rate AS `cur_rate`,
(l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
FROM prod_lineitem
JOIN hive.`default`.prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
WHERE
l_linestatus = 'O'
AND l_currency = 'CNY';
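Query result:
As the screenshot shows, the CNY exchange rate is 8.0166.
Next, change the CNY exchange rate in the MySQL dimension table to 9.999:
# Update the CNY exchange rate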
docker-compose exec mysql mysql -Dsql-demo -usql-demo -pdemo-sql
SELECT * FROM PROD_RATES;
UPDATE PROD_RATES SET RS_TIMESTAMP = '2020-04-01 01:00:00.000', RS_RATE = 9.999 WHERE RS_SYMBOL='CNY';
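In the live join result, the exchange rate also changes to 9.999:
Notes:
- Processing-time semantics: rows of the static table (the exchange-rate table) in MySQL are looked up based on processing time.
- Updates to the MySQL dimension table are reflected in the running job in real time.
Key syntax: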
JOIN hive.`default`.prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
The join names the dynamic table's processing-time attribute (l_proctime): FOR SYSTEM_TIME AS OF l_proctime
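The processing-time behaviour can be illustrated with a short Python sketch. It is only a model of the semantics, assuming a plain dict `rates` stands in for the MySQL table; the function `lookup_join` is hypothetical. Each row looks up whatever rate is stored at the moment the row is processed, and rows emitted earlier are not revised.

```python
rates = {"CNY": 8.0166}  # stand-in for PROD_RATES in MySQL: rs_symbol -> rs_rate


def lookup_join(lineitem, rates_table):
    """Enrich one line item with the rate stored right now (processing time)."""
    rate = rates_table.get(lineitem["l_currency"])
    if rate is None:
        return None  # inner join: no matching rate, no output row
    amount = (lineitem["l_extendedprice"]
              * (1 - lineitem["l_discount"])
              * (1 + lineitem["l_tax"]))
    return {
        "order": lineitem["l_orderkey"],
        "cur_rate": rate,
        "open_in_euro": amount / rate,
    }


item = {"l_orderkey": 1, "l_currency": "CNY",
        "l_extendedprice": 100.0, "l_discount": 0.0, "l_tax": 0.0}

first = lookup_join(item, rates)                        # sees 8.0166
rates["CNY"] = 9.999                                    # the UPDATE in MySQL
second = lookup_join({**item, "l_orderkey": 2}, rates)  # sees 9.999
```

`first` keeps the old rate and `second` picks up the new one, which matches the behaviour seen in the demo after the UPDATE.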
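Temporal table function join (Enrichment Join against Temporal Table)
A Temporal Table Function Join joins against a changelog so that each row is matched precisely with the version that was valid at a given point in event time.
Example: compute order amounts per currency by joining each order with the exchange rate in effect at the moment the order was created.
This reuses the use case from the Temporal Table Join section, with the MySQL dimension table replaced by a Kafka one (whenever a rate changes, the latest rate is written to Kafka).
Use the TemporalTableFunction prod_rates_temporal to look up the exchange rate as of each row's l_ordertime: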
USE CATALOG default_catalog;
SELECT
l_ordertime AS `ordertime`,
l_orderkey AS `order`,
l_linenumber AS `linenumber`,
l_currency AS `currency`,
rs_rate AS `cur_rate`,
(l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
FROM
prod_lineitem,
LATERAL TABLE(prod_rates_temporal(l_ordertime))
WHERE rs_symbol = l_currency AND
l_linestatus = 'O';
Result:
Notes:
- Event-time semantics: rows (exchange rates) of the temporal table (a Kafka topic) are matched based on event time.
- An exchange rate is changed by producing a record to the Kafka topic.
Key syntax:
LATERAL TABLE(prod_rates_temporal(l_ordertime))
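- LATERAL TABLE: the keyword for joining against a Temporal Table Function
- prod_rates_temporal(l_ordertime): the function over the exchange-rate changelog, taking the event time as its argument
- As of Flink 1.10, only inner joins are supported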
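As a rough illustration of what prod_rates_temporal(l_ordertime) does conceptually, the Python sketch below keeps every version of a rate and returns the one valid at a given event time. The class `TemporalTable` and its methods are hypothetical names, and the sample rates and timestamps are illustrative values borrowed from the MySQL example; this is a model of the semantics, not Flink's implementation.

```python
from bisect import bisect_right
from datetime import datetime


class TemporalTable:
    """Versioned view over a changelog: key -> values ordered by event time."""

    def __init__(self):
        self.versions = {}  # key -> (sorted timestamps, values in the same order)

    def apply_change(self, key, ts, value):
        # One changelog record, e.g. a new rate produced to the Kafka topic.
        times, values = self.versions.setdefault(key, ([], []))
        idx = bisect_right(times, ts)
        times.insert(idx, ts)
        values.insert(idx, value)

    def as_of(self, key, ts):
        # Latest version whose timestamp is at or before ts, else None.
        times, values = self.versions.get(key, ([], []))
        idx = bisect_right(times, ts)
        return values[idx - 1] if idx > 0 else None


rates_temporal = TemporalTable()
rates_temporal.apply_change("CNY", datetime(2020, 4, 1, 0, 0), 8.0166)
rates_temporal.apply_change("CNY", datetime(2020, 4, 1, 1, 0), 9.999)

early = rates_temporal.as_of("CNY", datetime(2020, 4, 1, 0, 30))
later = rates_temporal.as_of("CNY", datetime(2020, 4, 1, 1, 30))
before_any = rates_temporal.as_of("CNY", datetime(2020, 3, 31, 23, 0))
```

An order created at 00:30 is priced with the rate from 00:00 even if it is processed after the 01:00 change arrived, which is the difference from the processing-time lookup join above. An order older than every known version finds no rate and, with inner-join semantics, produces no output row.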