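Windows
- Time semantics only become useful in combination with window operations. Their main use is to open windows and compute over time intervals. This section looks at how the Table API and SQL use time attributes for window operations.
- The Table API and SQL offer two main kinds of windows: Group Windows and Over Windows.
Group Windows
- Group Windows are defined with the window(w: GroupWindow) clause and must be given an alias with the as clause.
- To group a table by window, the window alias must be referenced in the groupBy clause, just like a regular grouping field.
- The Table API provides a set of predefined Window classes with specific semantics, which are translated into window operations on the underlying DataStream or DataSet.
- There are three kinds of group windows: tumbling windows, sliding windows, and session windows.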
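Tumbling windows:
- Tumbling windows are defined with the Tumble class.
- over: defines the window length
- on: the time field used for grouping (time-interval windows) or ordering (row-count windows)
- as: the alias, which must appear in the subsequent groupBy
// Tumbling Event-time Window (event-time field rowtime)
.window(Tumble over 10.minutes on 'rowtime as 'w)
// Tumbling Processing-time Window (processing-time field proctime)
.window(Tumble over 10.minutes on 'proctime as 'w)
// Tumbling Row-count Window (similar to a count window: ordered by processing time, 10 rows per group)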
.window(Tumble over 10.rows on 'proctime as 'w)
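To make the tumbling semantics above concrete, here is a minimal plain-Scala sketch that does not use Flink at all. It assumes non-negative millisecond timestamps and windows aligned to epoch 0 with no offset; the names TumblingSketch, windowOf, and countPerWindow are hypothetical and exist only for this illustration.

```scala
object TumblingSketch {
  // Hypothetical helper: returns the [start, end) bounds of the tumbling
  // window that contains ts. Assumes ts >= 0 and windows aligned to epoch 0.
  def windowOf(ts: Long, sizeMs: Long): (Long, Long) = {
    val start = ts - (ts % sizeMs)
    (start, start + sizeMs)
  }

  // Counts events per (key, window start), loosely analogous to
  // grouping by a key plus the window alias and selecting a count.
  def countPerWindow(events: Seq[(String, Long)], sizeMs: Long): Map[(String, Long), Int] =
    events
      .groupBy { case (id, ts) => (id, windowOf(ts, sizeMs)._1) }
      .map { case (key, group) => key -> group.size }
}
```

Each timestamp falls into exactly one window, which is the property that distinguishes tumbling windows from sliding ones.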
Sliding windows:
- Sliding windows are defined with the Slide class.
- over: defines the window length
- every: defines the slide interval
- on: the time field used for grouping (time-interval windows) or ordering (row-count windows)
- as: the alias, which must appear in the subsequent groupBy
// Sliding Event-time Window
.window(Slide over 10.minutes every 5.minutes on 'rowtime as 'w)
// Sliding Processing-time Window
.window(Slide over 10.minutes every 5.minutes on 'proctime as 'w)
// Sliding Row-count Window
.window(Slide over 10.rows every 5.rows on 'proctime as 'w)
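With a slide smaller than the window length, one row belongs to several overlapping windows. The following plain-Scala sketch (no Flink dependency) lists the start of every window that contains a given timestamp. It assumes non-negative timestamps and epoch-aligned windows; SlidingSketch and windowStarts are hypothetical names for this illustration.

```scala
object SlidingSketch {
  // Hypothetical helper: start timestamps of all sliding windows
  // [start, start + sizeMs) that contain ts, newest window first.
  // Assumes ts >= 0 and window starts aligned to multiples of slideMs.
  def windowStarts(ts: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    val lastStart = ts - (ts % slideMs)
    Iterator
      .iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > ts - sizeMs)
      .toList
  }
}
```

For a length of 10 and a slide of 5 (in any time unit), every timestamp lands in two windows, so each row contributes to two aggregate results.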
Session windows:
- Session windows are defined with the Session class.
- withGap: the session gap (the period of inactivity that closes a session)
- on: the time field used for grouping
- as: the alias, which must appear in the subsequent groupBy
// Session Event-time Window
.window(Session withGap 10.minutes on 'rowtime as 'w)
// Session Processing-time Window
.window(Session withGap 10.minutes on 'proctime as 'w)
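Session windows have no fixed length; their boundaries follow from the data. The plain-Scala sketch below (no Flink dependency) groups the timestamps of a single key into sessions. As a simplifying assumption of this sketch, a new session starts when the next timestamp is more than gapMs after the previous one; how an exact tie at the gap boundary is treated is a choice made here, not a statement about Flink. SessionSketch and sessions are hypothetical names.

```scala
object SessionSketch {
  // Hypothetical helper: splits the timestamps of one key into sessions.
  // A timestamp joins the current session if it is at most gapMs after
  // the latest timestamp in that session; otherwise it opens a new one.
  def sessions(timestamps: Seq[Long], gapMs: Long): List[List[Long]] =
    timestamps.sorted
      .foldLeft(List.empty[List[Long]]) {
        // current.head is the most recent timestamp of the open session
        case (current :: done, ts) if ts - current.head <= gapMs =>
          (ts :: current) :: done
        case (acc, ts) =>
          List(ts) :: acc
      }
      .map(_.reverse)
      .reverse
}
```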
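Over Windows
- Over window aggregation is part of standard SQL (the OVER clause) and can be defined in the SELECT clause of a query.
- An over window aggregation computes, for each input row, an aggregate over a range of neighboring rows.
- Over windows are defined with the window(w: OverWindow*) clause and referenced by alias in the select() method.
- The Table API provides the Over class to configure the properties of an over window.
- Over windows can be defined on event time or processing time, over a range specified either as a time interval or as a row count.
- Unbounded over windows are specified with constants (UNBOUNDED_RANGE or UNBOUNDED_ROW).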
val table = input
.window([w: OverWindow] as 'w)
.select('a, 'b.sum over 'w, 'c.min over 'w)
Unbounded Over Windows
// Unbounded event-time over window (time field "rowtime")
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_RANGE as 'w)
// Unbounded processing-time over window (time field "proctime")
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_RANGE as 'w)
// Unbounded event-time row-count over window (time field "rowtime")
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_ROW as 'w)
// Unbounded processing-time row-count over window (time field "proctime")
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_ROW as 'w)
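An unbounded over window with a partition key behaves like a running aggregate per key: every input row produces one output row that reflects all earlier rows of the same partition. The plain-Scala sketch below illustrates that idea without Flink; it assumes the rows arrive already ordered by the time attribute, and UnboundedOverSketch and runningSum are hypothetical names.

```scala
import scala.collection.mutable

object UnboundedOverSketch {
  // Hypothetical helper: for each (key, value) row, emits the sum of all
  // values seen so far for that key, including the current row.
  // Assumes rows are already ordered by the time attribute.
  def runningSum(rows: Seq[(String, Double)]): List[(String, Double)] = {
    val totals = mutable.Map.empty[String, Double].withDefaultValue(0.0)
    rows.toList.map { case (key, value) =>
      totals(key) += value
      (key, totals(key))
    }
  }
}
```

Unlike a group window, the number of output rows equals the number of input rows.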
Bounded Over Windows
// Bounded event-time over window (time field "rowtime", preceding 1 minute)
.window(Over partitionBy 'a orderBy 'rowtime preceding 1.minutes as 'w)
// Bounded processing-time over window (time field "proctime", preceding 1 minute)
.window(Over partitionBy 'a orderBy 'proctime preceding 1.minutes as 'w)
// Bounded event-time row-count over window (time field "rowtime", preceding 10 rows)
.window(Over partitionBy 'a orderBy 'rowtime preceding 10.rows as 'w)
// Bounded processing-time row-count over window (time field "proctime", preceding 10 rows)
.window(Over partitionBy 'a orderBy 'proctime preceding 10.rows as 'w)
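For a bounded row-count over window, the frame of each row is the current row plus up to n preceding rows. The plain-Scala sketch below (no Flink dependency) computes a count and an average over such a frame for a single partition. It assumes the values are already ordered by the time attribute; OverSketch and rowsPreceding are hypothetical names.

```scala
object OverSketch {
  // Hypothetical helper: for each row, returns (count, average) over the
  // frame made of the current row and up to n preceding rows.
  // Assumes a single partition whose rows are already in time order.
  def rowsPreceding(values: Seq[Double], n: Int): List[(Int, Double)] =
    values.indices.toList.map { i =>
      val frame = values.slice(math.max(0, i - n), i + 1)
      (frame.size, frame.sum / frame.size)
    }
}
```

The first rows of a partition see a shorter frame, which is why the count ramps up to n + 1 before it stays constant.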
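Group Windows in SQL
There are also auxiliary functions for selecting the start and end timestamps of a group window, as well as its time attributes.
Only the TUMBLE_* functions are listed here; the functions for sliding and session windows are analogous (HOP_*, SESSION_*).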
- TUMBLE_START(time_attr, interval)
- TUMBLE_END(time_attr, interval)
- TUMBLE_ROWTIME(time_attr, interval)
- TUMBLE_PROCTIME(time_attr, interval)
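Over Windows in SQL
- When aggregating with OVER, all aggregates must be defined over the same window, that is, with the same partitioning, ordering, and range.
- Currently only windows that range from preceding rows up to the current row are supported.
- ORDER BY must be specified on a single time attribute.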
SELECT COUNT(amount) OVER (
PARTITION BY user
ORDER BY proctime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Orders
-- Multiple aggregates can also be computed over one named window
SELECT COUNT(amount) OVER w, SUM(amount) OVER w
FROM Orders
WINDOW w AS (
PARTITION BY user
ORDER BY proctime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
Hands-on code
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
object TimeAndWindowTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Create the table execution environment
    val tableEnv = StreamTableEnvironment.create(env)
    val inputStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
    // Map each line to the case class
    val dataStream: DataStream[SensorReading] = inputStream
      .map(data => {
        val dataArray = data.split(",")
        SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
      })
      // Assign timestamps and watermarks, allowing 1 second of out-of-orderness
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
          // Multiply by 1000 to get milliseconds (the input is presumably in seconds)
          override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
        })
    // Convert the stream to a table, defining the time attribute directly
    val sensorTable: Table = tableEnv
      .fromDataStream(dataStream, 'id, 'temperature, 'timestamp.rowtime as 'ts)
    // 1. Table API
    // 1.1 Group Window aggregation
    val resultTable: Table = sensorTable
      .window( Tumble over 10.seconds on 'ts as 'tw )
      .groupBy( 'id, 'tw )
      .select( 'id, 'id.count, 'tw.end )
    // 1.2 Over Window aggregation
    val overResultTable: Table = sensorTable
      .window( Over partitionBy 'id orderBy 'ts preceding 2.rows as 'ow )
      .select( 'id, 'ts, 'id.count over 'ow, 'temperature.avg over 'ow )
    // 2. SQL implementation
    // 2.1 Group Windows
    tableEnv.createTemporaryView("sensor", sensorTable)
    val resultSqlTable: Table = tableEnv.sqlQuery(
      """
        |select id, count(id), hop_end(ts, interval '4' second, interval '10' second)
        |from sensor
        |group by id, hop(ts, interval '4' second, interval '10' second)
      """.stripMargin)
    // 2.2 Over Window
    val orderSqlTable: Table = tableEnv.sqlQuery(
      """
        |select id, ts, count(id) over w, avg(temperature) over w
        |from sensor
        |window w as (
        |  partition by id
        |  order by ts
        |  rows between 2 preceding and current row
        |)
      """.stripMargin)
    // sensorTable.printSchema()
    // Print the results
    // resultTable.toRetractStream[Row].print("agg")
    // overResultTable.toAppendStream[Row].print("over result")
    orderSqlTable.toAppendStream[Row].print("order sql")
    env.execute("time and window test job")
  }
}