Code on GitHub: https://github.com/SmallScorpion/flink-tutorial.git
map
val streamMap = stream.map { x => x * 2 }
flatMap
The signature of flatMap (on collections) is: def flatMap[A,B](as: List[A])(f: A ⇒ List[B]): List[B]
For example, flatMap(List(1,2,3))(i ⇒ List(i,i))
yields List(1,1,2,2,3,3),
while List("a b", "c d").flatMap(line ⇒ line.split(" "))
yields List(a, b, c, d).
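The two collection examples above can be checked directly with plain Scala collections (no Flink dependency); a minimal sketch:

```scala
object FlatMapDemo extends App {
  // Each element is mapped to a List, and the resulting lists are concatenated
  val doubled: List[Int] = List(1, 2, 3).flatMap(i => List(i, i))
  println(doubled) // List(1, 1, 2, 2, 3, 3)

  // Splitting lines into words: one input element can produce several output elements
  val words: List[String] = List("a b", "c d").flatMap(line => line.split(" "))
  println(words) // List(a, b, c, d)
}
```

The same one-to-many semantics carry over to DataStream.flatMap below.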
val streamFlatMap = stream.flatMap { x => x.split(" ") }
Filter
val streamFilter = stream.filter { x => x == 1 }
KeyBy
DataStream → KeyedStream: logically partitions a stream into disjoint partitions, where each partition contains all elements with the same key. Internally this is implemented with hash partitioning.
The following rolling aggregations can be applied to each substream of a KeyedStream:
- sum()
- min()
- max()
- minBy()
- maxBy()
// Get the record with the lowest temperature per id group
val keyByDStream: DataStream[SensorReading] = dataDstream.keyBy("id").minBy("temperature")
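Note the difference between min and minBy: min only guarantees the minimum value of the given field (the other fields may come from an earlier record), while minBy returns the entire record holding the minimum. A plain-Scala collections analogy of the minBy behavior, using a hypothetical SensorReading case class shaped like the one in this tutorial:

```scala
case class SensorReading(id: String, timestamp: Long, temperature: Double)

object MinByDemo extends App {
  val readings = List(
    SensorReading("sensor_1", 1000L, 35.8),
    SensorReading("sensor_1", 2000L, 30.1),
    SensorReading("sensor_2", 1500L, 15.4)
  )
  // Analogy for keyBy("id").minBy("temperature"):
  // for each id, keep the whole record with the lowest temperature
  val minPerId: Map[String, SensorReading] =
    readings.groupBy(_.id).map { case (id, rs) => id -> rs.minBy(_.temperature) }
  println(minPerId("sensor_1")) // SensorReading(sensor_1,2000,30.1)
}
```

Unlike this one-shot collections version, Flink's keyed aggregations are rolling: they emit an updated result for every arriving element.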
Reduce
KeyedStream → DataStream: an aggregation over a keyed stream that combines the current element with the last aggregated result to produce a new value. The returned stream contains the result of every aggregation step, not just the final result of the last aggregation.
// 3. A more complex aggregation with reduce: the minimum temperature seen for the current id, together with the latest timestamp + 1
val reduceStream: DataStream[SensorReading] = dataDstream
  .keyBy("id")
  .reduce( (curState, newData) =>
    // curState is the previous aggregate; newData is the incoming record
    SensorReading(curState.id, newData.timestamp + 1, curState.temperature.min(newData.temperature)) )
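Because reduce emits a result for every incoming element, its output resembles scanLeft over the prefixes of a collection rather than a single fold. A plain-Scala sketch of the same reduce function on hypothetical data (same SensorReading shape as above, no Flink dependency):

```scala
case class SensorReading(id: String, timestamp: Long, temperature: Double)

object ReduceDemo extends App {
  val readings = List(
    SensorReading("sensor_1", 1000L, 35.8),
    SensorReading("sensor_1", 2000L, 30.1),
    SensorReading("sensor_1", 3000L, 32.0)
  )
  // Same logic as the reduce above: keep the minimum temperature, latest timestamp + 1
  def combine(cur: SensorReading, next: SensorReading): SensorReading =
    SensorReading(cur.id, next.timestamp + 1, cur.temperature.min(next.temperature))

  // scanLeft over the tail shows that every intermediate result is emitted
  val emitted = readings.tail.scanLeft(readings.head)(combine)
  emitted.foreach(println)
  // SensorReading(sensor_1,1000,35.8)
  // SensorReading(sensor_1,2001,30.1)
  // SensorReading(sensor_1,3001,30.1)
}
```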
split/select
DataStream → SplitStream: (split) splits one DataStream into two or more DataStreams according to some criterion. (It is actually still a single stream; the elements are just tagged with different labels.)
SplitStream → DataStream: (select) retrieves one or more DataStreams from a SplitStream.
Requirement: split the sensor data into two streams by temperature, with 30 degrees as the boundary.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.scala._
/**
 * Split/select transformation: divide the stream into high- and low-temperature streams at 30 degrees
 */
object SplitAndSelectTransform {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
    val dataDstream: DataStream[SensorReading] = inputDStream.map(
      data => {
        val dataArray: Array[String] = data.split(",")
        SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
      })
    // Tag each element according to its temperature
    val splitStream: SplitStream[SensorReading] = dataDstream.split(
      data => {
        if (data.temperature >= 30)
          Seq("high")
        else
          Seq("low")
      }
    )
    // Convert the SplitStream back into DataStreams based on the tags
    val highSensorDStream: DataStream[SensorReading] = splitStream.select("high")
    val lowSensorDStream: DataStream[SensorReading] = splitStream.select("low")
    val allSensorDStream: DataStream[SensorReading] = splitStream.select("high", "low")
    highSensorDStream.print("high")
    lowSensorDStream.print("low")
    allSensorDStream.print("all")
    env.execute("split and select test job")
  }
}
Connect and CoMap
DataStream, DataStream → ConnectedStreams: (connect) connects two data streams while preserving their types. After the two streams are connected, they are merely placed inside one stream; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.
ConnectedStreams → DataStream: (CoMap, CoFlatMap) operate on a ConnectedStreams with the same purpose as map and flatMap, applying a separate map/flatMap function to each of the two streams.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.scala._
/**
 * Merging streams with connect/coMap
 */
object ConnectAndCoMapTransform {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
    val dataDstream: DataStream[SensorReading] = inputDStream.map(
      data => {
        val dataArray: Array[String] = data.split(",")
        SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
      })
    // Tag each element according to its temperature
    val splitStream: SplitStream[SensorReading] = dataDstream.split(
      data => {
        if (data.temperature >= 30)
          Seq("high")
        else
          Seq("low")
      }
    )
    // Convert the SplitStream back into DataStreams based on the tags
    val highSensorDStream: DataStream[SensorReading] = splitStream.select("high")
    val lowSensorDStream: DataStream[SensorReading] = splitStream.select("low")
    // To show that connect can merge two streams of different types,
    // convert one of the streams into tuples first
    val highWarningDStream: DataStream[(String, Double)] = highSensorDStream.map(
      data => (data.id, data.temperature)
    )
    // Connect the two streams
    val connectedStreams: ConnectedStreams[(String, Double), SensorReading] = highWarningDStream
      .connect(lowSensorDStream)
    // Process each stream separately, merging the results into one stream
    val coMapDStream: DataStream[(String, Double, String)] = connectedStreams.map(
      // highWarningData is a tuple of type (String, Double)
      highWarningData => (highWarningData._1, highWarningData._2, "warning"),
      // lowTempData is a SensorReading
      lowTempData => (lowTempData.id, lowTempData.temperature, "normal")
    )
    coMapDStream.print("coMap")
    env.execute("transform test job")
  }
}
Union
DataStream → DataStream: (union) unions two or more DataStreams, producing a new DataStream that contains all elements of the input streams. (Multiple streams can be merged at once.)
// The streams must be of the same type
val unionDStream: DataStream[SensorReading] = highSensorDStream.union(lowSensorDStream,allSensorDStream)
Differences between Connect and Union
1. Union requires the input streams to have the same type; Connect allows different types, which can then be unified in the subsequent coMap.
2. Connect can only operate on two streams, while Union can operate on many.
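A quick collections analogy for the two points above (plain Scala, no Flink; the data is hypothetical): union is like concatenating any number of same-typed lists, while connect + coMap is like keeping exactly two differently typed lists side by side and mapping each with its own function into a common output type.

```scala
object ConnectVsUnionDemo extends App {
  // Union analogy: all inputs share one type; any number of them can be merged
  val high: List[Double]  = List(35.8, 32.0)
  val low: List[Double]   = List(15.4)
  val extra: List[Double] = List(6.7)
  val unioned: List[Double] = high ++ low ++ extra
  println(unioned) // List(35.8, 32.0, 15.4, 6.7)

  // Connect + coMap analogy: exactly two inputs, possibly of different types;
  // each side gets its own map function that converts to a common type
  val warnings: List[(String, Double)] = List(("sensor_1", 35.8))
  val normals: List[Double]            = List(15.4)
  val coMapped: List[String] =
    warnings.map { case (id, t) => s"$id: $t warning" } ++
      normals.map(t => s"$t normal")
  println(coMapped)
}
```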