1. Flink vs. Spark
- Spark is batch processing: it works on bounded, persisted, large datasets where the computation needs access to the whole dataset, so it suits offline computing.
- Flink is stream processing: it works on unbounded, real-time data and produces results without needing the whole dataset, so it suits real-time computing.
- Spark's "real-time" processing only shortens the batch interval; it can still serve real-time use cases, but it remains micro-batching.
- Flink's real-time processing is true streaming, with lower latency than Spark.
- Flink is event-driven; Kafka is another event-driven framework, and via two-phase commit it can achieve true exactly-once semantics (for every message sent, the receiver is guaranteed to process it only once).
- Flink also has a weakness: its SQL support is not as convenient as Spark's, so many large companies build their own extensions on top of Flink to meet their needs.
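The exactly-once guarantee mentioned above rests on a two-phase commit: output is first staged (pre-commit) and only made visible by an idempotent commit, so a replayed commit cannot duplicate it. A toy in-memory sketch of that idea (illustrative names only, not the Kafka or Flink implementation):

```scala
import scala.collection.mutable

class TwoPhaseSink {
  private val pending   = mutable.Map[Long, List[String]]()   // staged, not yet visible
  private val committed = mutable.ListBuffer[String]()        // visible output
  private val done      = mutable.Set[Long]()                 // committed transaction ids

  // Phase 1: stage the records under a transaction id; nothing is visible yet
  def preCommit(txnId: Long, records: List[String]): Unit =
    pending(txnId) = records

  // Phase 2: make the staged records visible; idempotent, so a replay is a no-op
  def commit(txnId: Long): Unit =
    if (!done(txnId)) {
      committed ++= pending.getOrElse(txnId, Nil)
      done += txnId
    }

  def output: List[String] = committed.toList
}
```

Even if a failure forces `commit(1L)` to be retried, the `done` set turns the retry into a no-op, which is the "received only once" behavior described above.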
2. Flink internals
2.1 Internal components
2.1.1 Environment
- Flink environments are usually obtained via getExecutionEnvironment: run locally and it returns a local environment, run on a cluster and it returns a cluster environment.
- There are also createLocalEnvironment (local) and createRemoteEnvironment (remote).
2.1.2 Source
- Data mainly comes from Kafka; see the API operations below.
2.1.3 Transform
- Most of Flink's transform operators resemble Spark's, but some differ.
- Flink splits Spark's reduceByKey into two steps: keyBy and reduce.
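A plain-Scala sketch (collections only, not the Flink API) of what keyBy followed by reduce computes together, i.e. the two halves of Spark's reduceByKey:

```scala
val words = List("a", "b", "a", "c", "b", "a").map(w => (w, 1))

// keyBy: group records by key (in Flink this partitions the stream)
val keyed: Map[String, List[(String, Int)]] = words.groupBy(_._1)

// reduce: fold the values of each key pairwise
val counts: Map[String, Int] =
  keyed.map { case (k, vs) => k -> vs.map(_._2).reduce(_ + _) }
// counts == Map("a" -> 3, "b" -> 2, "c" -> 1)
```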
- Flink has split-select, similar to Flume's interceptor and channel selector: incoming records are tagged by split, and each tagged substream is then retrieved with select.
- Union and Connect-CoMap: union forcibly merges two streams and requires them to have the same element type, while Connect-CoMap accepts two streams of different types. After connect, the two streams share a buffer (ConnectedStreams) where they "get acquainted"; CoMap then operates on the ConnectedStreams, doing the same job as map/flatMap by applying a separate map or flatMap to each of the two streams.
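The type constraint can be illustrated with plain Scala collections (an analogy only; Flink's ConnectedStreams is not an Either): union needs one common element type, while connect keeps both types until CoMap maps each branch back to a common one.

```scala
val clicks: List[Int]    = List(1, 2, 3)
val labels: List[String] = List("x", "y")

// union: both sides must already have the same type
val unioned: List[Int] = clicks ++ List(4, 5)

// connect + CoMap: carry both types side by side, then map each branch
val connected: List[Either[Int, String]] =
  clicks.map(Left(_)) ++ labels.map(Right(_))
val coMapped: List[String] = connected.map {
  case Left(n)  => s"click:$n"   // "map1", applied to the first stream
  case Right(s) => s"label:$s"   // "map2", applied to the second stream
}
// coMapped == List("click:1", "click:2", "click:3", "label:x", "label:y")
```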
2.1.4 Sink
- Flink is strict about sinks: custom sinks must extend the officially specified sink classes.
2.2 Runtime
2.2.1 Job submission (YARN mode)
- After a Flink job is submitted, the client uploads the Flink jar and configuration to HDFS.
- The client then submits the job to the YARN ResourceManager, which allocates a container and tells the corresponding NodeManager to start the ApplicationMaster.
- The ApplicationMaster loads the Flink jar and configuration, builds the environment, and starts the JobManager; it then requests resources from the ResourceManager to start TaskManagers.
- Once the ResourceManager allocates containers, the ApplicationMaster tells the NodeManagers on those nodes to start the TaskManagers.
- Each NodeManager loads the Flink jar and configuration, builds the environment, and starts a TaskManager; once running, the TaskManager sends heartbeats to the JobManager and waits for it to assign tasks.
2.3 Workers and slots
- Each worker (TaskManager) can hold multiple slots, which raises the achievable parallelism.
2.4 EventTime and windows
2.4.1 EventTime
- In Flink streaming, the vast majority of business logic uses eventTime; ProcessingTime or IngestionTime is a fallback for when eventTime is unavailable.
- Watermark: a mechanism for handling out-of-order events, usually combined with windows to handle them correctly.
A Watermark in the data stream asserts that all records with timestamps smaller than the Watermark have already arrived; window firing is therefore driven by Watermarks.
A Watermark can be understood as a delayed trigger. Given a configured delay t, the system tracks the largest event time seen so far, maxEventTime, and treats all data with eventTime smaller than maxEventTime - t as arrived; if a window's end time is no later than maxEventTime - t, that window is triggered.
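The triggering rule above can be simulated in a few lines of plain Scala (not the Flink API): the watermark is maxEventTime - t, and a window fires once its end time is covered by the watermark.

```scala
val t = 2L                                    // watermark delay
val eventTimes = List(1L, 4L, 3L, 7L)         // out-of-order event times
val windowEnds = List(5L, 10L)                // end timestamps of two windows

var maxEventTime = Long.MinValue
var pending = windowEnds
val fired = scala.collection.mutable.ListBuffer[Long]()

for (ts <- eventTimes) {
  maxEventTime = math.max(maxEventTime, ts)
  val watermark = maxEventTime - t
  val (ready, rest) = pending.partition(_ <= watermark)
  fired ++= ready                             // windows triggered by this watermark
  pending = rest
}
// after the event at time 7, watermark = 5, so the window ending at 5 fires
```

Note how the late event at time 3 is still accepted: the watermark only advanced to 2 by then, so the window ending at 5 had not yet fired.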
2.4.2 Windows
- Tumbling windows: slice the data into windows of a fixed length.
Characteristics: time-aligned, fixed window length, no overlap.
Typical use: BI statistics (aggregations per time period).
- Sliding windows: a more general form of fixed windows, defined by a fixed window length plus a slide interval.
Characteristics: time-aligned, fixed window length, overlapping.
Typical use: statistics over the most recent period (e.g. the failure rate of an interface over the last 5 minutes, to decide whether to raise an alert).
- Session windows: composed of a series of events plus a timeout gap of a specified length, similar to a web application's session; a new window starts once no new data has arrived for that long.
Characteristics: not time-aligned.
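A plain-Scala sketch of how tumbling vs. sliding windows assign an element's timestamp to window start times. The formulas follow the common align-to-epoch convention and are a simplification of Flink's actual assigners (which also support offsets):

```scala
// Tumbling: each timestamp falls into exactly one window of length `size`
def tumblingStart(ts: Long, size: Long): Long = ts - (ts % size)

// Sliding: each timestamp falls into every window of length `size` whose
// start is a multiple of `slide` and whose range [start, start + size) covers it
def slidingStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val lastStart = ts - (ts % slide)
  (lastStart to math.max(lastStart - size + slide, 0L) by -slide)
    .filter(start => ts < start + size)
}

// tumblingStart(13, 5)     == 10          (one window per element, no overlap)
// slidingStarts(13, 10, 5) == Seq(10, 5)  (overlapping windows)
```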
3. Flink API operations
3.1 Environment setup
// Dependencies (pom.xml fragment)
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.7.0</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.7.0</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <!-- This plugin compiles Scala code into class files -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.4.6</version>
      <executions>
        <execution>
          <!-- Bind to Maven's compile phase -->
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>3.0.0</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
// Most common
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// Local, with an explicit parallelism
val env = StreamExecutionEnvironment.createLocalEnvironment(1)
// Remote: returns a cluster execution environment and submits the jar to a remote server.
// You must pass the JobManager's host and port and the jar to run on the cluster.
val env = ExecutionEnvironment.createRemoteEnvironment("jobmanager-hostname", 6123, "C://jar//flink//wordcount.jar")
3.2 Kafka source
// Kafka utility class
object MyKafkaUtil {
  val prop = new Properties()
  prop.setProperty("bootstrap.servers", "bigdata1:9092")
  prop.setProperty("group.id", "test")
  def getConsumer(topic: String): FlinkKafkaConsumer011[String] = {
    new FlinkKafkaConsumer011[String](topic, new SimpleStringSchema(), prop)
  }
}
// Main class (the topic name "test" here is illustrative)
def main(args: Array[String]): Unit = {
  val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  environment.addSource(MyKafkaUtil.getConsumer("test")).print()
  environment.execute()
}
3.3 split-select
// Split the appstore channel from the other channels into two separate streams
val splitStream: SplitStream[StartUpLog] = startUplogDstream.split { startUplog =>
  if ("tag1" == startUplog.ch) List(startUplog.ch) else List("tag2")
}
val appStoreStream: DataStream[StartUpLog] = splitStream.select("tag1")
appStoreStream.print("tag1:").setParallelism(1)
val otherStream: DataStream[StartUpLog] = splitStream.select("tag2")
otherStream.print("tag2:").setParallelism(1)
3.4 Merging streams
// connect-CoMap
// Print after merging
val connStream: ConnectedStreams[StartUpLog, StartUpLog] = appStoreStream.connect(otherStream)
val allStream: DataStream[String] = connStream.map(
(log1: StartUpLog) => log1.ch,
(log2: StartUpLog) => log2.ch
)
allStream.print("connect::")
//Union
// Print after merging
val unionStream: DataStream[StartUpLog] = appStoreStream.union(otherStream)
unionStream.print("union:::")
3.5 Sink
3.5.1 Kafka sink
// Dependency
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.11 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
  <version>1.7.0</version>
</dependency>
// Utility method (add to MyKafkaUtil; the broker address matches the consumer config above)
val brokerList = "bigdata1:9092"
def getProducer(topic: String): FlinkKafkaProducer011[String] = {
  new FlinkKafkaProducer011[String](brokerList, topic, new SimpleStringSchema())
}
// Main
val myKafkaProducer: FlinkKafkaProducer011[String] = MyKafkaUtil.getProducer("channel_sum")
sumDstream.map( chCount=>chCount._1+":"+chCount._2 ).addSink(myKafkaProducer)
3.5.2 Redis
// Dependency
<!-- https://mvnrepository.com/artifact/org.apache.bahir/flink-connector-redis -->
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>flink-connector-redis_2.11</artifactId>
  <version>1.0</version>
</dependency>
// Utility object
object MyRedisUtil {
  val conf = new FlinkJedisPoolConfig.Builder().setHost("hadoop1").setPort(6379).build()
  def getRedisSink(): RedisSink[(String, String)] = {
    new RedisSink[(String, String)](conf, new MyRedisMapper)
  }
  class MyRedisMapper extends RedisMapper[(String, String)] {
    override def getCommandDescription: RedisCommandDescription = {
      new RedisCommandDescription(RedisCommand.HSET, "channel_count")
      // new RedisCommandDescription(RedisCommand.SET)
    }
    override def getValueFromData(t: (String, String)): String = t._2
    override def getKeyFromData(t: (String, String)): String = t._1
  }
}
//main
sumDstream.map( chCount=>(chCount._1,chCount._2+"" )).addSink(MyRedisUtil.getRedisSink())
3.5.3 Custom JDBC sink
// Dependencies
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.44</version>
</dependency>
<dependency>
  <groupId>com.alibaba</groupId>
  <artifactId>druid</artifactId>
  <version>1.1.10</version>
</dependency>
// Utility class
class MyJdbcSink(sql: String) extends RichSinkFunction[Array[Any]] {
  val driver = "com.mysql.jdbc.Driver"
  val url = "jdbc:mysql://hadoop2:3306/gmall2019?useSSL=false"
  val username = "root"
  val password = "123123"
  val maxActive = "20"
  var connection: Connection = null

  // Create the connection (called once when the sink opens)
  override def open(parameters: Configuration): Unit = {
    val properties = new Properties()
    properties.put("driverClassName", driver)
    properties.put("url", url)
    properties.put("username", username)
    properties.put("password", password)
    properties.put("maxActive", maxActive)
    val dataSource: DataSource = DruidDataSourceFactory.createDataSource(properties)
    connection = dataSource.getConnection()
  }

  // Called once per record
  override def invoke(values: Array[Any]): Unit = {
    val ps: PreparedStatement = connection.prepareStatement(sql)
    println(values.mkString(","))
    for (i <- 0 until values.length) {
      ps.setObject(i + 1, values(i))
    }
    ps.executeUpdate()
    ps.close() // release the statement after each record
  }

  override def close(): Unit = {
    if (connection != null) {
      connection.close()
    }
  }
}
//main
val startUplogDstream: DataStream[StartUpLog] = dstream.map{ JSON.parseObject(_,classOf[StartUpLog])}
val jdbcSink = new MyJdbcSink("insert into z_startup values(?,?,?,?,?)")
startUplogDstream.map(startuplog=>Array(startuplog.mid,startuplog.uid,startuplog.ch,startuplog.area, startuplog.ts)).addSink(jdbcSink)