Computing the Hottest Top-N Items with Flink

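To find the hottest items in each window, we group the stream a second time, this time by window: we call keyBy() on the windowEnd field of ItemViewCount. A custom TopN function, TopNHotItems, built on ProcessFunction, then ranks the items by click count, keeps the top 3 (the full listing further down passes 5 instead), and formats the ranking as a string for downstream output.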

.keyBy("windowEnd")

.process(new TopNHotItems(3))
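
To make the "group by windowEnd, then rank" step concrete, here is a minimal plain-Scala sketch with no Flink dependency. It only illustrates the logic on an in-memory collection; the object name TopNSketch and the helper topNPerWindow are hypothetical and are not part of the job, and ItemViewCount is redefined only so the snippet stands alone.

```scala
// Plain-Scala sketch (no Flink): group per-item window counts by windowEnd,
// then keep the n largest counts inside each group.
// TopNSketch and topNPerWindow are illustrative names, not part of the job.
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

object TopNSketch {
  def topNPerWindow(counts: Seq[ItemViewCount], n: Int): Map[Long, Seq[ItemViewCount]] =
    counts
      .groupBy(_.windowEnd) // plays the role of keyBy("windowEnd")
      .map { case (windowEnd, items) =>
        // Sort by count descending and keep the first n entries.
        windowEnd -> items.sortBy(-_.count).take(n)
      }

  def main(args: Array[String]): Unit = {
    val counts = Seq(
      ItemViewCount(1L, 1000L, 5L),
      ItemViewCount(2L, 1000L, 9L),
      ItemViewCount(3L, 1000L, 7L),
      ItemViewCount(1L, 2000L, 4L)
    )
    // Print the ranking of each window, ordered by window end.
    topNPerWindow(counts, 2).toSeq.sortBy(_._1).foreach { case (windowEnd, top) =>
      println(s"windowEnd=$windowEnd -> ${top.map(i => s"${i.itemId}:${i.count}").mkString(", ")}")
    }
  }
}
```

The streaming job cannot sort a complete collection like this, because results arrive one at a time; that is why it buffers them in state and waits for a timer before ranking.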

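ProcessFunction is a low-level API that Flink provides for implementing more advanced logic. Its key feature for our purposes is timers, which can run on either event time or processing time. In this example we use a timer to decide when all per-item click counts for a given window have arrived. Because watermark progress is global, processElement registers an event-time timer at windowEnd + 1 every time it receives an ItemViewCount (Flink ignores repeated registrations for the same key and timestamp). When that timer fires, the watermark has reached windowEnd + 1, which means every per-item window result for that windowEnd has been received. In onTimer() we sort the collected items by click count, take the top N, format the ranking as a string, and emit it.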
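
The timer behaviour described above can be imitated with a toy model. This is only a sketch under simplifying assumptions (a single key, watermarks supplied by hand); ToyTimerService is a hypothetical class written for illustration and is not Flink's TimerService implementation.

```scala
import scala.collection.mutable

// Toy model of event-time timers for a single key. Not Flink's implementation;
// ToyTimerService is a hypothetical name used only for this illustration.
class ToyTimerService {
  // A sorted set: registering the same timestamp twice keeps a single entry.
  private val timers = mutable.TreeSet.empty[Long]

  def registerEventTimeTimer(timestamp: Long): Unit = timers += timestamp

  def pendingTimers: Int = timers.size

  // Advance the watermark and return the timers that fire
  // (timestamp <= watermark), in ascending order.
  def advanceWatermark(watermark: Long): List[Long] = {
    val due = timers.takeWhile(_ <= watermark).toList
    timers --= due
    due
  }
}

object TimerSketch {
  def main(args: Array[String]): Unit = {
    val service = new ToyTimerService
    val windowEnd = 3600000L
    // Three per-item results for the same window each register windowEnd + 1.
    (1 to 3).foreach(_ => service.registerEventTimeTimer(windowEnd + 1))
    println(s"pending timers: ${service.pendingTimers}")
    // A watermark equal to windowEnd is not enough to fire the timer.
    println(s"fired at windowEnd: ${service.advanceWatermark(windowEnd)}")
    // Once the watermark reaches windowEnd + 1 the timer fires exactly once.
    println(s"fired at windowEnd + 1: ${service.advanceWatermark(windowEnd + 1)}")
  }
}
```

Registering on every element keeps processElement simple: there is no need to track whether a timer already exists for the window, since duplicates collapse into one.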

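We also use ListState<ItemViewCount> to store every ItemViewCount message we receive, so that the buffered data is neither lost nor left inconsistent if a failure occurs. ListState is a state API that Flink provides with an interface similar to Java's List. It is integrated with the framework's checkpoint mechanism, so the state it holds gets exactly-once guarantees automatically.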
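
Why checkpointed state survives a failure without losing or duplicating entries can be shown with a toy model. This is a deliberately simplified sketch, assuming a single operator and a replayable source; Flink's real checkpointing is distributed and considerably more involved. Snapshot, ToyOperator, and CheckpointSketch are hypothetical names used only here.

```scala
// Toy model of checkpointed list state. Snapshot, ToyOperator and
// CheckpointSketch are hypothetical names; this is not Flink's mechanism.
final case class Snapshot(offset: Int, buffered: Vector[Long])

class ToyOperator(input: Vector[Long]) {
  private var offset = 0                    // read position in the replayable source
  private var buffered = Vector.empty[Long] // stands in for the ListState contents

  def hasNext: Boolean = offset < input.length
  def processNext(): Unit = { buffered :+= input(offset); offset += 1 }
  def checkpoint(): Snapshot = Snapshot(offset, buffered)
  def restore(s: Snapshot): Unit = { offset = s.offset; buffered = s.buffered }
  def state: Vector[Long] = buffered
}

object CheckpointSketch {
  // Process the input, take one checkpoint after `checkpointAt` elements, and
  // simulate one failure after `failAt` elements have been processed.
  def runWithFailure(input: Vector[Long], checkpointAt: Int, failAt: Int): Vector[Long] = {
    val op = new ToyOperator(input)
    var snapshot = op.checkpoint()
    var processed = 0
    var failed = false
    while (op.hasNext) {
      if (!failed && processed == failAt) {
        failed = true
        op.restore(snapshot) // roll back state and source position together
      } else {
        op.processNext()
        processed += 1
        if (processed == checkpointAt) snapshot = op.checkpoint()
      }
    }
    op.state
  }

  def main(args: Array[String]): Unit = {
    val input = Vector(1L, 2L, 3L, 4L, 5L)
    // Despite the failure, each element ends up in the state exactly once.
    println(runWithFailure(input, checkpointAt = 2, failAt = 4))
  }
}
```

The point of the sketch is that the state and the source position are restored together, so elements processed after the last checkpoint are replayed into a state that does not yet contain them.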

package analysis


import java.sql.Timestamp

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

/**
 * @author https://blog.csdn.net/qq_38704184
 * @package analysis
 * @date 2019/11/11 17:45
 * @version 1.0
 */
// Case class for input records
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

// Case class for output records: the view count of one item in one window
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

object HotItems {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    env.readTextFile("E:\\bigdata\\037_Flink項目\\037_Flink項目\\UserBehavior.csv")
      .map(line => {
        val linearray: Array[String] = line.split(",")
        UserBehavior(linearray(0).toLong, linearray(1).toLong, linearray(2).toInt, linearray(3), linearray(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000)
      .filter(_.behavior == "pv")
      .keyBy("itemId")
      .timeWindow(Time.hours(1), Time.minutes(1))
      .aggregate(new CountAGG(), new WindowResultFunction())
      .keyBy("windowEnd")
      .process(new TopNHotItems(5))
      .print()

    env.execute("Hot Items Job")

  }
}

class CountAGG extends AggregateFunction[UserBehavior, Long, Long] {
  override def createAccumulator(): Long = 0L

  override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}

class WindowResultFunction extends WindowFunction[Long, ItemViewCount, Tuple, TimeWindow] {
  override def apply(key: Tuple,
                     window: TimeWindow,
                     input: Iterable[Long],
                     out: Collector[ItemViewCount]): Unit = {
    val itemId: Long = key.asInstanceOf[Tuple1[Long]].f0
    val count: Long = input.iterator.next()
    out.collect(ItemViewCount(itemId, window.getEnd, count))
  }
}

// Custom process function that emits the top N items for each window
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple, ItemViewCount, String] {
  // ListState that buffers every ItemViewCount received for the current windowEnd
  private var itemState: ListState[ItemViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // Describe the state: its name and element type
    val itemStateDesc = new ListStateDescriptor[ItemViewCount]("itemState", classOf[ItemViewCount])
    itemState = getRuntimeContext.getListState(itemStateDesc)
  }

  override def processElement(value: ItemViewCount,
                              ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context,
                              out: Collector[String]): Unit = {
    itemState.add(value)
    // Register an event-time timer at windowEnd + 1; when it fires, all results for this window have arrived
    ctx.timerService().registerEventTimeTimer(value.windowEnd + 1)
  }

  // When the timer fires: read everything from state, sort, take the top N, and emit
  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    super.onTimer(timestamp, ctx, out)
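    // Copy the buffered per-item view counts out of state
    val allItems: ListBuffer[ItemViewCount] = ListBuffer()
    // JavaConverters with an explicit asScala replaces the deprecated implicit JavaConversions
    import scala.collection.JavaConverters._
    for (item <- itemState.get().asScala) {
      allItems += item
    }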
    // Clear the state to free space
    itemState.clear()
    // Sort by view count in descending order and keep the top N
    val sortedItems: ListBuffer[ItemViewCount] = allItems.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

    // Format the ranking for printing
    val result = new StringBuilder()
    result.append("====================================\n")
    result.append("Time: ")
    result.append(new Timestamp(timestamp - 1)).append("\n")

    for (i <- sortedItems.indices) {
      val currentItem: ItemViewCount = sortedItems(i)
      // Output format, e.g.  No1:  itemId=12224  views=2413
      result.append("No").append(i + 1).append(":")
        .append("  itemId=").append(currentItem.itemId)
        .append("  views=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // Throttle the output so it is readable in a demo; blocking here stalls the operator
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}
