Preface
(All code below has been tested by hand.)
Streaming-kafka-0-8: MySQL, Zookeeper
Streaming-kafka-0-10: Kafka, Redis
These examples are compiled from code shared by others, collected here for my own reference. In production, Kafka offsets are usually kept in HBase, Kafka itself, or MySQL; maintaining them in Zookeeper consumes Zookeeper resources, and Zookeeper capacity is relatively expensive.
Kafka, a popular distributed publish-subscribe messaging system known for high throughput, low latency, and high reliability, has become a common streaming data source for Spark Streaming.
The official approach is to obtain OffsetRange objects from the input DStream: each OffsetRange carries the complete offset information for one partition of a topic, and Spark Streaming refreshes these objects after every batch. So you only need a suitable place to persist them (e.g. HBase, HDFS, MySQL) to manage offsets yourself.
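As a minimal sketch of that pattern (the `saveOffsets` function is a placeholder for whichever store the sections below use; for the 0-10 API the imports come from `org.apache.spark.streaming.kafka010` instead):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// 'stream' is any direct Kafka DStream; 'saveOffsets' stands in for
// MySQL/HBase/Zookeeper/Redis/Kafka persistence shown in the sections below.
def trackOffsets[T](stream: DStream[T], saveOffsets: Array[OffsetRange] => Unit): Unit = {
  stream.foreachRDD { rdd =>
    // Cast the batch RDD back to its offset metadata
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd here ...
    // o.topic, o.partition, o.fromOffset, o.untilOffset are what gets persisted
    saveOffsets(offsetRanges)
  }
}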
1. Storing offsets in MySQL
First, create a table in MySQL to hold the offsets.
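A sketch of the DDL, with column names inferred from the SQL statements used in the streaming job below (types and engine are assumptions):

-- one row per (consumer group, topic, partition)
CREATE TABLE `offset` (
  `groupid`    VARCHAR(100) NOT NULL,
  `topic`      VARCHAR(100) NOT NULL,
  `partitions` INT          NOT NULL,
  `offset`     BIGINT       NOT NULL,
  PRIMARY KEY (`groupid`, `topic`, `partitions`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;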
Database configuration.
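scalikejdbc's `DBs.setupAll()` reads the `db.default.*` keys from `src/main/resources/application.conf`; a minimal sketch, with host, database name, and credentials as placeholders (this also assumes the MySQL JDBC driver, mysql:mysql-connector-java, is on the classpath):

db.default.driver = "com.mysql.jdbc.Driver"
db.default.url = "jdbc:mysql://hadoop01:3306/test?characterEncoding=utf-8"
db.default.user = "root"
db.default.password = "root"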
KafkaProducer code (a simple test-data generator):
package Utils

import java.util.Properties

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object kafka_producer {
  def main(args: Array[String]): Unit = {
    val topic = "word"
    val brokers = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val prop = new Properties()
    prop.put("metadata.broker.list", brokers)
    prop.put("serializer.class", "kafka.serializer.StringEncoder")
    val kafkaConfig = new ProducerConfig(prop)
    val producer = new Producer[String, String](kafkaConfig)
    // Five fixed sentences; one is sent at random every two seconds
    val content: Array[String] = new Array[String](5)
    content(0) = "kafka kafka produce"
    content(1) = "kafka produce message"
    content(2) = "hello world hello"
    content(3) = "wordcount topK topK"
    content(4) = "hbase spark kafka"
    while (true) {
      val i = (math.random * 5).toInt
      producer.send(new KeyedMessage[String, String](topic, content(i)))
      println(content(i))
      Thread.sleep(2000)
    }
  }
}
<!-- scalikejdbc -->
<dependency>
    <groupId>org.scalikejdbc</groupId>
    <artifactId>scalikejdbc-core_2.11</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.scalikejdbc</groupId>
    <artifactId>scalikejdbc_2.11</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.scalikejdbc</groupId>
    <artifactId>scalikejdbc-config_2.11</artifactId>
    <version>2.5.0</version>
</dependency>
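This section and the Zookeeper section also need the 0-8 integration artifact (the version shown is an assumption; align it with your Spark release):

<!-- spark-streaming-kafka-0-8 (assumed version) -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.0</version>
</dependency>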
The Spark Streaming code. Here the results are simply printed to the console; the offsets are maintained in MySQL.
package SparkStreaming

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scalikejdbc.{DB, SQL}
import scalikejdbc.config.DBs

object SparkStreamingOffsetMySql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingOffsetMysql")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // Basic settings
    val groupid = "GPMMCC"
    val brokerList = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val topic = "word"
    val topics = Set(topic)
    val kafkaParams = Map(
      "metadata.broker.list" -> brokerList,
      "group.id" -> groupid,
      "auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString
    )
    // Connect to MySQL and load the offsets saved for this consumer group
    DBs.setupAll()
    val fromdbOffset: Map[TopicAndPartition, Long] = DB.readOnly {
      implicit session => {
        SQL(s"select * from offset where groupid = '${groupid}'").map(
          m => (TopicAndPartition(m.string("topic"), m.string("partitions").toInt), m.string("offset").toLong)
        ).toList().apply()
      }.toMap
    }
    // Create a DStream to read from Kafka
    var kafkaDStream: InputDStream[(String, String)] = null
    // Decide how to start based on what MySQL returned
    if (fromdbOffset.isEmpty) {
      // First run: no saved offsets, start according to auto.offset.reset
      kafkaDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    } else {
      // Not the first run. Two goals:
      // 1. do not re-consume messages
      // 2. keep the offsets valid even if Kafka has already purged old segments
      var checkOffset = Map[TopicAndPartition, Long]()
      // Load the Kafka cluster metadata
      val kafkaCluster = new KafkaCluster(kafkaParams)
      // Fetch the earliest available offset for every saved (topic, partition)
      val earliestOffsets: Either[Err, Map[TopicAndPartition, LeaderOffset]] =
        kafkaCluster.getEarliestLeaderOffsets(fromdbOffset.keySet)
      // Compare the offsets stored in MySQL against Kafka's earliest offsets
      if (earliestOffsets.isRight) {
        val tap: Map[TopicAndPartition, LeaderOffset] = earliestOffsets.right.get
        checkOffset = fromdbOffset.map(f => {
          // Take the larger of the two: the stored offset if it is still valid,
          // otherwise Kafka's earliest offset (the stored one has been purged)
          val kafkaTopicOffset = tap(f._1).offset
          if (f._2 > kafkaTopicOffset) f else (f._1, kafkaTopicOffset)
        })
      }
      val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
      // Resume from the validated offsets
      kafkaDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, checkOffset, messageHandler)
    }
    var offsetRanges = Array[OffsetRange]()
    kafkaDStream.foreachRDD(kafkaRDD => {
      offsetRanges = kafkaRDD.asInstanceOf[HasOffsetRanges].offsetRanges
      val map: RDD[String] = kafkaRDD.map(_._2)
      map.foreach(println)
      // Update the offsets. Alternative: a transaction with "replace into",
      // which also covers partitions that have no row in the table yet:
      // DB.localTx { implicit session =>
      //   for (o <- offsetRanges) {
      //     SQL("replace into offset(groupid,topic,partitions,offset) values(?,?,?,?)").bind(
      //       groupid, o.topic, o.partition, o.untilOffset
      //     ).update().apply()
      //   }
      // }
      DB.autoCommit(implicit session => {
        // SQL(...) takes a plain SQL statement, bind(...) fills the "?" placeholders,
        // and update().apply() executes it.
        // Note: plain UPDATE assumes a row per (groupid, topic, partition) already
        // exists; otherwise use the replace-into variant above.
        for (o <- offsetRanges) {
          SQL("update offset set offset = ? where groupid = ? and topic = ? and partitions = ?").bind(
            o.untilOffset, groupid, o.topic, o.partition
          ).update().apply()
        }
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
2. Storing offsets in HBase
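This section was left as a placeholder in my notes. A minimal sketch of the usual pattern follows; the table name `stream_offsets`, column family `offsets`, and row-key layout (groupid plus topic, one column qualifier per partition) are all assumptions, not a fixed convention:

package Utils

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.kafka.OffsetRange

// Sketch only: assumes an existing HBase table 'stream_offsets'
// with a single column family 'offsets'.
object HBaseOffsetStore {
  private val conf = HBaseConfiguration.create() // reads hbase-site.xml from the classpath
  private val connection = ConnectionFactory.createConnection(conf)
  private val table = connection.getTable(TableName.valueOf("stream_offsets"))
  private val cf = Bytes.toBytes("offsets")

  // Persist each range's untilOffset under row key "<groupid>:<topic>",
  // one column qualifier per partition
  def saveOffsets(groupid: String, offsetRanges: Array[OffsetRange]): Unit = {
    offsetRanges.foreach { o =>
      val put = new Put(Bytes.toBytes(s"$groupid:${o.topic}"))
      put.addColumn(cf, Bytes.toBytes(o.partition.toString), Bytes.toBytes(o.untilOffset.toString))
      table.put(put)
    }
  }

  // Read back partition -> offset for one group/topic; empty map on first run
  def readOffsets(groupid: String, topic: String): Map[Int, Long] = {
    val result = table.get(new Get(Bytes.toBytes(s"$groupid:$topic")))
    if (result.isEmpty) Map()
    else {
      import scala.collection.JavaConverters._
      result.getFamilyMap(cf).asScala.map { case (q, v) =>
        (Bytes.toString(q).toInt, Bytes.toString(v).toLong)
      }.toMap
    }
  }
}

On each batch you would call readOffsets before creating the direct stream and saveOffsets after processing, exactly as the MySQL example above does.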
3. Maintaining offsets in Zookeeper
spark-streaming-kafka-0-8
(For a more detailed original write-up see: https://www.jianshu.com/p/0ef144022042?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes)
The classes for consuming Kafka from Spark Streaming; first the offset-management class:
package newSparkSteaming

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkException
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaCluster, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}

/*
 * Created by Jerry on 2020/1/10
 *
 * Kafka offset management class that maintains offsets in Zookeeper. Besides the
 * integrated Kafka API used below, the same thing can be done with a ZkClient API.
 */
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
  private val kc = new KafkaCluster(kafkaParams)

  /*
   * Create the data stream
   */
  def createDirectStream(ssc: StreamingContext, topics: Set[String]): InputDStream[(String, String)] = {
    val groupid: String = kafkaParams("group.id")
    // Before reading offsets from Zookeeper, reconcile them with what Kafka actually has
    setOrUpdateOffsets(topics, groupid)
    // Read the offsets from Zookeeper and start consuming messages from them
    val kafkaStream = {
      val partitionsE: Either[Err, Set[TopicAndPartition]] = kc.getPartitions(topics)
      if (partitionsE.isLeft) {
        throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
      }
      val partitions: Set[TopicAndPartition] = partitionsE.right.get
      val consumerOffsetsE: Either[Err, Map[TopicAndPartition, Long]] = kc.getConsumerOffsets(groupid, partitions)
      if (consumerOffsetsE.isLeft) {
        throw new SparkException(s"get kafka consumer offsets failed: ${consumerOffsetsE.left.get}")
      }
      val consumerOffsets: Map[TopicAndPartition, Long] = consumerOffsetsE.right.get
      println(consumerOffsets)
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, consumerOffsets,
        (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
    }
    kafkaStream
  }

  /*
   * Before creating the stream, update the consumer offsets according to what has
   * actually been consumed. If the streaming job throws
   * kafka.common.OffsetOutOfRangeException, the offsets saved in Zookeeper are stale:
   * Kafka's retention policy has already deleted the segments they point to. To handle
   * this, compare the consumerOffsets in Zookeeper against earliestLeaderOffsets; if a
   * consumerOffset is smaller than the earliestLeaderOffset, it is stale, so reset it
   * to the earliestLeaderOffset.
   */
  def setOrUpdateOffsets(topics: Set[String], groupid: String): Unit = {
    topics.foreach(topic => {
      var hasConsumed = true
      val partitionsE: Either[Err, Set[TopicAndPartition]] = kc.getPartitions(Set(topic))
      if (partitionsE.isLeft) {
        throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
      }
      val partitions: Set[TopicAndPartition] = partitionsE.right.get
      val consumerOffsetsE: Either[Err, Map[TopicAndPartition, Long]] = kc.getConsumerOffsets(groupid, partitions)
      if (consumerOffsetsE.isLeft) { hasConsumed = false }
      if (hasConsumed) {
        // This group has consumed before
        val earliestLeaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
        if (earliestLeaderOffsetsE.isLeft) {
          throw new SparkException(s"get earliest leader offsets failed: ${earliestLeaderOffsetsE.left.get}")
        }
        val earliestLeaderOffsets = earliestLeaderOffsetsE.right.get
        val consumerOffsets: Map[TopicAndPartition, Long] = consumerOffsetsE.right.get
        // Possibly only some partitions have stale consumerOffsets,
        // so update only those partitions to earliestLeaderOffsets
        var offsets: Map[TopicAndPartition, Long] = Map()
        consumerOffsets.foreach({ case (tp, n) =>
          val earliestLeaderOffset: Long = earliestLeaderOffsets(tp).offset
          if (n < earliestLeaderOffset) {
            println("consumer group:" + groupid + ", topic:" + tp.topic + ", partition:" + tp.partition
              + " offsets are stale, updating to: " + earliestLeaderOffset)
            offsets += (tp -> earliestLeaderOffset)
          }
        })
        if (offsets.nonEmpty) {
          kc.setConsumerOffsets(groupid, offsets)
        }
      } else {
        // First time this group consumes the topic
        println(groupid + " consuming topic for the first time: " + topics)
        val reset: Option[String] = kafkaParams.get("auto.offset.reset").map(_.toLowerCase())
        var leaderOffsets: Map[TopicAndPartition, LeaderOffset] = null
        if (reset == Some("smallest")) {
          val leaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
          if (leaderOffsetsE.isLeft) {
            throw new SparkException(s"get earliest leader offsets failed: ${leaderOffsetsE.left.get}")
          }
          leaderOffsets = leaderOffsetsE.right.get
        } else {
          // Any other reset value: start from the latest offsets
          val leaderOffsetsE = kc.getLatestLeaderOffsets(partitions)
          if (leaderOffsetsE.isLeft) {
            throw new SparkException(s"get latest leader offsets failed: ${leaderOffsetsE.left.get}")
          }
          leaderOffsets = leaderOffsetsE.right.get
        }
        val offsets = leaderOffsets.map {
          case (tp, offset) => (tp, offset.offset)
        }
        kc.setConsumerOffsets(groupid, offsets)
      }
    })
  }

  /*
   * Update the consumer offsets stored in Zookeeper
   */
  def updateOffsets(offsetRanges: Array[OffsetRange]): Unit = {
    val groupid = kafkaParams("group.id")
    for (offsets <- offsetRanges) {
      val topicAndPartition = TopicAndPartition(offsets.topic, offsets.partition)
      val o: Either[Err, Map[TopicAndPartition, Short]] =
        kc.setConsumerOffsets(groupid, Map((topicAndPartition, offsets.untilOffset)))
      if (o.isLeft) {
        println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
      }
    }
  }
}
The driver class that consumes Kafka through the KafkaManager above:
package newSparkSteaming

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.HasOffsetRanges
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object DirectKafkameterData {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DirectKafkameterData")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(10))
    // Kafka and Zookeeper nodes
    val broker_list = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val zk_servers = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val groupid = "groupid"
    val topics = Set("topic")
    /*
     * Parameter notes: AUTO_OFFSET_RESET_CONFIG
     * smallest: if a partition has a committed offset, consume from it;
     *           otherwise consume from the beginning
     * largest:  if a partition has a committed offset, consume from it;
     *           otherwise consume only newly produced data in that partition
     * disable:  if every partition has a committed offset, consume from those offsets;
     *           if any partition lacks a committed offset, throw an exception
     */
    val kafkaParams = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> broker_list,
      ConsumerConfig.GROUP_ID_CONFIG -> groupid,
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "smallest"
    )
    val kafkaManager = new KafkaManager(kafkaParams)
    // Create the data stream
    val kafkaStream: InputDStream[(String, String)] = kafkaManager.createDirectStream(ssc, topics)
    kafkaStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(msg => msg._2).foreachPartition(ite => {
        ite.foreach(record => {
          // processing logic goes here
          println(record)
        })
      })
      // After the batch is processed, write its end offsets back to Zookeeper
      kafkaManager.updateOffsets(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
4. Storing offsets in Redis
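This section and section 5 use the 0-10 integration, so the corresponding artifact is needed, plus the Jedis client for Redis (the versions shown are assumptions; align them with your Spark release and Redis setup):

<!-- spark-streaming-kafka-0-10 and jedis (assumed versions) -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.9.0</version>
</dependency>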
JedisPoolUtil.scala utility class
package Utils

import com.typesafe.config.{Config, ConfigFactory}
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object JedisPoolUtil {
  // Load the configuration file (application.conf)
  private val config: Config = ConfigFactory.load()
  private val host: String = config.getString("redis.host")
  private val auth: String = config.getString("redis.auth")
  private val port: Int = config.getInt("redis.port")
  private val jedisConfig = new JedisPoolConfig
  // Maximum number of connections
  jedisConfig.setMaxTotal(config.getInt("redis.maxConn"))
  // Maximum number of idle connections
  jedisConfig.setMaxIdle(config.getInt("redis.maxIdle"))
  // Build the pool with these properties (10 s connection timeout)
  val pool = new JedisPool(jedisConfig, host, port, 10000, auth)

  def getConnections(): Jedis = {
    pool.getResource
  }
}
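A matching `application.conf` sketch, with the key names taken from the getString/getInt calls above and placeholder values:

redis.host = "node1"
redis.port = 6379
redis.auth = "yourpassword"
redis.maxConn = 100
redis.maxIdle = 10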
StreamingOffsetToRedis.scala streaming class
package SpakStreaming

import Utils.JedisPoolUtil
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010._
import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.util.Try

object StreamingOffsetToRedis {
  // Read the saved offsets for one consumer group from Redis;
  // returns an empty map on the first run
  def getOffset(topics: Set[String], groupid: String): mutable.Map[TopicPartition, Long] = {
    val fromOffset: mutable.Map[TopicPartition, Long] = scala.collection.mutable.Map[TopicPartition, Long]()
    // Fetch the values stored in Redis
    val jedis = JedisPoolUtil.getConnections()
    topics.foreach(topic => {
      // One key per partition: offset_<groupid>_<topic>_<partition>
      val keys = jedis.keys(s"offset_${groupid}_${topic}_*")
      if (!keys.isEmpty) {
        keys.asScala.foreach(key => {
          val offset = jedis.get(key)
          val partition = Try(key.split(s"offset_${groupid}_${topic}_").apply(1)).getOrElse("0")
          println(partition + "::" + offset)
          fromOffset.put(new TopicPartition(topic, partition.toInt), offset.toLong)
        })
      }
    })
    jedis.close()
    fromOffset
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingOffsetToRedis").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(2))
    // Kafka topic
    val topics = Set("offset-redis-01")
    // Kafka params
    val kafkaParams: Map[String, Object] = Map(
      "bootstrap.servers" -> "node1:9092,node2:9092,node3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offSet-Redis-Test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val groupid = kafkaParams("group.id").toString
    // auto.offset.reset has three possible values:
    // earliest: if a partition has a committed offset, consume from it; otherwise from the beginning
    // latest:   if a partition has a committed offset, consume from it; otherwise only newly produced data
    // none:     if every partition has a committed offset, consume from those offsets;
    //           if any partition lacks one, throw an exception
    val reset = kafkaParams("auto.offset.reset").toString
    // Load the saved offsets
    val offsets = getOffset(topics, groupid)
    // How Spark assigns partitions to executors: spread them evenly
    val locationStrategy: LocationStrategy = LocationStrategies.PreferConsistent
    val consumerStrategy: ConsumerStrategy[String, String] = ConsumerStrategies.Subscribe(topics, kafkaParams, offsets)
    val kafkaInputStream = KafkaUtils.createDirectStream(ssc, locationStrategy, consumerStrategy)
    kafkaInputStream.foreachRDD(rdd => {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      if (!rdd.isEmpty()) {
        val jedis = JedisPoolUtil.getConnections()
        // Open a Redis transaction so all partition offsets are written atomically
        val transaction = jedis.multi()
        // processing logic goes here
        rdd.foreachPartition(result => {
          result.foreach(println)
        })
        // Save every partition's end offset for this batch
        offsetRanges.foreach(iter => {
          val key = s"offset_${groupid}_${iter.topic}_${iter.partition}"
          val value = iter.untilOffset
          transaction.set(key, value.toString)
        })
        transaction.exec()
        jedis.close()
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
(Adapted from other people's code; I rewrote it myself and collected it here for future reference.)
5. Maintaining offsets in Kafka itself
spark-streaming-kafka-0-10
package SpakStreaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.{Logger, LoggerFactory}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StremingOffsetToKafka {
  private val appName = "StreamingTest"
  private val LOG: Logger = LoggerFactory.getLogger(appName)

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: StreamingTest <propsName>")
      System.exit(1)
    }
    LOG.info("################ Streaming Start ##################")
    // Name of the properties file to read (not used further in this demo)
    val propName = args(0)
    val conf = new SparkConf().setAppName(appName).setMaster("local[2]")
    conf.set("spark.streaming.kafka.maxRatePerPartition", "200")
    val ssc = new StreamingContext(conf, Seconds(4))
    // Kafka parameters
    val topics = Set("topic")
    val groupid = "groupid"
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka01:9092,kafka02:9092,kafka03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupid,
      "auto.offset.reset" -> "earliest", // on first start, consume from the beginning
      "enable.auto.commit" -> (false: java.lang.Boolean) // disable auto commit; we commit manually
    )
    // Let Kafka itself store the offsets (in its internal __consumer_offsets topic)
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent, // spread partitions evenly across executors
      Subscribe[String, String](topics, kafkaParams)
    )
    LOG.info("################## Create Streaming Success ####################")
    stream.foreachRDD(rdd => {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition(iter => {
        iter.foreach(line => {
          // processing logic goes here
          println(line.value())
        })
      })
      // After the batch is processed, commit its end offsets back to Kafka asynchronously
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}