一、概述
上次寫這篇文章文章的時候,Spark還是1.x,kafka還是0.8x版本,轉眼間spark到了2.x,kafka也到了2.x,存儲offset的方式也發生了改變,筆者根據上篇文章和網上文章,將offset存儲到Redis,既保證了併發也保證了數據不丟失,經過測試,有效。
二、使用場景
Spark Streaming實時消費kafka數據的時候,程序停止或者Kafka節點掛掉會導致數據丟失,Spark Streaming也沒有設置CheckPoint(據說比較雞肋,雖然可以保存Direct方式的offset,但是可能會導致頻繁寫HDFS佔用IO),所以每次出現問題的時候,重啓程序,而程序的消費方式是Direct,所以在程序down掉的這段時間Kafka上的數據是消費不到的,雖然可以設置offset爲smallest,但是會導致重複消費,重新overwrite hive上的數據,但是不允許重複消費的場景就不能這樣做。
三、原理闡述
在Spark Streaming中消費 Kafka 數據的時候,有兩種方式分別是 :
1.基於 Receiver-based 的 createStream 方法。receiver從Kafka中獲取的數據都是存儲在Spark Executor的內存中的,然後Spark Streaming啓動的job會去處理那些數據。然而,在默認的配置下,這種方式可能會因爲底層的失敗而丟失數據。如果要啓用高可靠機制,讓數據零丟失,就必須啓用Spark Streaming的預寫日誌機制(Write Ahead Log,WAL)。該機制會同步地將接收到的Kafka數據寫入分佈式文件系統(比如HDFS)上的預寫日誌中。所以,即使底層節點出現了失敗,也可以使用預寫日誌中的數據進行恢復。本文對此方式不研究,有興趣的可以自己實現,個人不喜歡這個方式。KafkaUtils.createStream
2.Direct Approach (No Receivers) 方式的 createDirectStream 方法,但是第二種使用方式中 kafka 的 offset 是保存在 checkpoint 中的,如果程序重啓的話,會丟失一部分數據,我使用的是這種方式。KafkaUtils.createDirectStream。本文將用代碼說明如何將 kafka 中的 offset 保存到 Redis 中,以及如何從 Redis 中讀取已存在的 offset。參數auto.offset.reset爲latest的時候程序纔會讀取redis的offset。
四、實現代碼
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010._
import scala.collection.JavaConverters._
import scala.util.Try
/**
* Created by chouyarn of BI on 2018/8/21
*/
object KafkaUtilsRedis {
/**
* 根據groupId保存offset
* @param ranges
* @param groupId
*/
def storeOffset(ranges: Array[OffsetRange], groupId: String): Unit = {
for (o <- ranges) {
val key = s"bi_kafka_offset_${groupId}_${o.topic}_${o.partition}"
val value = o.untilOffset
JedisUtil.set(key, value.toString)
}
}
/**
* 根據topic,groupid獲取offset
* @param topics
* @param groupId
* @return
*/
def getOffset(topics: Array[String], groupId: String): (Map[TopicPartition, Long], Int) = {
val fromOffSets = scala.collection.mutable.Map[TopicPartition, Long]()
topics.foreach(topic => {
val keys = JedisUtil.getKeys(s"bi_kafka_offset_${groupId}_${topic}*")
if (!keys.isEmpty) {
keys.asScala.foreach(key => {
val offset = JedisUtil.get(key)
val partition = Try(key.split(s"bi_kafka_offset_${groupId}_${topic}_").apply(1)).getOrElse("0")
fromOffSets.put(new TopicPartition(topic, partition.toInt), offset.toLong)
})
}
})
if (fromOffSets.isEmpty) {
(fromOffSets.toMap, 0)
} else {
(fromOffSets.toMap, 1)
}
}
/**
* 創建InputDStream,如果auto.offset.reset爲latest則從redis讀取
* @param ssc
* @param topic
* @param kafkaParams
* @return
*/
def createStreamingContextRedis(ssc: StreamingContext, topic: Array[String],
kafkaParams: Map[String, Object]): InputDStream[ConsumerRecord[String, String]] = {
var kafkaStreams: InputDStream[ConsumerRecord[String, String]] = null
val groupId = kafkaParams.get("group.id").get
val (fromOffSet, flag) = getOffset(topic, groupId.toString)
val offsetReset = kafkaParams.get("auto.offset.reset").get
if (flag == 1 && offsetReset.equals("latest")) {
kafkaStreams = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(topic, kafkaParams, fromOffSet))
} else {
kafkaStreams = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(topic, kafkaParams))
}
kafkaStreams
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("offSet Redis").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(60))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"group.id" -> "binlog.test.rpt_test_1min",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"session.timeout.ms" -> "20000",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer]
)
val topic = Array("binlog.test.rpt_test", "binlog.test.hbase_test", "binlog.test.offset_test")
val groupId = "binlog.test.rpt_test_1min"
val lines = createStreamingContextRedis(ssc, topic, kafkaParams)
lines.foreachRDD(rdds => {
if (!rdds.isEmpty()) {
println("##################:" + rdds.count())
}
storeOffset(rdds.asInstanceOf[HasOffsetRanges].offsetRanges, groupId)
})
ssc.start()
ssc.awaitTermination()
}
}
五、JedisUtil代碼
import java.util
import com.typesafe.config.ConfigFactory
import org.apache.kafka.common.serialization.StringDeserializer
import redis.clients.jedis.{HostAndPort, JedisCluster, JedisPool, JedisPoolConfig}
object JedisUtil {
private val config = ConfigFactory.load("realtime-etl.conf")
private val redisHosts: String = config.getString("redis.server")
private val port: Int = config.getInt("redis.port")
private val hostAndPortsSet: java.util.Set[HostAndPort] = new util.HashSet[HostAndPort]()
redisHosts.split(",").foreach(host => {
hostAndPortsSet.add(new HostAndPort(host, port))
})
private val jedisConf: JedisPoolConfig = new JedisPoolConfig()
jedisConf.setMaxTotal(5000)
jedisConf.setMaxWaitMillis(50000)
jedisConf.setMaxIdle(300)
jedisConf.setTestOnBorrow(true)
jedisConf.setTestOnReturn(true)
jedisConf.setTestWhileIdle(true)
jedisConf.setMinEvictableIdleTimeMillis(60000l)
jedisConf.setTimeBetweenEvictionRunsMillis(3000l)
jedisConf.setNumTestsPerEvictionRun(-1)
lazy val redis = new JedisCluster(hostAndPortsSet, jedisConf)
def get(key: String): String = {
try {
redis.get(key)
} catch {
case e: Exception => e.printStackTrace()
null
}
}
def set(key: String, value: String) = {
try {
redis.set(key, value)
} catch {
case e: Exception => {
e.printStackTrace()
}
}
}
def hmset(key: String, map: java.util.Map[String, String]): Unit = {
// val redis=pool.getResource
try {
redis.hmset(key, map)
}catch {
case e:Exception => e.printStackTrace()
}
}
def hset(key: String, field: String, value: String): Unit = {
// val redis=pool.getResource
try {
redis.hset(key, field, value)
} catch {
case e: Exception => {
e.printStackTrace()
}
}
}
def hget(key: String, field: String): String = {
try {
redis.hget(key, field)
}catch {
case e:Exception => e.printStackTrace()
null
}
}
def hgetAll(key: String): java.util.Map[String, String] = {
try {
redis.hgetAll(key)
} catch {
case e: Exception => e.printStackTrace()
null
}
}
}
六、總結
根據不同的groupid來保存不同的offset,支持多個topic