Fixing the "spark.rdd.MapPartitionsRDD cannot be cast to streaming.kafka010.HasOffsetRanges" problem

While testing Spark Streaming recently I ran into a small problem of my own making, so I am writing it down here.
Part of the code:

package com.ybs.screen.test.data

import java.lang
import java.util.Properties

import com.ybs.screen.constant.Constants
import com.ybs.screen.model.{ProperModel, UnitInfo}
import com.ybs.screen.utils.PropertiesUtil
import org.apache.kafka.clients.consumer.{Consumer, ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.elasticsearch.spark.streaming.EsSparkStreaming

import scala.collection.JavaConverters._

object DemoTest {

  def main(args: Array[String]): Unit = {

    val sparkConf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")

    val sparkSession: SparkSession = PropertiesUtil.getSparkSessionTest(sparkConf)
    sparkSession.sparkContext.setLogLevel("WARN")

    val ssc: StreamingContext = new StreamingContext(sparkSession.sparkContext,Seconds(10))

    //Kafka brokers and topic
    val kafkaBrokers:String = ProperModel.getString(Constants.KAFKA_METADATA_BROKER_LIST)
    val kafkaTopics: String =ProperModel.getString( Constants.KAFKA_TOPICS)

    val kafkaParam = Map(
      "bootstrap.servers" -> kafkaBrokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group4",
      //start from the latest offsets when the group has no committed offset yet
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: lang.Boolean)
    )

    //    ssc.checkpoint("./streaming_checkpoint")
    //read data from Kafka; getLastOffsets is a helper (not shown here) that returns the starting offsets
    val inputDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set(kafkaTopics), kafkaParam, getLastOffsets(kafkaParam,Set(kafkaTopics))))

    val value: DStream[String] = inputDStream.map(x => x.value())

    EsSparkStreaming.saveToEs(value, "test/doc")

    value.foreachRDD(rdd => {
      //this cast fails: after map() the RDD is a MapPartitionsRDD, not a KafkaRDD
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      inputDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    })

    ssc.start()
    ssc.awaitTermination()

  }
	
}
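
The helper getLastOffsets called above is not part of the snippet. A minimal sketch of what such a helper might look like, assuming it simply reads the consumer group's previously committed offsets with a plain KafkaConsumer and skips partitions with no committed offset so that auto.offset.reset applies; the name and signature are taken from the call site, everything else is an assumption (the needed imports, Properties, KafkaConsumer, TopicPartition and JavaConverters, are already in the listing above):

//hypothetical sketch of getLastOffsets, not the original implementation
def getLastOffsets(kafkaParam: Map[String, Object],
                   topics: Set[String]): Map[TopicPartition, Long] = {
  val props = new Properties()
  kafkaParam.foreach { case (k, v) => props.put(k, v) }

  val consumer = new KafkaConsumer[String, String](props)
  try {
    topics.flatMap { topic =>
      consumer.partitionsFor(topic).asScala.flatMap { info =>
        val tp = new TopicPartition(topic, info.partition())
        //committed() returns null when the group has no offset for this partition yet
        Option(consumer.committed(tp)).map(meta => tp -> meta.offset())
      }
    }.toMap
  } finally {
    consumer.close()
  }
}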

When committing the offsets, the job failed with the error: spark.rdd.MapPartitionsRDD cannot be cast to streaming.kafka010.HasOffsetRanges.
Searching online, I found that only the DStream obtained directly from Kafka (the inputDStream) is backed by a KafkaRDD. As soon as further transformations are applied, the KafkaRDD is wrapped in an ordinary RDD (here a MapPartitionsRDD), and the cast fails. Here is the relevant source:

private[spark] class KafkaRDD[K, V](
    sc: SparkContext,
    val kafkaParams: ju.Map[String, Object],
    val offsetRanges: Array[OffsetRange],
    val preferredHosts: ju.Map[TopicPartition, String],
    useConsumerCache: Boolean
) extends RDD[ConsumerRecord[K, V]](sc, Nil) with Logging with HasOffsetRanges

//Only a KafkaRDD can provide OffsetRanges (it is the RDD that mixes in HasOffsetRanges),
//and only the first-hand data obtained from the InputDStream is a KafkaRDD
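
A quick way to see the difference is to print the runtime class of the RDD before and after a transformation. A small sketch; the class names in the comments are what I would expect, not verified output:

//first-hand RDD from the direct stream: a KafkaRDD, so the cast to HasOffsetRanges works
inputDStream.foreachRDD(rdd => println(rdd.getClass))
//expected: class org.apache.spark.streaming.kafka010.KafkaRDD

//after map() the lineage wraps it in a MapPartitionsRDD, and the cast fails
inputDStream.map(_.value()).foreachRDD(rdd => println(rdd.getClass))
//expected: class org.apache.spark.rdd.MapPartitionsRDD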

Once the cause is known, the fix is simple: work on the inputDStream directly and read the offsets there. Either move the Elasticsearch write into that same foreachRDD, or commit the offsets first, right after obtaining the inputDStream, and then do the other operations. Here I took the slightly clumsier approach: commit the offsets on the inputDStream first, then carry on with everything else (a sketch of the other variant follows after the snippet below).

inputDStream.foreachRDD(rdd => {
  //rdd here is still the first-hand KafkaRDD, so the cast succeeds
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  inputDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})
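
The other variant mentioned above, writing to Elasticsearch and committing inside the same foreachRDD on the inputDStream, could look roughly like this. It is only a sketch: it uses EsSpark.saveToEs from org.elasticsearch.spark.rdd (the RDD-level counterpart of EsSparkStreaming), and the index "test/doc" is just the one from the example above:

//needs: import org.elasticsearch.spark.rdd.EsSpark
inputDStream.foreachRDD(rdd => {
  //still the first-hand KafkaRDD, so the offsets are available
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  //write the message values to Elasticsearch inside the same batch
  EsSpark.saveToEs(rdd.map(_.value()), "test/doc")

  //commit only after the write has been issued
  inputDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})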

Reference: spark.rdd.MapPartitionsRDD cannot be cast to streaming.kafka010.HasOffsetRange
