Kafka + Spark Streaming + HBase

I. Overview

1. Requirements analysis

Real-time location system: locate a given user's exact position in real time and store the latest data for each user.

2. Approach

Spark Streaming consumes the raw user-location records from Kafka and analyzes them, then writes the records that satisfy the requirements to HBase with rowkey = user name. To keep things simple here, the raw data consumed from Kafka is assumed to already be in its analyzed form, so it can be written to HBase directly after consumption.

3. Component versions

Component   Version
kafka       kafka_2.10-0.10.2.1
spark       spark-2.2.0-bin-hadoop2.7
hbase       hbase-1.2.6

II. Implementation

1. Create the table in HBase

hbase(main):003:0> create 'location_sure','info'
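
If you prefer to create the table from code instead of the shell, a minimal sketch using the HBase 1.2 Admin API could look like the following (the ZooKeeper quorum and port are the values assumed later in this article; CreateLocationTable is a hypothetical helper name):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

// Hypothetical equivalent of: create 'location_sure','info'
object CreateLocationTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "master,slaves1,slaves2")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin
    val tableName = TableName.valueOf("location_sure")
    if (!admin.tableExists(tableName)) {
      val desc = new HTableDescriptor(tableName)
      desc.addFamily(new HColumnDescriptor("info")) // single column family "info"
      admin.createTable(desc)
    }
    admin.close()
    connection.close()
  }
}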

2. Maven dependencies (pom.xml)

<dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.1</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.1</version>
        </dependency>
        <!-- The legacy spark-streaming-kafka_2.10 artifact (Kafka 0.8 integration, built for
             Scala 2.10) is not used by the code below, and mixing Scala 2.10 and 2.11 artifacts
             on the classpath causes conflicts, so it is left commented out.
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.6.3</version>
        </dependency>
        -->
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.2.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.gavaghan</groupId>
            <artifactId>geodesy</artifactId>
            <version>1.1.3</version>
        </dependency>
        <dependency>
            <groupId>com.github.scopt</groupId>
            <artifactId>scopt_2.11</artifactId>
            <version>3.7.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.4</version>
        </dependency>

        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.9.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.codehaus.jettison/jettison -->
        <dependency>
            <groupId>org.codehaus.jettison</groupId>
            <artifactId>jettison</artifactId>
            <version>1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/net.sf.json-lib/json-lib -->
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.4</version>
            <classifier>jdk15</classifier>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-pool2 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-pool2</artifactId>
            <version>2.4.2</version>
        </dependency>
        <!-- Oracle JDBC driver -->
        <dependency>
            <groupId>com.oracle</groupId>
            <artifactId>ojdbc6</artifactId>
            <version>12.1.0.1-atlassian-hosted</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>

3. Utility classes

TimeUtil.scala: date and time conversion helpers.

package com.cn.util

import java.text.SimpleDateFormat
import java.util.Date

object TimeUtil {
  // Convert a formatted time string to epoch milliseconds
  def tranTimeToLong(tm: String): Long = {
    val fm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    fm.parse(tm).getTime
  }

  // Convert epoch milliseconds (passed as a string) back to a formatted time string
  def tranTimeToString(tm:String) :String={
    val fm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val tim = fm.format(new Date(tm.toLong))
    tim
  }
}
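
A quick illustrative use of these helpers (the exact millisecond value depends on the JVM's time zone):

// Illustrative usage of TimeUtil
val millis = TimeUtil.tranTimeToLong("2020-03-11 11:45:12") // epoch milliseconds
val text = TimeUtil.tranTimeToString(millis.toString)       // back to "2020-03-11 11:45:12"
println(s"$millis -> $text")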
HbaseUtils.scala: builds an HBase Connection from a ZooKeeper quorum.

package com.cn.util

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

object HbaseUtils extends Serializable{

  /**
    * @param zkList comma-separated ZooKeeper quorum, e.g. "master,slaves1,slaves2"
    * @param port   ZooKeeper client port, e.g. "2181"
    * @return a new HBase Connection (the caller is responsible for closing it)
    */
    def getHBaseConn(zkList: String, port: String): Connection = {
      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.quorum", zkList)
      conf.set("hbase.zookeeper.property.clientPort", port)
      val connection = ConnectionFactory.createConnection(conf)
      connection
    }

}

4. Generating test data into Kafka

package com.cn.util

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.codehaus.jettison.json.JSONObject

import scala.util.Random


/**
  * A producer that writes simulated data to the Kafka cluster.
  * Simulated scenario:
  * users are walking around; every 5 seconds a record is emitted containing the user, the sampling time and the current location.
  */
object KafkaEventProducer {

  // Users (guanYu appears twice, so its HBase row ends up holding only the latest record)
  private val users = Array(
    "zhangSan", "liSi",
    "wangWu", "xiaoQiang",
    "zhangFei", "liuBei",
    "guanYu", "maChao",
    "caoCao", "guanYu"
  )

  private var pointer = -1

  // Return the next user, cycling through the list (round-robin, not random)
  def getUser(): String = {
    pointer = (pointer + 1) % users.length
    users(pointer)
  }

 
  val random = new Random()

  // Current sampling time
  def getTime(): Long = {
    System.currentTimeMillis()
  }

  // Candidate walking locations
  val walkPlace = Array(
    "操場南門", "操場東門", "操場北門", "操場西門", "操場東南門", "操場西北門", "操場西南門", "操場東南北門"
  )

  def getWalkPlace(): String = {
    walkPlace(random.nextInt(walkPlace.length))
  }


  def main(args: Array[String]): Unit = {

    val topic = "topic_walkCount"
    val brokers = "master:6667,slaves1:6667,slaves2:6667"
    // Producer configuration
    val props = new Properties()
    props.setProperty("bootstrap.servers", brokers)
    props.setProperty("metadata.broker.list", brokers)
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Create the producer
    val producer = new KafkaProducer[String, String](props)

    // Emit one record every 5 seconds
    while (true) {
      val event = new JSONObject()
      event.put("user", getUser())
        .put("count_time", TimeUtil.tranTimeToString(getTime().toString))
        .put("walk_place", getWalkPlace())
      println(event.toString())
      // send the record to Kafka
      producer.send(new ProducerRecord[String, String](topic, event.toString))
      Thread.sleep(5000)
    }
  }
}
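
Before wiring in Spark Streaming, it can help to confirm that records are actually reaching the topic. Below is a minimal stand-alone consumer sketch (TopicPeek is a hypothetical name; the broker list and topic match the producer above):

package com.cn.util

import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

import scala.collection.JavaConverters._

// Hypothetical sanity check: print every record arriving on topic_walkCount
object TopicPeek {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "master:6667,slaves1:6667,slaves2:6667")
    props.setProperty("group.id", "peek")
    props.setProperty("auto.offset.reset", "earliest")
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("topic_walkCount"))
    while (true) {
      // poll for up to one second and print each record value
      consumer.poll(1000L).asScala.foreach(r => println(r.value()))
    }
  }
}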

5. Stream processing and storage

package com.cn.sparkStreaming

import com.cn.util.HbaseUtils
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010._
import org.codehaus.jettison.json.JSONObject

object kafka2sparkStreaming2Hbase {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka2sparkStreaming2Hbase")
      .setMaster("local[1]")
    //.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // batch interval for the stream: 1 second
    val ssc = new StreamingContext(conf, Seconds(1))
    // control log verbosity
    ssc.sparkContext.setLogLevel("WARN") //WARN,INFO,DEBUG
    ssc.checkpoint("checkpoint")
    val topic = "topic_walkCount"
    val groupId = "t02"
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "master:6667,slaves1:6667,slaves2:6667",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "earliest", // 初次啓動從最開始的位置開始消費
      "enable.auto.commit" -> (false: java.lang.Boolean) // 自動提交設置爲 false
    )

    val topics = Array(topic)
    val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, // distribute partitions evenly across executors
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD(rdd => {
      // the offset range consumed by each partition in this batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition(partitions => {
        // one HBase connection per partition (opening one per record would leak connections)
        val connection = HbaseUtils.getHBaseConn("master,slaves1,slaves2", "2181")
        val table = connection.getTable(TableName.valueOf("location_sure"))
        // column family and column names
        val columnFamily = Bytes.toBytes("info")
        val count_time = Bytes.toBytes("count_time")
        val walk_place = Bytes.toBytes("walk_place")

        partitions.foreach(records => {
          val record = new JSONObject(records.value())
          val user = record.getString("user")
          val countTime = record.getString("count_time")
          val walkPlace = record.getString("walk_place")

          val put = new Put(Bytes.toBytes(user))
          // countTime and walkPlace must be Strings (call toString if they are not); otherwise the stored values are garbled
          put.addColumn(columnFamily, count_time, Bytes.toBytes(countTime))
          put.addColumn(columnFamily, walk_place, Bytes.toBytes(walkPlace))

          table.put(put)
          println("insert hbase success!")
        })

        table.close()
        connection.close()
      })
      // commit the consumed offsets manually
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
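
Writing each Put individually through a Table handle is fine for this small demo. For larger volumes, one common alternative is to let Spark write each micro-batch through HBase's TableOutputFormat. The following is only a sketch of that variant (same table and columns, replacing the body of foreachRDD in the program above), not a drop-in, tested implementation:

// Sketch: bulk-writing each micro-batch via TableOutputFormat instead of per-record puts
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", "master,slaves1,slaves2")
  hConf.set("hbase.zookeeper.property.clientPort", "2181")
  hConf.set(TableOutputFormat.OUTPUT_TABLE, "location_sure")
  val job = Job.getInstance(hConf)
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Put])

  rdd.map { r =>
    val json = new JSONObject(r.value())
    val put = new Put(Bytes.toBytes(json.getString("user")))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("count_time"), Bytes.toBytes(json.getString("count_time")))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("walk_place"), Bytes.toBytes(json.getString("walk_place")))
    (new ImmutableBytesWritable(put.getRow), put)
  }.saveAsNewAPIHadoopDataset(job.getConfiguration)

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}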

III. Testing

1. Source data (as printed by the producer)

{"user":"zhangSan","count_time":"2020-03-11 11:45:12","walk_place":"操場南門"}
{"user":"liSi","count_time":"2020-03-11 11:45:17","walk_place":"操場西南門"}
{"user":"wangWu","count_time":"2020-03-11 11:45:22","walk_place":"操場西南門"}
{"user":"xiaoQiang","count_time":"2020-03-11 11:45:27","walk_place":"操場東門"}
{"user":"zhangFei","count_time":"2020-03-11 11:45:32","walk_place":"操場北門"}
{"user":"liuBei","count_time":"2020-03-11 11:45:37","walk_place":"操場西門"}
{"user":"guanYu","count_time":"2020-03-11 11:45:42","walk_place":"操場南門"}
{"user":"maChao","count_time":"2020-03-11 11:45:47","walk_place":"操場南門"}
{"user":"caoCao","count_time":"2020-03-11 11:45:52","walk_place":"操場東南北門"}
{"user":"guanYu","count_time":"2020-03-11 11:45:57","walk_place":"操場西北門"}

2. Data stored in HBase (shell view; the shell renders the UTF-8 bytes of the Chinese location values as \xNN escapes)

hbase(main):006:0> scan 'location_sure'
ROW                                COLUMN+CELL                                                                                         
 caoCao                            column=info:count_time, timestamp=1583898354092, value=2020-03-11 11:45:52                          
 caoCao                            column=info:walk_place, timestamp=1583898354092, value=\xE6\x93\x8D\xE5\x9C\xBA\xE4\xB8\x9C\xE5\x8D\
                                   x97\xE5\x8C\x97\xE9\x97\xA8                                                                         
 guanYu                            column=info:count_time, timestamp=1583898357267, value=2020-03-11 11:45:57                          
 guanYu                            column=info:walk_place, timestamp=1583898357267, value=\xE6\x93\x8D\xE5\x9C\xBA\xE8\xA5\xBF\xE5\x8C\
                                   x97\xE9\x97\xA8                                                                                     
 liSi                              column=info:count_time, timestamp=1583898330817, value=2020-03-11 11:45:17                          
 liSi                              column=info:walk_place, timestamp=1583898330817, value=\xE6\x93\x8D\xE5\x9C\xBA\xE8\xA5\xBF\xE5\x8D\
                                   x97\xE9\x97\xA8                                                                                     
 liuBei                            column=info:count_time, timestamp=1583898337245, value=2020-03-11 11:45:37                          
 liuBei                            column=info:walk_place, timestamp=1583898337245, value=\xE6\x93\x8D\xE5\x9C\xBA\xE8\xA5\xBF\xE9\x97\
                                   xA8                                                                                                 
 maChao                            column=info:count_time, timestamp=1583898354005, value=2020-03-11 11:45:47                          
 maChao                            column=info:walk_place, timestamp=1583898354005, value=\xE6\x93\x8D\xE5\x9C\xBA\xE5\x8D\x97\xE9\x97\
                                   xA8                                                                                                 
 wangWu                            column=info:count_time, timestamp=1583898330892, value=2020-03-11 11:45:22                          
 wangWu                            column=info:walk_place, timestamp=1583898330892, value=\xE6\x93\x8D\xE5\x9C\xBA\xE8\xA5\xBF\xE5\x8D\
                                   x97\xE9\x97\xA8                                                                                     
 xiaoQiang                         column=info:count_time, timestamp=1583898330990, value=2020-03-11 11:45:27                          
 xiaoQiang                         column=info:walk_place, timestamp=1583898330990, value=\xE6\x93\x8D\xE5\x9C\xBA\xE4\xB8\x9C\xE9\x97\
                                   xA8                                                                                                 
 zhangFei                          column=info:count_time, timestamp=1583898332253, value=2020-03-11 11:45:32                          
 zhangFei                          column=info:walk_place, timestamp=1583898332253, value=\xE6\x93\x8D\xE5\x9C\xBA\xE5\x8C\x97\xE9\x97\
                                   xA8                                                                                                 
 zhangSan                          column=info:count_time, timestamp=1583898330778, value=2020-03-11 11:45:12                          
 zhangSan                          column=info:walk_place, timestamp=1583898330778, value=\xE6\x93\x8D\xE5\x9C\xBA\xE5\x8D\x97\xE9\x97\
                                   xA8                                                                                                 
9 row(s) in 0.0480 seconds

hbase(main):007:0> 

3. Results read back from HBase by a program

rowkey: caoCao, family: info, column: count_time, value: 2020-03-11 11:45:52
rowkey: caoCao, family: info, column: walk_place, value: 操場東南北門
rowkey: guanYu, family: info, column: count_time, value: 2020-03-11 11:45:57
rowkey: guanYu, family: info, column: walk_place, value: 操場西北門
rowkey: liSi, family: info, column: count_time, value: 2020-03-11 11:45:17
rowkey: liSi, family: info, column: walk_place, value: 操場西南門
rowkey: liuBei, family: info, column: count_time, value: 2020-03-11 11:45:37
rowkey: liuBei, family: info, column: walk_place, value: 操場西門
rowkey: maChao, family: info, column: count_time, value: 2020-03-11 11:45:47
rowkey: maChao, family: info, column: walk_place, value: 操場南門
rowkey: wangWu, family: info, column: count_time, value: 2020-03-11 11:45:22
rowkey: wangWu, family: info, column: walk_place, value: 操場西南門
rowkey: xiaoQiang, family: info, column: count_time, value: 2020-03-11 11:45:27
rowkey: xiaoQiang, family: info, column: walk_place, value: 操場東門
rowkey: zhangFei, family: info, column: count_time, value: 2020-03-11 11:45:32
rowkey: zhangFei, family: info, column: walk_place, value: 操場北門
rowkey: zhangSan, family: info, column: count_time, value: 2020-03-11 11:45:12
rowkey: zhangSan, family: info, column: walk_place, value: 操場南門
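
The reader program itself is not listed above. A minimal sketch that would produce output in this format (reusing the HbaseUtils helper and the same ZooKeeper quorum; ReadLocationTable is a hypothetical name) could be:

import com.cn.util.HbaseUtils
import org.apache.hadoop.hbase.{CellUtil, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConverters._

// Hypothetical reader: scan 'location_sure' and print every cell
object ReadLocationTable {
  def main(args: Array[String]): Unit = {
    val connection = HbaseUtils.getHBaseConn("master,slaves1,slaves2", "2181")
    val table = connection.getTable(TableName.valueOf("location_sure"))
    val scanner = table.getScanner(new Scan())
    scanner.asScala.foreach { result =>
      result.rawCells().foreach { cell =>
        val row    = Bytes.toString(CellUtil.cloneRow(cell))
        val family = Bytes.toString(CellUtil.cloneFamily(cell))
        val column = Bytes.toString(CellUtil.cloneQualifier(cell))
        val value  = Bytes.toString(CellUtil.cloneValue(cell))
        println(s"rowkey: $row, family: $family, column: $column, value: $value")
      }
    }
    scanner.close()
    table.close()
    connection.close()
  }
}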

 
