Step 1: Prepare the ZooKeeper environment
(1) Download zookeeper-3.4.14.tar.gz, extract it, and in the conf folder rename zoo_sample.cfg to zoo.cfg.
(2) Start ZooKeeper: open cmd, change to the zookeeper-3.4.14/bin folder, and run zkServer.cmd
Step 2: Prepare the Kafka environment
(1) Download kafka_2.11-2.3.1 and extract it. In the config directory, edit server.properties and set the log.dirs parameter: log.dirs=C:/hnn/kafka/kafka_2.11-2.3.1/logs
(2) Start Kafka: open cmd, change to the C:\hnn\kafka\kafka_2.11-2.3.1 directory, and run .\bin\windows\kafka-server-start.bat .\config\server.properties
(3) Create the topic (from the bin\windows directory):
>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic sparkStreamingTest
On success it prints: Created topic sparkStreamingTest.
(4) Open a producer window: kafka-console-producer.bat --broker-list localhost:9092 --topic sparkStreamingTest
(5) Open a consumer window on the same topic: .\bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic sparkStreamingTest
Step 3: Write the code
Add the dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
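For an sbt project, the same two artifacts could be declared roughly as follows. This is only a sketch and not part of the original setup, which uses Maven: the build.sbt file and the scalaVersion value are assumptions, chosen to match the _2.11 suffix of the artifacts above.

```scala
// build.sbt (hypothetical; the original project is built with Maven)
scalaVersion := "2.11.8" // assumed; any 2.11.x release matches the _2.11 artifacts

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) to the artifact name
  "org.apache.spark" %% "spark-streaming"           % "2.1.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"
)
```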
Code:
package com.spark.self

import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WordCountSprakStreaming {
  val numThreads = 1                 // consumer threads per topic, used by receiver()
  // val topics = "test"
  val topics = "sparkStreamingTest"  // comma-separated list of topics
  val zkQuorum = "localhost:2181"    // ZooKeeper address, used by receiver()
  val group = "consumer1"            // consumer group id, used by receiver()
  val brokers = "localhost:9092"     // Kafka broker list, used by direct()

  def main(args: Array[String]): Unit = {
    receiver()
    // direct()
  }

  // Receiver-based approach: connects through ZooKeeper and keeps a running word count.
  def receiver(): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("kafka test").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches
    ssc.checkpoint("/out") // required by updateStateByKey
    val topicMap = topics.split(",").map((_, numThreads)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val updateFunc = (curVal: Seq[Int], preVal: Option[Int]) => {
      // Sum of the counts that arrived in the current batch
      val total = curVal.sum
      // Running total from earlier batches; 0 the first time a key is seen
      val previous = preVal.getOrElse(0)
      // Some(...) is the new state stored for the key
      Some(total + previous)
    }
    val words = lines.flatMap(_.split(" ")).map(x => (x, 1))
    words.reduceByKey(_ + _).updateStateByKey(updateFunc).print()
    ssc.start()
    ssc.awaitTermination()
  }

  // Direct approach: reads from the brokers and counts words per batch only.
  def direct(): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka test")
    val ssc = new StreamingContext(conf, Seconds(10))
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" ")).map(x => (x, 1))
    words.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
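To see what the flatMap / map / reduceByKey chain computes for a single 10-second batch, the same steps can be tried on an ordinary Scala collection, without Spark or Kafka. This is only an illustrative sketch: the object name BatchWordCountSketch, the helper countWords, and the sample lines are made up for the example, and groupBy plus sum stands in for reduceByKey.

```scala
object BatchWordCountSketch {
  // Mirrors lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
  // on a plain Seq; groupBy followed by sum plays the role of reduceByKey.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Hypothetical contents of one batch typed into the producer window
    val batch = Seq("hello spark", "hello kafka")
    countWords(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With direct(), every batch is counted like this from scratch; receiver() additionally carries the totals forward between batches.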
Step 4: Start the Spark Streaming program
Step 5: Produce data, as shown below:
The console shows the following result:
Summary:
1. The following two lines set the log levels:
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
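2. The updateFunc code below accumulates the counts across batches. Without it, each 10-second batch reports only the data produced within those 10 seconds and nothing is accumulated; comment it out to see the difference.
To carry results forward, add updateStateByKey(updateFunc) to the pipeline. Accumulating state also requires a checkpoint directory, set with ssc.checkpoint(path); otherwise the program fails with an error saying that the checkpoint directory has not been set.
val updateFunc = (curVal: Seq[Int], preVal: Option[Int]) => {
  // Sum of the counts that arrived in the current batch
  val total = curVal.sum
  // Running total from earlier batches; 0 the first time a key is seen
  val previous = preVal.getOrElse(0)
  // Some(...) is the new state stored for the key
  Some(total + previous)
}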
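The effect of updateFunc can be checked without a cluster by applying it by hand to a couple of batches. The sketch below only approximates what updateStateByKey does for each key: the object name UpdateStateSketch, the helper applyBatch, and the sample batches are hypothetical, and real Spark additionally takes care of checkpointing and of distributing the state.

```scala
object UpdateStateSketch {
  // Same logic as updateFunc in the article: new counts plus the previous total.
  val updateFunc = (curVal: Seq[Int], preVal: Option[Int]) => {
    val total = curVal.sum
    val previous = preVal.getOrElse(0)
    Some(total + previous)
  }

  // Hypothetical helper: folds one batch of (word, count) pairs into the running state.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    // Collect the new values per key, as updateStateByKey would hand them to updateFunc
    val newCounts: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }
    // Keys seen before are updated too, with an empty Seq when the batch has no data for them
    val keys = state.keySet ++ newCounts.keySet
    keys.flatMap { key =>
      updateFunc(newCounts.getOrElse(key, Seq.empty[Int]), state.get(key)).map(v => (key, v))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq(("hello", 2), ("spark", 1))
    val batch2 = Seq(("hello", 1), ("kafka", 1))
    val afterFirst = applyBatch(Map.empty, batch1)
    val afterSecond = applyBatch(afterFirst, batch2)
    println(afterFirst)
    println(afterSecond)
  }
}
```

After the second batch the count for "hello" is 3 rather than 1, which is the accumulation that is lost when updateStateByKey is left out.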