Principles (Streaming)
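Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Flume, Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window.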
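Processed data can be pushed out to file systems, databases, and live dashboards. You can also apply Spark's machine learning and graph processing algorithms to data streams.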
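Spark Streaming receives live input data streams and divides the data into batches. The Spark engine then processes each batch, so the final stream of results is also produced in batches.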
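The batching step can be illustrated without Spark at all. The following is a minimal plain-Java sketch, not Spark code: the class name `MicroBatchSketch`, the method `toBatches`, and the sample timestamps are all hypothetical, and it assumes non-negative millisecond timestamps. It only shows how arrival times map to fixed-length batch intervals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of micro-batching; this is not Spark code. */
final class MicroBatchSketch {

    /**
     * Groups event timestamps (non-negative milliseconds) into consecutive
     * batches of batchMillis each. The map key is the batch start time.
     */
    static Map<Long, List<Long>> toBatches(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            // Round the timestamp down to the start of its batch interval.
            long batchStart = (ts / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(100L, 950L, 1000L, 1700L, 2300L);
        // prints {0=[100, 950], 1000=[1000, 1700], 2000=[2300]}
        System.out.println(toBatches(arrivals, 1000L));
    }
}
```

With a 1-second batch interval, the five events fall into three batches, and each batch would be handed to the engine as one unit of work.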
Discretized Streams (DStreams)
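A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming an input stream. Internally, a DStream is represented as a continuous series of RDDs, each holding the data from one batch interval.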
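One way to picture a DStream is as a sequence of batches, where a transformation on the stream is the same function applied to every batch. The sketch below models that idea in plain Java; it does not use the Spark API, and `BatchWordCountSketch`, `countWords`, and `transform` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Models a stream as a list of batches; this is not Spark code. */
final class BatchWordCountSketch {

    /** Counts words in one batch of lines, like the work done on a single batch. */
    static Map<String, Integer> countWords(List<String> batch) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : batch) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Applies the same transformation to every batch, yielding a new sequence of results. */
    static List<Map<String, Integer>> transform(List<List<String>> stream) {
        List<Map<String, Integer>> results = new ArrayList<>();
        for (List<String> batch : stream) {
            results.add(countWords(batch));
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> stream = Arrays.asList(
                Arrays.asList("a b a", "c"),   // batch 1
                Arrays.asList("b b"));         // batch 2
        // prints [{a=2, b=1, c=1}, {b=2}]
        System.out.println(transform(stream));
    }
}
```

Note that the counts are per batch: the second result knows nothing about the first. Carrying results across batches needs windowing or state.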
Window Operations
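Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. A window is defined by its length and its sliding interval, and both must be multiples of the batch interval.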
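To make the two window parameters concrete, here is a small plain-Java sketch that sums per-batch counts over a sliding window. It is an illustration only, not the Spark API: `SlidingWindowSketch` and `windowedSums` are hypothetical names, the window length and sliding interval are expressed in number of batches, and the sketch emits a result only once a full window is available, which may differ from how Spark treats the first few windows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of a sliding window over batches; not Spark code. */
final class SlidingWindowSketch {

    /**
     * Sums per-batch counts over a sliding window.
     * windowLength and slideInterval are measured in batches.
     * One result is emitted per slide, starting with the first full window.
     */
    static List<Integer> windowedSums(List<Integer> batchCounts, int windowLength, int slideInterval) {
        List<Integer> sums = new ArrayList<>();
        for (int end = windowLength; end <= batchCounts.size(); end += slideInterval) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchCounts.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Integer> batchCounts = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        // Window of 3 batches, sliding by 2 batches: prints [6, 12, 18]
        System.out.println(windowedSums(batchCounts, 3, 2));
    }
}
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap by one batch: batch 3 contributes to both the first and second sums.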
Application (example)
1. Dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>2.4.4</version>
</dependency>
2. Example
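package com.citydo.faceadd;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class SparkSteamingDemo {

    private static final Pattern SPACE = Pattern.compile(" ");

    // Assumption: online-time threshold in minutes; the value is hypothetical.
    private static final int THRESHOLD_MINUTES = 120;

    public static void main(String[] args) throws InterruptedException {
        // Both jobs block in awaitTermination(), and only one StreamingContext
        // can be active per JVM, so run one job or the other.
        if (args.length >= 4) {
            getKafka(args);
        } else {
            getWords();
        }
    }

    /**
     * Spark Streaming word count over a TCP socket.
     * @throws InterruptedException if the streaming job is interrupted
     */
    public static void getWords() throws InterruptedException {
        // For local debugging the master must be local[n] with n > 1:
        // one thread receives data and the remaining n-1 threads process it.
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("streaming word count");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Set the log level.
        sc.setLogLevel("INFO");
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));
        // Create a DStream that connects to hostname:port.
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> wordCounts =
                lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
                        .mapToPair(x -> new Tuple2<>(x, 1))
                        .reduceByKey((x, y) -> x + y);
        // Print the first ten elements of each RDD generated in this DStream.
        wordCounts.print();
        // Start the computation and wait for it to terminate.
        ssc.start();
        ssc.awaitTermination();
    }

    /**
     * Spark Streaming job that filters records read from Kafka.
     * Assumption: each record value is a CSV line "name,gender,minutes";
     * the original does not specify the format.
     * @param args checkpoint directory, batch interval in seconds,
     *             comma-separated topics, Kafka brokers
     * @throws InterruptedException if the streaming job is interrupted
     */
    public static void getKafka(String[] args) throws InterruptedException {
        String checkPointDir = args[0];
        String batchTime = args[1];
        String topics = args[2];
        String brokers = args[3];
        Duration batchDuration = Durations.seconds(Integer.parseInt(batchTime));
        // No master is set here; supply one with spark-submit --master.
        SparkConf conf = new SparkConf().setAppName("streaming word count");
        JavaStreamingContext jase = new JavaStreamingContext(conf, batchDuration);
        // Set the Spark Streaming checkpoint directory.
        jase.checkpoint(checkPointDir);
        // Build the set of Kafka topics.
        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        // The 0-10 integration uses the new consumer API, so it is configured
        // with bootstrap.servers rather than metadata.broker.list.
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", brokers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-streaming-demo"); // hypothetical group id
        // 1. Receive data from Kafka as a direct stream.
        JavaInputDStream<ConsumerRecord<String, String>> lines =
                KafkaUtils.createDirectStream(
                        jase,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topicsSet, kafkaParams));
        // 2. Split each line into its fields.
        JavaDStream<String[]> records = lines.map(record -> record.value().split(","));
        // 3. Keep only well-formed records for female users.
        JavaDStream<String[]> femaleRecords =
                records.filter(f -> f.length == 3 && "female".equals(f[1].trim()));
        // 4. Keep users whose online time exceeds the threshold and take their names.
        JavaDStream<String> upTimeUser = femaleRecords
                .filter(f -> Integer.parseInt(f[2].trim()) > THRESHOLD_MINUTES)
                .map(f -> f[0].trim());
        upTimeUser.print();
        // 5. Start the Spark Streaming job.
        jase.start();
        jase.awaitTermination();
    }
}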
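The filtering steps in the Kafka job (keep records for female users, then keep those whose online time exceeds a threshold) can be checked without a Kafka or Spark cluster by expressing them as a plain function. The sketch below does that under a loud assumption: the record format `name,gender,minutes`, the class `FemaleOnlineFilterSketch`, the method `longOnlineFemaleUsers`, and the sample data are all hypothetical and are not given in the original.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java version of the filter steps; assumes records look like "name,gender,minutes". */
final class FemaleOnlineFilterSketch {

    /** Returns the names of female users whose online time exceeds thresholdMinutes. */
    static List<String> longOnlineFemaleUsers(List<String> records, int thresholdMinutes) {
        List<String> names = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split(",");
            if (fields.length != 3) {
                continue; // skip malformed records
            }
            boolean female = "female".equals(fields[1].trim());
            int minutes = Integer.parseInt(fields[2].trim());
            if (female && minutes > thresholdMinutes) {
                names.add(fields[0].trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "alice,female,150",
                "bob,male,300",
                "carol,female,45",
                "broken-line");
        // prints [alice]
        System.out.println(longOnlineFemaleUsers(records, 120));
    }
}
```

Keeping the predicate logic in a function like this makes it easy to unit test, and the same conditions can then be passed as lambdas to the stream's `filter` calls.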
Reference: https://www.cloudera.com/tutorials/introduction-to-spark-streaming/.html