File Streams
In the file-stream scenario, you write a Spark Streaming program that keeps monitoring a directory in the file system; as soon as a new file appears there, Spark Streaming automatically reads its contents and processes them with user-defined logic.
Socket Streams
Spark Streaming can also listen on a socket port, receive data over it, and process the data accordingly.
I. Creating a file stream in spark-shell
1. Create a logfile directory
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir logfile
2. In another terminal, start spark-shell and enter the following statements in order:
scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._
scala> val ssc = new StreamingContext(sc,Seconds(20))
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@2108d503
scala> val lines = ssc.
| textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
lines: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.MappedDStream@63cf0da6
scala> val words = lines.flatMap(_.split(" "))
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@761b4581
scala> val wordCounts = words.map(x => (x,1)).reduceByKey(_ + _)
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@515576b0
scala> wordCounts.print()
scala> ssc.start()
# The following output appears:
-------------------------------------------
Time: 1592364620000 ms
-------------------------------------------
3. Back in the terminal where you created the directories, create a file log.txt under the logfile directory with the following contents:
spark sql
spark streaming
spark MLlib
4. Switch back to the spark-shell terminal to see the word-count results:
-------------------------------------------
Time: 1592364620000 ms
-------------------------------------------
-------------------------------------------
Time: 1592364640000 ms
-------------------------------------------
(spark,3)
(MLlib,1)
(streaming,1)
(sql,1)
-------------------------------------------
Time: 1592364660000 ms
-------------------------------------------
-------------------------------------------
Time: 1592364680000 ms
-------------------------------------------
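The empty batches above simply mean that no new files arrived during those intervals. To end the session cleanly, the streaming context can be stopped; a minimal sketch, assuming you want to keep the underlying SparkContext alive for further shell work:
scala> ssc.stop(stopSparkContext = false)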
II. Creating a file stream with a standalone program
1. Create the code directory and the code file TestStreaming.scala
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir file
cd file
mkdir -p src/main/scala
cd src/main/scala
vi TestStreaming.scala
Contents of TestStreaming.scala:
import org.apache.spark._
import org.apache.spark.streaming._

object WordCountStreaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("WordCountStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2. Under /usr/local/spark/mycode/streaming/file, create a file simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.4.5"
3. Package the code with sbt:
cd /usr/local/spark/mycode/streaming/file
/usr/local/sbt/sbt package
4. After packaging succeeds, start the program with the following commands:
cd /usr/local/spark/mycode/streaming/file
/usr/local/spark/bin/spark-submit \
--class "WordCountStreaming" \
./target/scala-2.11/simple-project_2.11-1.0.jar
5. In another terminal, create a file log2.txt under /usr/local/spark/mycode/streaming/logfile with the following contents:
spark sql
spark streaming
spark MLlib
6. Switch back to the terminal running spark-submit to see the word-count output.
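Since log2.txt contains the same three lines as log.txt, the printed batch should look much like the earlier spark-shell run; an illustrative batch (the timestamp will differ on your machine):
-------------------------------------------
Time: 1592364700000 ms
-------------------------------------------
(spark,3)
(MLlib,1)
(streaming,1)
(sql,1)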
III. Using a socket stream as the data source
When a socket stream is the data source, the application requests data over a socket and, once data has been received, runs the stream computation to process it.
1. Create the code directory and the code file NetworkWordCount.scala
cd /usr/local/spark/mycode
mkdir streaming # (skip if it already exists)
cd streaming
mkdir socket
cd socket
mkdir -p src/main/scala
cd /usr/local/spark/mycode/streaming/socket/src/main/scala
2. Create the files NetworkWordCount.scala and StreamingExamples.scala
cd /usr/local/spark/mycode/streaming/socket/src/main/scala
vi NetworkWordCount.scala
vi StreamingExamples.scala
Contents of NetworkWordCount.scala:
package org.apache.spark.examples.streaming

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Contents of StreamingExamples.scala:
package org.apache.spark.examples.streaming

import org.apache.spark.internal.Logging
import org.apache.log4j.{Level, Logger}

object StreamingExamples extends Logging {
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example. " +
        "To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
3. Create the sbt build file
Under /usr/local/spark/mycode/streaming/socket, create a file simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.4.5"
4. Compile and package the code with sbt:
cd /usr/local/spark/mycode/streaming/socket
/usr/local/sbt/sbt package
5. After packaging succeeds, start the program with the following command:
[root@master socket]# /usr/local/spark/bin/spark-submit \
> --class "org.apache.spark.examples.streaming.NetworkWordCount" \
> ./target/scala-2.11/simple-project_2.11-1.0.jar \
> localhost 9999
6. In another terminal window, run nc -lk 9999, press Enter, and then type "hello spark streaming".
7. Switch back to the spark-submit terminal to see the word-count output.
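For the single line "hello spark streaming", an illustrative batch would be (the timestamp will differ on your machine):
-------------------------------------------
Time: 1592365000000 ms
-------------------------------------------
(hello,1)
(spark,1)
(streaming,1)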
IV. Implementing a custom data source with socket programming
Here the data source is produced by a program you write yourself.
1. Write the code file DataSourceSocket.scala
cd /usr/local/spark/mycode/streaming/socket/src/main/scala
vi DataSourceSocket.scala
Contents of DataSourceSocket.scala:
package org.apache.spark.examples.streaming

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object DataSourceSocket {
  // Pick a random line index in [0, length)
  def index(length: Int) = {
    val rdm = new java.util.Random
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }
    val fileName = args(0)
    val lines = Source.fromFile(fileName).getLines.toList
    val rowCount = lines.length
    val listener = new ServerSocket(args(1).toInt)
    while (true) {
      val socket = listener.accept()
      // Serve each client connection on its own thread
      new Thread() {
        override def run(): Unit = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream(), true)
          while (true) {
            Thread.sleep(args(2).toLong)
            val content = lines(index(rowCount))
            println(content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close() // never reached: the loop above runs until the process exits
        }
      }.start()
    }
  }
}
2. Compile and package the code with sbt:
cd /usr/local/spark/mycode/streaming/socket
/usr/local/sbt/sbt package
3. After packaging succeeds, create a word.txt file and start the DataSourceSocket program with the command below.
Contents of word.txt (under /usr/local/spark/mycode/streaming/socket/):
hello spark streaming
hello spark sql
Start the DataSourceSocket program:
/usr/local/spark/bin/spark-submit \
> --class "org.apache.spark.examples.streaming.DataSourceSocket" \
> ./target/scala-2.11/simple-project_2.11-1.0.jar \
> ./word.txt 9999 1000
Once started, DataSourceSocket keeps listening on port 9999; whenever it receives a client connection request, it accepts the connection and then sends a line of data to that client every second (the 1000 in the command above is the interval in milliseconds).
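Before involving Spark at all, the server can be sanity-checked with a plain socket client; a minimal check, assuming nc is installed (press Ctrl+C to disconnect):
nc localhost 9999
You should see one randomly chosen line of word.txt printed roughly every second.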
4. Open a new terminal window and start NetworkWordCount:
[root@master socket]# /usr/local/spark/bin/spark-submit \
> --class "org.apache.spark.examples.streaming.NetworkWordCount" \
> ./target/scala-2.11/simple-project_2.11-1.0.jar \
> localhost 9999
Running the command above starts a Socket client in the current Linux terminal. It connects to port 9999 on the local machine, where the DataSourceSocket program in the other terminal is listening. Once DataSourceSocket receives the connection request from NetworkWordCount, it establishes the connection and sends data to NetworkWordCount every second. On receiving the data, NetworkWordCount computes the word counts and prints output like the following (not fully captured):
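An illustrative 10-second batch, assuming roughly ten random lines were received (actual counts vary because DataSourceSocket picks lines of word.txt at random):
-------------------------------------------
Time: 1592365200000 ms
-------------------------------------------
(hello,10)
(spark,10)
(streaming,6)
(sql,4)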
5. In the other window, the one running DataSourceSocket, you can see the data being sent continuously.
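That window shows the program's own println output, along these lines (illustrative):
Got client connected from: /127.0.0.1
hello spark streaming
hello spark sql
hello spark streaming
...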