First, read data from a socket, then use Spark Streaming to count the words in the input.
1. Open a port with the following command (if it reports an error, install nc first):
nc -lk 9999
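If nc is not available, the listener can be approximated with Python's standard socket module. This is a minimal sketch, not a full replacement for nc: it serves a fixed list of lines to one client instead of forwarding stdin, and everything except port 9999 is illustrative. The demo client below plays the role that Spark's socket receiver will play later.

```python
import socket
import threading
import time

def serve_lines(lines, host="localhost", port=9999):
    """Accept one client and send it each line, newline-terminated --
    a rough stand-in for `nc -lk 9999` fed from a list instead of stdin."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
    srv.close()

# Demo: a client (standing in for Spark's socket receiver) reads the lines back.
t = threading.Thread(target=serve_lines, args=(["hello spark", "hello streaming"],))
t.start()
cli = None
for _ in range(50):  # retry until the server thread is listening
    try:
        cli = socket.create_connection(("localhost", 9999), timeout=1)
        break
    except OSError:
        time.sleep(0.1)
data = cli.makefile().read()
cli.close()
t.join()
print(data, end="")  # hello spark / hello streaming, one per line
```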
2. Write the sparkstreaming.py code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two working threads and a batch interval of 1 second.
# At least 2 cores are needed, because one core is dedicated to receiving the data
sc = SparkContext("local[2]", "NetworkWordCount")
# Process the data stream in 1-second batches
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
This code reads, every 1 s, the data received on port 9999 during that interval and computes a word count over each batch.
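The per-batch counting is ordinary MapReduce-style word counting. The same flatMap / map / reduceByKey pipeline can be sketched on a plain Python list standing in for one batch of lines (the sample lines here are illustrative, not from the document):

```python
from itertools import groupby

batch = ["hello spark", "hello streaming"]  # one batch's worth of input lines

# flatMap: split each line into words
words = [w for line in batch for w in line.split(" ")]
# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: group by word, then sum the counts within each group
pairs.sort(key=lambda p: p[0])
word_counts = {k: sum(c for _, c in grp)
               for k, grp in groupby(pairs, key=lambda p: p[0])}
print(word_counts)  # {'hello': 2, 'spark': 1, 'streaming': 1}
```

In the real job, Spark performs the reduceByKey step in parallel across partitions; this sketch only mirrors the logic of one batch on one machine.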
3. Run the code above with spark-submit --master local[2] sparkstreaming.py (as noted above, at least 2 cores are required).
Type some text into the nc window from step 1, and the word-count results appear in the window running Spark.