A First Word Count Program with Spark Streaming (Python Version)

First, data is read from a socket; then Spark Streaming counts the words in the input.

1. Open the port with the following command (if it reports an error, you need to install nc):

nc -lk 9999

2. Write the sparkstreaming.py code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
# At least 2 cores are needed, because one core is dedicated to receiving the data
sc = SparkContext("local[2]", "NetworkWordCount")
# Read from the data stream once every second
ssc = StreamingContext(sc, 1)


# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

This code reads, every 1 s, the data typed into port 9999 during that interval and computes a word count over what it received.
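The per-batch transformations (flatMap → map → reduceByKey) can be sketched in plain Python, without Spark; `batch_word_count` and the sample lines below are hypothetical names used only for illustration:

```python
from collections import Counter

def batch_word_count(batch):
    """Mimic one batch of the DStream pipeline:
    flatMap(split) -> map(word, 1) -> reduceByKey(+)."""
    # flatMap: split every line into words, flattening into one list
    words = [word for line in batch for word in line.split(" ")]
    # map + reduceByKey: pair each word with 1, then sum the counts per word
    return dict(Counter(words))

# Example: the lines typed into nc during one 1-second interval
print(batch_word_count(["hello world", "hello spark"]))
# → {'hello': 2, 'world': 1, 'spark': 1}
```

Spark runs this same logic in parallel across partitions; the sketch only shows what a single batch produces.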

3. Run the code above with spark-submit --master local sparkstreaming.py.

   When you type data into the window from step 1, the word-count results appear in the window running Spark.
