Getting Started with Hadoop Big Data: Basic Usage of HDFS and MapReduce

Original

Matrix_x

2019-06-19 02:58

1. Analyzing and Processing the Dataset

Dataset description:

Search activity on a search engine during a single day in 2011.

The dataset has 6 tab-separated columns: time, UID, search keyword, the rank of the result that was selected, number of searches, and URL.
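To make the column layout concrete, here is a minimal sketch of parsing one record into named fields. The field names (`time`, `uid`, `keyword`, `rank`, `search_count`, `url`) are hypothetical labels I chose for the six columns, and the sample line is made up for illustration; it is not taken from the dataset.

```python
from typing import NamedTuple


class SearchRecord(NamedTuple):
    # Hypothetical field names for the six columns described above.
    time: str
    uid: str
    keyword: str
    rank: str
    search_count: str
    url: str


def parse_record(line: str) -> SearchRecord:
    """Split one tab-separated line into its six columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    return SearchRecord(*fields)


# Made-up sample line, for illustration only (not real data).
sample = "20111230000005\tuser001\thadoop\t1\t1\thttp://example.com/\n"
record = parse_record(sample)
print(record.keyword)
```

Keeping every column as a string avoids guessing at types; only the keyword column is needed for the frequency count.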

At this early stage, my goal is to compute the frequency of each search keyword.
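As a rough picture of what a keyword frequency count produces, the sketch below counts keywords in plain Python on a tiny made-up sample. It assumes an input with one keyword per line and treats each whole line as a single keyword. The function name `count_keywords` is hypothetical, and this is only a local illustration, not the MapReduce job itself.

```python
from collections import Counter
from typing import Iterable


def count_keywords(lines: Iterable[str]) -> Counter:
    """Count how often each keyword occurs, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        keyword = line.strip()  # drop trailing spaces and newlines
        if keyword:
            counts[keyword] += 1
    return counts


# Made-up sample input, for illustration only.
sample_keywords = ["hadoop\n", "hdfs\n", "hadoop \n", "\n"]
counts = count_keywords(sample_keywords)
print(counts.most_common(2))
```

A single-process counter like this is mainly useful for sanity-checking results on a small sample before running on the full dataset.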

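First, I need to preprocess the dataset. I wrote a Python program that extracts only the keyword column into a new file, in preparation for loading it into Hadoop's HDFS in the next step. The processing code is as follows: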

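# Data preprocessing: extract the keyword column (third field) from each record
path = r".\sogou.500w.utf8"   # source data (raw string so the backslash is literal)
count = 0

# Stream line by line instead of holding all 5 million records in memory;
# the with-statement closes both files even if an error occurs.
with open(path, 'r', encoding='UTF-8') as src, \
     open('sougou.txt', 'w', encoding='UTF-8') as out:   # extracted keyword file
    for line in src:
        fields = line.rstrip('\n').split('\t')
        keyword = fields[2:3]   # empty list if the line has fewer than 3 fields
        out.write(' '.join(keyword) + '\n')
        count += 1

print(count)   # total number of records processed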

After processing finishes, the program prints the total number of records, as shown in the figure below: 5 million in total.