Deep Learning - Word Frequency Statistics

Original

Vivinia_Vivinia

2020-06-23 12:05

Project download

Project page:

Target result:

Code:

import re
import pandas
import jieba
import numpy
import warnings
import matplotlib.pyplot as plt
from wordcloud import WordCloud
warnings.filterwarnings("ignore")   # suppress warnings

"""Read the file"""
with open('mavel.txt', encoding='utf-8') as f:   # the with block closes the file automatically
    fileContent = f.read()

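"""Process the data: keep only Chinese characters"""
pattern = re.compile(r'[\u4e00-\u9fa5]+')   # compiled pattern matching runs of Chinese characters; \u4e00 and \u9fa5 are the Unicode code points that bound the range of common Chinese characters
filterdata = re.findall(pattern, fileContent)   # findall returns a list of every matching run
cleaned_data = ''.join(filterdata)   # join the runs into one string that contains only Chinese characters
segs = jieba.cut(cleaned_data)   # segment the text into words
with open('stopwords.txt', encoding='UTF-8') as f:   # open the stop-word file, one word per line
    stopword = set(line.strip() for line in f)   # a set keeps the membership test below fast
outstr = []   # holds every word that is not a stop word
for word in segs:   # iterate over the words produced by jieba
    if word not in stopword and word != '\t':   # keep the word only if it is not a stop word
        outstr.append(word)   # appending to a list avoids the empty entry that splitting a space-terminated string would add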

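"""Count the words and sort"""
segmentDataFrame = pandas.DataFrame({'segment': outstr})   # store the words in a DataFrame
segStat = segmentDataFrame.groupby(by=['segment'])['segment'].size()   # count how often each word occurs; segStat holds each word and its count
segStat = segStat.to_frame()   # convert the Series (roughly, a labelled array) into a DataFrame
segStat.columns = ['計數']   # the words form the index (its header prints one row lower); the single data column is renamed to '計數' ("count")
segStat = segStat.reset_index().sort_values('計數', ascending=False)   # reset_index() moves the words back into a column and adds a fresh index starting at 0; sort_values sorts by the '計數' column, largest first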

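"""Configure the display"""
wordcloud = WordCloud(
    font_path="C:\\Downloads\\simhei.ttf",   # download the simhei.ttf font file first (a Baidu search will find it), then put the path where you saved it here
    background_color='black',
    max_words = 20   # maximum number of words to display
)
words = segStat.set_index('segment').to_dict()   # to_dict converts the DataFrame into a dict of the form {column: {word: count}}
wordcloud.fit_words(words['計數'])   # build the word cloud from the word frequencies
plt.imshow(wordcloud)
plt.axis('off')   # hide the axes
plt.show()
plt.close()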
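
To see what the character-filtering step in the script actually keeps, here is a minimal sketch that runs the same pattern on a short string. The sample text is made up for illustration and does not come from mavel.txt.

```python
import re

# Same pattern as in the script: one or more characters in U+4E00..U+9FA5.
pattern = re.compile(r'[\u4e00-\u9fa5]+')

# Hypothetical sample mixing Chinese, Latin letters, digits and punctuation.
sample = "鋼鐵俠 Iron Man，2008年上映！"

runs = pattern.findall(sample)   # consecutive runs of Chinese characters
cleaned = ''.join(runs)          # runs glued together, everything else dropped

print(runs)
print(cleaned)
```

Everything outside the range (spaces, Latin letters, digits, full-width punctuation) is discarded, so sentence boundaries are lost before jieba sees the text. That is fine for counting words, but it means words on either side of a removed character become neighbours.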

The comments are fairly detailed, so I won't say much more. I just like pasting code, la la la~
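
For anyone who wants to check the counting step without pandas, the same word-to-count mapping can probably be built with the standard library alone. This is only a sketch: it swaps in collections.Counter for the groupby approach used in the script, and the token list is a made-up stand-in for the filtered jieba output.

```python
from collections import Counter

# Hypothetical token list, standing in for the stop-word-filtered jieba output.
tokens = ["英雄", "宇宙", "英雄", "電影", "宇宙", "英雄"]

counts = Counter(tokens)        # maps each word to its number of occurrences
ranked = counts.most_common()   # (word, count) pairs, most frequent first
frequencies = dict(ranked)      # plain dict of word -> count

print(ranked)
```

A dict shaped like frequencies is the same kind of word-to-count mapping that the script passes to wordcloud.fit_words.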

