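Data Mining: Text Analysis

Definition: Text analysis is the representation of text and the selection of its feature terms. It is a fundamental problem in text mining and information retrieval: feature words extracted from a text are quantified so that they can represent the information in that text.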
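
To make the idea of quantifying feature words concrete, here is a minimal sketch that assumes the text has already been split into tokens. The function name `term_frequencies` and the sample tokens are illustrative only and do not come from the original article.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each token to its relative frequency within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}

# Hypothetical, already-segmented document.
tokens = ["text", "mining", "text", "analysis"]
vector = term_frequencies(tokens)
print(vector)
```

The resulting dictionary is one simple numeric representation of a document; the rest of this article uses raw counts instead of relative frequencies.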

Analysis process:

1. Build the corpus (the collection of articles to analyze).
Key module: os

import os
import os.path
import codecs

# Lists that collect each file's path and content
filePaths = []
fileContents = []
# os.walk yields (directory, its subdirectories, its files)
for root, dirs, files in os.walk(
    # Corpus directory; on Windows, write each backslash as '\\'
    "C:\\Users\\Desktop\\Python\\DM\\Sample"
):
    for name in files:
        filePath = os.path.join(root, name)  # build the full file path
        filePaths.append(filePath)
        # codecs.open arguments: file path, mode, file encoding
        f = codecs.open(filePath, 'r', 'utf-8')
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

import pandas
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})
 
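When the files are read as UTF-8, any bytes that are not valid UTF-8 (for example, in a file saved with a different encoding) may raise an error such as:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
Solution: replace
f=codecs.open(filePath,'r','utf-8')
with
f=codecs.open(filePath,'r','gb18030',errors='ignore')
and the files can be read normally. Note that errors='ignore' silently drops any bytes that cannot be decoded.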
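
If the corpus mixes UTF-8 and GB18030 files, one option is to try the encodings in order and fall back to lossy decoding only as a last resort. The sketch below is an assumption about how that could look; `read_text` and the demo file are hypothetical and not part of the original code.

```python
import os
import tempfile

def read_text(path, encodings=("utf-8", "gb18030")):
    """Try each encoding in turn; decode lossily only if all of them fail."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: drop undecodable bytes, like errors='ignore' above.
    return raw.decode(encodings[-1], errors="ignore")

# Demo: write a GB18030-encoded file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write("文本分析".encode("gb18030"))
    content = read_text(sample)

print(content)
```

Trying UTF-8 first keeps correctly encoded files intact, whereas decoding everything as GB18030 with errors='ignore' can garble UTF-8 files without any warning.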

2. Chinese word segmentation with jieba ("結巴分詞")
Key library: jieba
Install: pip install jieba

import jieba
segments = []    # segmented words
filePaths = []   # path of the file each word came from
# Walk through the corpus and segment every document
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    # jieba.cut(text to segment) returns a generator of words
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})

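Note: if the segmentation result is not what you expect, try the following.

Add a custom word: jieba.add_word(word)  # word is the term to add
Load a custom dictionary: jieba.load_userdict(filePath)  # filePath is the path of the dictionary file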
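
A jieba user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below writes such a file; the helper names `write_userdict` and `load_into_jieba` and the sample entries are made up for illustration.

```python
import os
import tempfile

def write_userdict(path, entries):
    """Write (word, freq, tag) entries, one per line; freq and tag may be None."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            f.write(" ".join(parts) + "\n")

def load_into_jieba(path):
    """Register the dictionary with jieba (needs `pip install jieba`)."""
    import jieba
    jieba.load_userdict(path)

# Demo with hypothetical entries: write the file and read it back.
entries = [("數據挖掘", 10, "n"), ("文本分析", None, None)]
with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "userdict.txt")
    write_userdict(dict_path, entries)
    with open(dict_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Calling `load_into_jieba(dict_path)` before `jieba.cut` would then make jieba treat the listed terms as single words.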

3. Count word frequencies

# Count how often each segment occurs.
# Named aggregation replaces the dict form agg({"計數": numpy.size}),
# which is no longer supported since pandas 1.0. "計數" means "count".
segStat = segmentDataFrame.groupby(
    by="segment"
)["segment"].agg(計數="size")
# Turn the "segment" index back into an ordinary column
text = segStat.reset_index(
    drop=False
)

# Remove stop words
# Method 1: drop the stop words after counting.
stopwords = pandas.read_csv(
    "C:\\Users\\lls\\Desktop\\Python\\DM\\StopwordsCN.txt",
    encoding='utf8',
    index_col=False
)

fSegStat = text[
    ~text.segment.isin(stopwords.stopword)
]

# Method 2: filter out the stop words during segmentation.
import jieba
segments = []
filePaths = []
stopwordSet = set(stopwords.stopword)  # a set makes each lookup fast
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        # Skip stop words and whitespace-only segments
        if seg not in stopwordSet and len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})

segStattext = segmentDataFrame.groupby(
    by="segment"
)["segment"].agg(計數="size")

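Stop words: in information retrieval, certain characters or words are automatically filtered out before or after processing natural-language data (text) to save storage space and improve search efficiency. These characters or words are called stop words. Stop words are entered by hand rather than generated automatically, and together they form a stop word list. However, no single stop word list suits every tool.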
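
As a library-free illustration of the filtering step, the sketch below removes stop words from a token list. The function `remove_stopwords` and the English sample tokens are assumptions chosen for readability.

```python
def remove_stopwords(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    stopset = set(stopwords)  # set membership tests are fast
    return [t for t in tokens if t.strip() and t not in stopset]

# Hypothetical tokens and stop word list.
tokens = ["the", "corpus", " ", "of", "texts"]
filtered = remove_stopwords(tokens, ["the", "of"])
print(filtered)  # ['corpus', 'texts']
```

Converting the list to a set once pays off on a large corpus, because every token is checked against the whole stop word list.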

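Grouped statistics:
DataFrame.groupby(by=list of column names)[column to aggregate].agg(output column name=aggregation function)
(The older dict form, agg({'output name': function}), is no longer supported here since pandas 1.0.)

List membership: DataFrame.column.isin(list)

Negation: df[~df.column.isin(list)]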
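
The three patterns above can be tried on a tiny frame. This is a sketch assuming pandas 0.25 or later (for named aggregation); the sample data and the column name `count` are made up.

```python
import pandas

# Hypothetical segments from a tiny corpus.
df = pandas.DataFrame({"segment": ["a", "b", "a", "c", "a"]})

# Named aggregation: the keyword becomes the output column name.
stat = df.groupby("segment")["segment"].agg(count="size").reset_index()

# isin builds a boolean mask; ~ inverts it, so matching rows are dropped.
kept = stat[~stat.segment.isin(["c"])]
print(kept)
```

Here `stat` has one row per distinct segment, and `kept` is the same table without the rows whose segment appears in the exclusion list.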

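Presenting the data:

Word cloud

1. Download the wordcloud wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/
   Press Win + R to open a command window, then run: pip install wordcloud-1.2.1-cp35-cp35m-win_amd64.whl
   (On current Python versions, pip install wordcloud usually works directly.)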
   
2. Draw the word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(
    # A font with Chinese glyphs is required to render Chinese words
    font_path="C:\\Users\\lls\\Desktop\\Python\\DM\\2.4\\2.4\\simhei.ttf",
    background_color="black"
)
# Convert the frequency table to {column: {segment: count}}
words = text.set_index('segment').to_dict()
wordcloud.fit_words(words['計數'])
plt.imshow(wordcloud)

Summary:
Create the WordCloud object:
wordcloud = WordCloud(
    font_path='simhei.ttf',     # a font with Chinese glyphs
    background_color="black"    # background color
)
Draw: wordcloudImg = wordcloud.fit_words(dict)  # dict maps each word to its frequency

Styling the word cloud

import numpy
from PIL import Image
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
# scipy.misc.imread was removed in SciPy 1.2, so load the image with Pillow
bimg = numpy.array(
    Image.open("C:\\Users\\lls\\Desktop\\aaa.png").convert("RGB")
)
wordcloud = WordCloud(
    background_color="white",
    mask=bimg,     # the image defines the shape of the cloud
    font_path="C:\\Users\\lls\\Desktop\\Python\\simhei.ttf"
)
wordcloud = wordcloud.fit_words(words['計數'])
# Set the size of the figure
plt.figure(
    num=None,
    figsize=(8, 6), dpi=80,
    facecolor='w', edgecolor='k'
)
# Take the word colors from the image
bimgColors = ImageColorGenerator(bimg)
plt.axis("off")
plt.imshow(wordcloud.recolor(color_func=bimgColors))
plt.show()

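Key points:
Read the background image: bimg = imread(imgFilePath) (scipy.misc.imread was removed in SciPy 1.2; numpy.array(PIL.Image.open(imgFilePath)) is an equivalent)
Get the image colors: bimgColors = ImageColorGenerator(bimg)
Recolor the word cloud: wordcloud.recolor(color_func=bimgColors)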
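
The mask does not have to come from an image file. WordCloud treats pure white (255) entries of the mask as masked out and draws on the rest, so a shape can also be built with NumPy. The sketch below is an illustration; `circular_mask` and its default sizes are assumptions.

```python
import numpy as np

def circular_mask(size=300, radius=130):
    """Return a uint8 array: 0 inside the circle (drawable), 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = size // 2
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)

mask = circular_mask()
print(mask.shape)
```

Such an array could be passed as WordCloud(mask=mask, ...) to shape the cloud. It carries no color information, so it is not suitable for ImageColorGenerator, which needs a color image.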

Complete code:

import os
import os.path
import codecs

# Step 1: build the corpus
filePaths = []
fileContents = []
for root, dirs, files in os.walk(
    "C:\\Users\\lls\\Desktop\\Python\\aa"
):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'gb18030', errors='ignore')
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

import pandas
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})

# Step 2: segment the text and skip stop words
import jieba
segments = []
filePaths = []
stopwords = pandas.read_csv(
    "C:\\Users\\lls\\Desktop\\Python\\DM\\StopwordsCN.txt",
    encoding='utf8',
    index_col=False
)
stopwordSet = set(stopwords.stopword)
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        if seg not in stopwordSet and len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)

# Step 3: count word frequencies once, after the loop has finished
segmentDataFrame = pandas.DataFrame({
    'segment': segments,
    'filePath': filePaths
})
# Named aggregation; "計數" means "count"
segStat = segmentDataFrame.groupby(
    by="segment"
)["segment"].agg(計數="size")
text = segStat.reset_index(
    drop=False
)
# Step 4: draw a basic word cloud
import numpy
from PIL import Image
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

wordcloud = WordCloud(
    font_path="C:\\Users\\lls\\Desktop\\Python\\DM\\simhei.ttf",
    background_color="black"
)
words = text.set_index('segment').to_dict()
wordcloud.fit_words(words['計數'])
plt.imshow(wordcloud)

# Step 5: shape and color the word cloud from an image
# scipy.misc.imread was removed in SciPy 1.2, so load the image with Pillow
bimg = numpy.array(
    Image.open("C:\\Users\\lls\\Desktop\\aaa.png").convert("RGB")
)
wordcloud = WordCloud(
    background_color="white",
    mask=bimg, font_path="C:\\Users\\lls\\Desktop\\simhei.ttf"
)

wordcloud = wordcloud.fit_words(words['計數'])
plt.figure(
    num=None,
    figsize=(8, 6), dpi=80,
    facecolor='w', edgecolor='k'
)

bimgColors = ImageColorGenerator(bimg)

plt.axis("off")
plt.imshow(wordcloud.recolor(color_func=bimgColors))
plt.show()
