Python3: Extracting Keywords from Article Titles

Original

Muzi_Water

2018-11-14 00:38

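Approach: (1) read all the article titles; (2) split each title into words with the jieba ("結巴分詞") toolkit; (3) compute TF-IDF (term frequency-inverse document frequency) with the sklearn toolkit; (4) keep the words whose weight meets the keyword threshold.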

For jieba, see: jieba on GitHub

For sklearn, see: Text feature extraction, section 4.2.3.4 Tf-idf term weighting

import os
import jieba
import sys
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

sys.path.append("../")
jieba.load_userdict('userdictTest.txt')
STOP_WORDS = set((
        "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
    ))

def getFileList(path):
    filelist = []
    files = os.listdir(path)
    for f in files:
        if f[0] == '.':
            pass
        else:
            filelist.append(f)
    return filelist, path

def fenci(filename, path, segPath):
    # Folder where the segmentation results are saved
    os.makedirs(segPath, exist_ok=True)
    # The file name is the article title, so the name itself is segmented;
    # "txt" is in STOP_WORDS so that the extension is dropped
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # Keep tokens of at least two characters that are not stop words
        if len(seg) >= 2 and seg.lower() not in STOP_WORDS:
            result.append(seg)

    # Join the tokens with spaces and save them locally
    with open(os.path.join(segPath, filename + "-seg.txt"), "w", encoding="utf-8") as f:
        f.write(' '.join(result))

def Tfidf(filelist, sFilePath, path, tfidfw):
    # Load the segmented titles; path is the folder written by fenci
    corpus = []
    for ff in filelist:
        fname = os.path.join(path, ff + "-seg.txt")
        with open(fname, 'r', encoding="utf-8") as f:
            corpus.append(f.read())

    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfvector = vectorizer.fit_transform(corpus)
    tfidf = transformer.fit_transform(tfvector)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    word = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()

    os.makedirs(sFilePath, exist_ok=True)

    for i in range(len(weight)):
        print('----------writing all the tf-idf in the ', i, 'file into ', sFilePath + '/', i, ".txt----------")
        # Keep only the words whose weight reaches the threshold tfidfw
        result = {}
        for j in range(len(word)):
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        with open(os.path.join(sFilePath, str(i) + ".txt"), 'w', encoding="utf-8") as f:
            for w, score in resultsort:
                f.write(w + " " + str(score) + '\n')
                print(w + " " + str(score))
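
The filtering rule inside fenci can be tried on its own, without jieba installed. The sketch below is only an illustration: the helper name filter_tokens and the token list are hypothetical (the list is a guess at how a title might be segmented, not recorded jieba output). The stop words are copied from the script above.

```python
# Stop words copied from the script above
STOP_WORDS = {"基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"}

def filter_tokens(tokens):
    """Keep tokens of at least two characters that are not stop words."""
    kept = []
    for tok in tokens:
        tok = "".join(tok.split())  # remove any whitespace inside the token
        if len(tok) >= 2 and tok.lower() not in STOP_WORDS:
            kept.append(tok)
    return kept

# Hypothetical segmentation of the first title; real jieba output may differ
tokens = ["農業", "大數據", "研究", "與", "應用", "進展", "綜述", ".", "txt"]
print(filter_tokens(tokens))
```

Single-character tokens such as "與" and "." fail the length test, and the file extension is removed only because "txt" is listed as a stop word.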

tfvector = vectorizer.fit_transform(corpus)
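vectorizer.fit_transform turns the segmented words stored in corpus into a term-count matrix. It first builds a dictionary and a sparse matrix from the words of all titles: the dictionary looks like {'農業': 0, '大數據': 1, ...}, and each entry of the matrix records (title index, feature index of the word in the dictionary) together with the count. It then sorts the words in the dictionary and renumbers them, updates the feature indices in the matrix to match, and finally returns the matrix (a scipy.sparse.csr_matrix).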
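
The two passes described here can be imitated in a few lines of plain Python. This is a simplified sketch of the idea, not scikit-learn's actual code: build_count_matrix is a hypothetical helper, and a dict keyed by (title index, feature index) stands in for the sparse matrix.

```python
from collections import Counter

def build_count_matrix(docs):
    """Return (vocabulary, entries) for space-separated documents."""
    vocab = {}
    raw = {}
    # Pass 1: number each new word in first-seen order and count occurrences
    for i, doc in enumerate(docs):
        for tok, cnt in Counter(doc.split()).items():
            j = vocab.setdefault(tok, len(vocab))
            raw[(i, j)] = cnt
    # Pass 2: sort the words, renumber them, and remap the matrix entries
    sorted_terms = sorted(vocab)
    remap = {vocab[t]: new for new, t in enumerate(sorted_terms)}
    vocab = {t: new for new, t in enumerate(sorted_terms)}
    entries = {(i, remap[j]): c for (i, j), c in raw.items()}
    return vocab, entries

# Two of the segmented titles from the example run
docs = ["農業 大數據", "特徵 趨勢 統計 大數據"]
vocab, entries = build_count_matrix(docs)
print(vocab)
print(entries)
```

After the second pass the feature indices follow the sorted order of the words, so the same word always lands in the same column no matter which title mentioned it first.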

tfidf = transformer.fit_transform(tfvector)

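transformer.fit_transform computes a weight for every word from the sparse count matrix stored in tfvector. With the default settings (smooth_idf=True), the inverse document frequency is

$\mathrm{idf}(t) = \log\left(\frac{1+n}{1+df_t}\right) + 1$

where $n$ is the total number of documents and $df_t$ is the number of documents that contain the term $t$. The weight of a term in a title is its count multiplied by $\mathrm{idf}(t)$, and each title's weight vector is then scaled to unit Euclidean length (the default norm='l2').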
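
As a check on the formula, the weights can be recomputed by hand in plain Python. The sketch below assumes the default TfidfTransformer settings (smoothed idf with the natural logarithm, then L2 normalization) and uses token lists reconstructed from the keyword listing in this post, not the actual intermediate -seg.txt files; tfidf_weights is a hypothetical helper.

```python
import math

# Token lists reconstructed from the keyword listing in this post (an assumption)
docs = [
    ["農業", "大數據"],
    ["hadoop", "分佈式", "並行增量", "爬蟲"],
    ["rpa", "優化", "服務中心", "流程", "財務共享", "賬表覈對"],
    ["特徵", "統計", "趨勢", "大數據"],
    ["大數據平臺", "異常", "監測", "網絡", "風險"],
    ["多源異構數據", "數據中心", "統一訪問"],
]

def tfidf_weights(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Smoothed idf: log((1 + n) / (1 + df_t)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in docs:
        raw = {t: doc.count(t) * idf[t] for t in set(doc)}
        norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

weights = tfidf_weights(docs)
print(weights[0])
print(weights[3])
```

Here "大數據" appears in two of the six titles, so its idf is log(7/3) + 1, while a word found in only one title gets log(7/2) + 1. This is why "大數據" scores lower than its neighbours in titles 0 and 3, and why titles whose words are all unique get equal weights of 1/sqrt(k) for k words.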

The following six article titles are used as an example of keyword extraction:

Using jieba on 農業大數據研究與應用進展綜述.txt
Using jieba on 基於Hadoop的分佈式並行增量爬蟲技術研究.txt
Using jieba on 基於RPA的財務共享服務中心賬表覈對流程優化.txt
Using jieba on 基於大數據的特徵趨勢統計系統設計.txt
Using jieba on 網絡大數據平臺異常風險監測系統設計.txt
Using jieba on 面向數據中心的多源異構數據統一訪問框架.txt
----------writing all the tf-idf in the 0 file into ./keywords/ 0 .txt----------
農業 0.773262366783
大數據 0.634086202434
----------writing all the tf-idf in the 1 file into ./keywords/ 1 .txt----------
hadoop 0.5
分佈式 0.5
並行增量 0.5
爬蟲 0.5

----------writing all the tf-idf in the 2 file into ./keywords/ 2 .txt----------
rpa 0.408248290464
優化 0.408248290464
服務中心 0.408248290464
流程 0.408248290464
財務共享 0.408248290464
賬表覈對 0.408248290464
----------writing all the tf-idf in the 3 file into ./keywords/ 3 .txt----------
特徵 0.521823488025
統計 0.521823488025
趨勢 0.521823488025
大數據 0.427902724969
----------writing all the tf-idf in the 4 file into ./keywords/ 4 .txt----------
大數據平臺 0.4472135955
異常 0.4472135955
監測 0.4472135955
網絡 0.4472135955
風險 0.4472135955
----------writing all the tf-idf in the 5 file into ./keywords/ 5 .txt----------
多源異構數據 0.57735026919
數據中心 0.57735026919
統一訪問 0.57735026919
