A simple analysis of 《秦吏》 with the jieba library

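I just finished reading 《秦吏》 and wanted to know which character, apart from 黑夫, appears most often, so I ran a simple analysis with Python's jieba library.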

import jieba
import wordcloud

# Read the whole novel into memory as one string.
with open("秦吏.txt", "r", encoding="utf-8") as f:
    txt = f.read()
excludes = {"他們","自己","一個","沒有","就是","雖然","還是","不是","知道","已經","繼續","什麼","有些","只是","因爲","衆人","還有","如此","眼下","如今","所以", "那些",
          "將軍","起來","這些","不過","開始","可以","只能","郡守","甚至","便是","這個","不能","這是","最後","出來","作爲","說道","看着","於是","一樣","過去","地方",
          "以爲","時候","覺得","兵卒","爲了","可能","立刻","而是","現在","之後","今日","發現","不知","二人","不會","這樣","除了","這種","如何","這麼","只有","真是",
          "不少","官府","大軍","恐怕","依然","看到","一直","都尉","的話","離開","不敢","不同","幾個","一起","卻是","十分","郡尉","需要","時間","下來","這時候","爲何",
          "一邊","有人","抵達","記住","一些","兩個","當年","明白","得到","此事","一般", "聽說", "南方", "後世", "一點", "看來", "無法", "心裏", "com", "閱讀網", "www",
          "mayitxt","只要","才能","東西","希望","想要","一次","的確","必須","士卒","糧食","一切","過來","戰爭","商賈","這場","搖頭","朝廷","就算","楚人","這裏","回來",
          "律令","官吏","不必","當然","認爲","秦朝","秦人","匈奴","秦軍","咸陽","天下","秦國","膠東","南郡","楚國","安陸","關中","秦吏","百姓","中原","螞蟻","楚軍",
          "直接",}
# Segment the text; jieba.lcut uses precise mode by default.
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens.
        continue
    elif word == "亭長" or word == "武忠侯":
        rword = "黑夫"
    elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
        rword = "始皇"
    elif word == "公子":
        rword = "扶蘇"
    elif word == "東門":
        rword = "東門豹"
    elif word == "大鬍子" or word == "美髯公":
        rword = "劉季"
    else:
        rword = word
    # Count every token under its canonical name, so aliases add to the character's total.
    counts[rword] = counts.get(rword, 0) + 1
# Drop excluded words; pop with a default avoids a KeyError if a word never occurred.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<15}{1:>5}".format(word,count))

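# Keep only multi-character tokens that are not in the exclusion set.
filtered = " ".join(tok for tok in words if len(tok) > 1 and tok not in excludes)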
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")

The idea is to segment the text with jieba in precise mode:

words = jieba.lcut(txt)

Then drop single-character words:

if len(word) == 1:
    continue
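
As a rough sketch of these first two steps, assuming jieba is installed: the helper names `segment` and `drop_single_chars` and the `extra_names` parameter are made up for illustration and are not part of my script. `jieba.lcut` uses precise mode by default, and `jieba.add_word` registers a custom word so that a name is more likely to stay in one piece.

```python
def segment(text, extra_names=()):
    """Segment text with jieba in precise mode (the default for lcut)."""
    import jieba  # third-party; imported here so the helper below works without it

    for name in extra_names:
        jieba.add_word(name)  # hint to jieba that this string is one word
    return jieba.lcut(text)


def drop_single_chars(tokens):
    """Keep only tokens that are at least two characters long."""
    return [tok for tok in tokens if len(tok) > 1]


# Hand-written tokens standing in for real jieba output.
sample = ["黑夫", "在", "安陸", "當", "亭長"]
print(drop_single_chars(sample))
```

Registering names such as 東門豹 before segmenting might reduce the need to map fragments like 東門 back to the full name afterwards, though I have not checked how jieba splits them in this text.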

Merge the different words that refer to the same character into a single name:

elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
    rword = "始皇"
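
The same merging can be written as a lookup table instead of a long if/elif chain. This is only a sketch: the names `ALIASES` and `count_names` are invented for illustration, the mapping is copied from the chain in the full script, and the token list is hand-written rather than real jieba output.

```python
from collections import Counter

# Alias -> canonical name, copied from the if/elif chain in the script.
ALIASES = {
    "亭長": "黑夫", "武忠侯": "黑夫",
    "嬴政": "始皇", "皇帝": "始皇", "陛下": "始皇", "始皇帝": "始皇", "秦王": "始皇",
    "公子": "扶蘇",
    "東門": "東門豹",
    "大鬍子": "劉季", "美髯公": "劉季",
}


def count_names(tokens):
    """Count tokens of two or more characters under their canonical name."""
    counts = Counter()
    for tok in tokens:
        if len(tok) < 2:
            continue
        # Tokens without an alias entry are counted under their own spelling.
        counts[ALIASES.get(tok, tok)] += 1
    return counts


# Hand-written tokens standing in for real jieba output.
tokens = ["黑夫", "亭長", "陛下", "的", "始皇", "皇帝", "公子"]
print(count_names(tokens))
```

With a table like this, adding another alias is a one-line change, and every token is counted under whichever name it resolves to.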

Sort by the number of appearances:

items.sort(key=lambda x:x[1],reverse=True)
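
A small self-contained illustration of the ranking step. The counts are made up and are not results from the novel, and `top_n` is a hypothetical helper name. Slicing the sorted list returns fewer rows when the dictionary is small, instead of indexing past the end.

```python
from collections import Counter

# Made-up counts for illustration only; not results from the novel.
counts = Counter({"甲": 5, "乙": 12, "丙": 8, "丁": 1})


def top_n(counts, n):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


for word, count in top_n(counts, 3):
    print("{0:<15}{1:>5}".format(word, count))
```

For a `Counter`, `counts.most_common(3)` gives the same ranking here without writing the sort by hand.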

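The first run exposed a problem: some high-frequency words, such as 他們 and 自己, are not the kind of word I am after, so I filtered them out over several passes: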

excludes = {"他們","自己","一個","沒有","就是","雖然","還是","不是","知道","已經","繼續","什麼","有些","只是","因爲","衆人","還有","如此","眼下","如今","所以", "那些",
          "將軍","起來","這些","不過","開始","可以","只能","郡守","甚至","便是","這個","不能","這是","最後","出來","作爲","說道","看着","於是","一樣","過去","地方",
          "以爲","時候","覺得","兵卒","爲了","可能","立刻","而是","現在","之後","今日","發現","不知","二人","不會","這樣","除了","這種","如何","這麼","只有","真是",
          "不少","官府","大軍","恐怕","依然","看到","一直","都尉","的話","離開","不敢","不同","幾個","一起","卻是","十分","郡尉","需要","時間","下來","這時候","爲何",
          "一邊","有人","抵達","記住","一些","兩個","當年","明白","得到","此事","一般", "聽說", "南方", "後世", "一點", "看來", "無法", "心裏", "com", "閱讀網", "www",
          "mayitxt","只要","才能","東西","希望","想要","一次","的確","必須","士卒","糧食","一切","過來","戰爭","商賈","這場","搖頭","朝廷","就算","楚人","這裏","回來",
          "律令","官吏","不必","當然","認爲","秦朝","秦人","匈奴","秦軍","咸陽","天下","秦國","膠東","南郡","楚國","安陸","關中","秦吏","百姓","中原","螞蟻","楚軍",
          "直接",}
for word in excludes:
    counts.pop(word, None)  # no error if the word never occurred
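
A variant of the filtering step that builds a new dictionary instead of removing keys in place, so excluded words that never occurred in the text are simply skipped. This is a sketch under assumptions: `remove_excluded` is a hypothetical helper, the exclusion set is a three-word excerpt, and the counts are made up.

```python
from collections import Counter

# A short excerpt of the exclusion set, just for the demo.
excludes = {"他們", "自己", "一個"}


def remove_excluded(counts, excludes):
    """Return a new Counter without the excluded words."""
    return Counter({w: c for w, c in counts.items() if w not in excludes})


# Made-up counts for illustration only; "一個" is deliberately absent.
counts = Counter({"黑夫": 3, "他們": 7, "自己": 4})
cleaned = remove_excluded(counts, excludes)
print(cleaned)
```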

The final output of my run was:

黑夫             16914
秦始皇            2705
扶蘇              1940
陳平              1427
韓信              1398
李斯              1105
趙高               923
季嬰               899
李由               810
劉季               772
李信               726
王賁               702
張蒼               641
蕭何               612
項籍               583

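What I did not expect was that 始皇 and 扶蘇 would rank second and third.
Of course, the code still has plenty of problems and leaves many factors out of account; for example, the output lists 秦始皇 as its own token, which the alias rule for 始皇 does not cover.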

Finally, visualize the result with a word cloud:

w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
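
`WordCloud.generate` takes raw text and does its own tokenizing and counting. If the cloud should reflect a count dictionary such as `counts`, `WordCloud.generate_from_frequencies` accepts a word-to-count mapping directly. The sketch below assumes the wordcloud package and the `msyh.ttc` font are available; `top_frequencies` and `render_cloud` are hypothetical helper names, and the demo counts are made up.

```python
def top_frequencies(counts, n=100):
    """Return the n most frequent words as a plain dict of word -> count."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def render_cloud(frequencies, out_path, font_path="msyh.ttc"):
    """Draw a word cloud from a word -> count mapping and save it as an image."""
    import wordcloud  # third-party; imported lazily so top_frequencies has no dependency

    cloud = wordcloud.WordCloud(font_path=font_path, width=1000, height=700)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Made-up counts for illustration only; not results from the novel.
demo = {"甲": 5, "乙": 12, "丙": 8}
print(top_frequencies(demo, n=2))

# Not run here because it needs the wordcloud package and the font file:
# render_cloud(top_frequencies(counts), "秦吏.png")
```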
