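I just finished reading 秦吏 and wanted to know which character, apart from 黑夫, appears most often, so I ran a simple analysis with Python's jieba library.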
import jieba
import wordcloud

# Read the whole novel; the with-block closes the file afterwards
with open("秦吏.txt", 'r', encoding='utf-8') as f:
    txt = f.read()
excludes = {"他們","自己","一個","沒有","就是","雖然","還是","不是","知道","已經","繼續","什麼","有些","只是","因爲","衆人","還有","如此","眼下","如今","所以", "那些",
"將軍","起來","這些","不過","開始","可以","只能","郡守","甚至","便是","這個","不能","這是","最後","出來","作爲","說道","看着","於是","一樣","過去","地方",
"以爲","時候","覺得","兵卒","爲了","可能","立刻","而是","現在","之後","今日","發現","不知","二人","不會","這樣","除了","這種","如何","這麼","只有","真是",
"不少","官府","大軍","恐怕","依然","看到","一直","都尉","的話","離開","不敢","不同","幾個","一起","卻是","十分","郡尉","需要","時間","下來","這時候","爲何",
"一邊","有人","抵達","記住","一些","兩個","當年","明白","得到","此事","一般", "聽說", "南方", "後世", "一點", "看來", "無法", "心裏", "com", "閱讀網", "www",
"mayitxt","只要","才能","東西","希望","想要","一次","的確","必須","士卒","糧食","一切","過來","戰爭","商賈","這場","搖頭","朝廷","就算","楚人","這裏","回來",
"律令","官吏","不必","當然","認爲","秦朝","秦人","匈奴","秦軍","咸陽","天下","秦國","膠東","南郡","楚國","安陸","關中","秦吏","百姓","中原","螞蟻","楚軍",
"直接",}
words = jieba.lcut(txt)  # precise-mode segmentation
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens
    elif word == "亭長" or word == "武忠侯":
        rword = "黑夫"
    elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
        rword = "始皇"
    elif word == "公子":
        rword = "扶蘇"
    elif word == "東門":
        rword = "東門豹"
    elif word == "大鬍子" or word == "美髯公":
        rword = "劉季"
    else:
        rword = word
    # Count under the merged name so that aliases add to the same total
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError when an excluded word never occurred
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for word, count in items[:15]:  # slicing is safe even with fewer than 15 entries
    print("{0:<15}{1:>5}".format(word, count))
filtered = " ".join(words)
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
The approach is to segment the text with jieba's precise mode, which is the default for jieba.lcut
words = jieba.lcut(txt)
Then drop the single-character words
if len(word) == 1:
    continue
Merge the different words that refer to the same character into one name
elif word == "嬴政" or word == "始皇" or word == "皇帝" or word == "陛下" or word == "始皇帝" or word == "秦王":
    rword = "始皇"
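The merging step above could also be written with a lookup table instead of a long elif chain. This is only a sketch: `ALIASES` and `count_characters` are names I made up for illustration, and the table entries simply mirror the elif branches from the full script.

```python
from collections import Counter

# Hypothetical alias table: each alternative form maps to one canonical name.
# The entries mirror the elif chain in the script.
ALIASES = {
    "亭長": "黑夫", "武忠侯": "黑夫",
    "嬴政": "始皇", "皇帝": "始皇", "陛下": "始皇", "始皇帝": "始皇", "秦王": "始皇",
    "公子": "扶蘇",
    "東門": "東門豹",
    "大鬍子": "劉季", "美髯公": "劉季",
}

def count_characters(words, aliases=ALIASES):
    """Count words of two or more characters, merging aliases first."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # skip single-character tokens
        # Fall back to the word itself when it has no alias
        counts[aliases.get(word, word)] += 1
    return counts

sample = ["黑夫", "亭長", "的", "陛下", "始皇", "武忠侯"]
print(count_characters(sample))
```

Adding a new alias then means adding one dictionary entry rather than another branch.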
Sort by number of appearances
items.sort(key=lambda x:x[1],reverse=True)
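For the ranking step, the standard library's `collections.Counter` offers `most_common`, which sorts and truncates in one call. A minimal sketch, assuming the counts are already in a plain dict; `top_n` and `example_counts` are hypothetical names and the numbers are made up, not taken from the novel.

```python
from collections import Counter

def top_n(counts, n=15):
    """Return up to n (word, count) pairs, highest count first."""
    return Counter(counts).most_common(n)

# Made-up counts, for illustration only
example_counts = {"黑夫": 5, "扶蘇": 2, "陳平": 3}
for word, count in top_n(example_counts, n=2):
    print("{0:<15}{1:>5}".format(word, count))
```

Asking for more entries than exist simply returns everything, so there is no index error on short inputs.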
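The first run exposed a problem: words I did not want, such as 他們 and 自己, were among the most frequent, so I filtered the list over several rounds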
excludes = {"他們","自己","一個","沒有","就是","雖然","還是","不是","知道","已經","繼續","什麼","有些","只是","因爲","衆人","還有","如此","眼下","如今","所以", "那些",
"將軍","起來","這些","不過","開始","可以","只能","郡守","甚至","便是","這個","不能","這是","最後","出來","作爲","說道","看着","於是","一樣","過去","地方",
"以爲","時候","覺得","兵卒","爲了","可能","立刻","而是","現在","之後","今日","發現","不知","二人","不會","這樣","除了","這種","如何","這麼","只有","真是",
"不少","官府","大軍","恐怕","依然","看到","一直","都尉","的話","離開","不敢","不同","幾個","一起","卻是","十分","郡尉","需要","時間","下來","這時候","爲何",
"一邊","有人","抵達","記住","一些","兩個","當年","明白","得到","此事","一般", "聽說", "南方", "後世", "一點", "看來", "無法", "心裏", "com", "閱讀網", "www",
"mayitxt","只要","才能","東西","希望","想要","一次","的確","必須","士卒","糧食","一切","過來","戰爭","商賈","這場","搖頭","朝廷","就算","楚人","這裏","回來",
"律令","官吏","不必","當然","認爲","秦朝","秦人","匈奴","秦軍","咸陽","天下","秦國","膠東","南郡","楚國","安陸","關中","秦吏","百姓","中原","螞蟻","楚軍",
"直接",}
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError when an excluded word never occurred
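One thing to watch in the filtering step is an excluded word that never appears in the text, since deleting a missing key raises an error. A small sketch of a filter that sidesteps this by building a new dict; `drop_excluded` is a hypothetical helper and the sample counts are made up.

```python
def drop_excluded(counts, excludes):
    """Return a new dict without the excluded words; absent words are ignored."""
    return {word: n for word, n in counts.items() if word not in excludes}

# Made-up counts, for illustration only
counts = {"黑夫": 3, "他們": 7, "自己": 4}
excludes = {"他們", "自己", "雖然"}  # "雖然" is not in counts, which is fine here
print(drop_excluded(counts, excludes))
```

Because `excludes` is a set, each membership check is cheap even when the list grows long.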
The final output was
黑夫 16914
秦始皇 2705
扶蘇 1940
陳平 1427
韓信 1398
李斯 1105
趙高 923
季嬰 899
李由 810
劉季 772
李信 726
王賁 702
張蒼 641
蕭何 612
項籍 583
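What I did not expect was that 始皇 and 扶蘇 would rank second and third
Of course the code still has plenty of problems, and many factors were not taken into account
Finally, a word cloud gives a visual view of the result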
w = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700)
w.generate(filtered)
w.to_file("秦吏" + ".png")
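In the script, `filtered` is just every segmented word joined with spaces, so the excluded words still reach the word cloud. A possible way to apply the same filtering first is sketched below; `build_cloud_text` is a hypothetical helper, not part of jieba or wordcloud, and the sample words are made up.

```python
def build_cloud_text(words, excludes):
    """Join the words that survive filtering into one space-separated string."""
    kept = [w for w in words if len(w) > 1 and w not in excludes]
    return " ".join(kept)

# Made-up token list, for illustration only
sample_words = ["黑夫", "的", "他們", "扶蘇", "黑夫"]
cloud_text = build_cloud_text(sample_words, {"他們"})
print(cloud_text)
# cloud_text could then be passed to w.generate(...) in place of `filtered`.
```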