Text Tokenization
Tokenization-related APIs:
import nltk.tokenize as tk
# Split the sample into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the sample into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the sample into words with a punctuation-aware tokenizer; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
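Both sent_tokenize and word_tokenize rely on NLTK's pretrained Punkt models, which are not bundled with the package and must be downloaded once. A minimal sketch of that setup step (recent NLTK releases may ask for the 'punkt_tab' resource instead):
import nltk

# One-time download of the Punkt models used by sent_tokenize/word_tokenize.
nltk.download('punkt')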
Example:
import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Stemming
The part of speech and tense of a word have little impact on semantic analysis, so words can be reduced to their stems before further processing.
Stemming-related APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer()  # Porter stemmer: relatively lenient
stemmer = lc.LancasterStemmer()  # Lancaster stemmer: relatively aggressive
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer: a middle ground
r = stemmer.stem('playing')  # extract the stem of the word 'playing'
Example:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
   table     tabl     tabl     tabl
probably  probabl     prob  probabl
  wolves     wolv     wolv     wolv
 playing     play     play     play
      is       is       is       is
     dog      dog      dog      dog
     the      the      the      the
 beaches    beach    beach    beach
grounded   ground   ground   ground
  dreamt   dreamt   dreamt   dreamt
envision    envis    envid    envis
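In practice the stemmers are applied to the tokens produced by the tokenization step above. A minimal sketch chaining the two (the sample sentence is assumed for illustration):
import nltk.tokenize as tk
import nltk.stem.snowball as sb

# Assumed sample text for illustration.
text = 'The wolves were playing on the beaches.'
stemmer = sb.SnowballStemmer('english')
# Tokenize first, then stem each token.
stems = [stemmer.stem(token) for token in tk.word_tokenize(text)]
print(stems)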
Lemmatization
Lemmatization serves a similar purpose to stemming, but its output is better suited to later manual processing: some stems are not valid words and are awkward to read. Lemmatization restores plural nouns to their singular form and verb participles to their base form.
Lemmatization-related APIs:
import nltk.stem as ns
# Get a lemmatizer object
lemmatizer = ns.WordNetLemmatizer()
# Lemmatize the word as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# Lemmatize the word as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
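WordNetLemmatizer looks words up in the WordNet corpus, which also has to be downloaded once. A minimal sketch (newer NLTK versions may additionally request the 'omw-1.4' resource):
import nltk

# One-time download of the WordNet corpus used by WordNetLemmatizer.
nltk.download('wordnet')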
Example:
import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))
Bag-of-Words Model
The meaning of a sentence depends to a large extent on how many times each word appears in it. We can therefore take every word that may appear as a feature name, treat each sentence as a sample, and use the count of each word in that sentence as the feature value. The resulting representation is called the bag-of-words model. For example:
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden

| sentence | the | brown | dog | is | running | black | in | room | forbidden |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
Bag-of-words modeling APIs:
import sklearn.feature_extraction.text as ft
# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Fit the model: every word that may appear is a feature name, each sentence
# is a sample, and the word's count in the sentence is the feature value.
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get all feature names (renamed to get_feature_names_out() in scikit-learn 1.0;
# get_feature_names() was removed in 1.2)
words = cv.get_feature_names()
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
[[0 1 1 0 0 1 0 1 1]
[2 0 1 0 1 1 1 0 2]
[0 0 0 1 1 1 1 1 1]]
words = cv.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.2
print(words)
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
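Once fitted, the same vectorizer can encode new text against the vocabulary it learned; words it has never seen are silently dropped. A minimal sketch continuing the example above (the new sentence is assumed for illustration):
# Encode a new sentence with the already-fitted vocabulary;
# the unseen word 'cat' is ignored by transform().
new_bow = cv.transform(['The black cat is running.']).toarray()
print(new_bow)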