Natural Language Processing Toolkit - NLTK

Text Tokenization

Tokenization-related APIs:

import nltk.tokenize as tk
# Split the text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the text into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the text into words, with punctuation as separate tokens; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
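
Note: sent_tokenize and word_tokenize depend on NLTK's pretrained Punkt models, which are downloaded separately from the package itself. A minimal one-time setup sketch (the exact resource name can vary between NLTK versions):

import nltk

# One-time download of the Punkt sentence tokenizer models used by
# sent_tokenize and word_tokenize; recent NLTK releases may ask for
# the 'punkt_tab' resource instead, depending on the installed version.
nltk.download('punkt')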

Example:

import nltk.tokenize as tk
doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)


Stemming

The part of speech and tense of a word in a text sample have little impact on semantic analysis, so words are reduced to their stems before further processing.

Stemming-related APIs:

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

stemmer = pt.PorterStemmer()             # Porter stemmer: relatively lenient
stemmer = lc.LancasterStemmer()          # Lancaster stemmer: relatively aggressive
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer: a middle ground
r = stemmer.stem('playing')              # Extract the stem of the word 'playing'

Example:

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
    
   table     tabl     tabl     tabl
probably  probabl     prob  probabl
  wolves     wolv     wolv     wolv
 playing     play     play     play
      is       is       is       is
     dog      dog      dog      dog
     the      the      the      the
 beaches    beach    beach    beach
grounded   ground   ground   ground
  dreamt   dreamt   dreamt   dreamt
envision    envis    envid    envis
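
To see how tokenization and stemming fit together in practice, here is a minimal sketch (the sample sentence is made up for illustration) that stems every token of a sentence with the Snowball stemmer:

import nltk.tokenize as tk
import nltk.stem.snowball as sb

sentence = "The wolves were playing on the beaches."  # made-up sample sentence
stemmer = sb.SnowballStemmer('english')
# Tokenize the sentence first, then stem each token
stems = [stemmer.stem(token) for token in tk.word_tokenize(sentence)]
print(stems)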

Lemmatization

Lemmatization serves a similar purpose to stemming but is better suited for later manual processing, because some stems are not real words and are harder for people to read. Lemmatization restores plural nouns to their singular form and verb participles to their base form.

Lemmatization-related APIs:

import nltk.stem as ns
# Create a lemmatizer object
lemmatizer = ns.WordNetLemmatizer()
# Lemmatize the word as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# Lemmatize the word as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')

Example:

import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))

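The example above lemmatizes each word both as a noun and as a verb. In practice, a part-of-speech tagger can choose the pos argument automatically. Below is a minimal sketch of that idea; it assumes the 'averaged_perceptron_tagger' and 'wordnet' resources are available, and the tag-mapping helper is only an illustration:

import nltk
import nltk.stem as ns

# One-time downloads; resource names may differ slightly across NLTK versions
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def wordnet_pos(tag):
    # Map Penn Treebank tags returned by pos_tag to WordNet POS codes
    return {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}.get(tag[0], 'n')

lemmatizer = ns.WordNetLemmatizer()
words = ['wolves', 'playing', 'grounded', 'beaches']
for word, tag in nltk.pos_tag(words):
    print('%8s %8s' % (word, lemmatizer.lemmatize(word, pos=wordnet_pos(tag))))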

Bag-of-Words Model

The meaning of a sentence depends to a large extent on how often each word appears in it. We can therefore build a mathematical model in which every word that may appear across the sentences is a feature name, each sentence is a sample, and the number of times a word occurs in the sentence is the feature value. This is called the bag-of-words model.

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden

    the  brown  dog  is  running  black  in  room  forbidden
1     1      1    1   1        1      0   0     0          0
2     2      0    1   1        0      2   1     1          0
3     1      0    0   1        1      0   1     1          1
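
The counts in the table above can be reproduced with plain Python, which makes the construction explicit before handing it over to scikit-learn. A minimal sketch (the column order follows first appearance, whereas CountVectorizer below sorts feature names alphabetically):

import collections

sentences = ['The brown dog is running',
             'The black dog is in the black room',
             'Running in the room is forbidden']
# Vocabulary: every distinct word, in order of first appearance
vocab = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab.append(word)
print(vocab)
# One row of counts per sentence
for sentence in sentences:
    counts = collections.Counter(sentence.lower().split())
    print([counts[word] for word in vocab])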

Bag-of-words-related APIs:

import sklearn.feature_extraction.text as ft

# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Fit the model: every word that may appear across the sentences becomes a feature name,
# each sentence is a sample, and the count of a word in a sentence is the feature value.
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get all feature names (on scikit-learn >= 1.0 use get_feature_names_out() instead)
words = cv.get_feature_names()

Example:

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
[[0 1 1 0 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 2]
 [0 0 0 1 1 1 1 1 1]]
words = cv.get_feature_names()
print(words)
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
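
To make the matrix easier to read, the feature names can be paired with each sentence's counts. A minimal, self-contained sketch of that (on scikit-learn >= 1.0, cv.get_feature_names_out() replaces the call used here):

import sklearn.feature_extraction.text as ft

sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
words = cv.get_feature_names()  # cv.get_feature_names_out() on newer scikit-learn
# Pair each sentence with a dict of {feature name: count}
for sentence, row in zip(sentences, bow):
    print(sentence)
    print(dict(zip(words, row)))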
