Text Tokenization
Tokenization-related APIs:
import nltk.tokenize as tk
# Split the sample into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the sample into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the sample into words with a punctuation-aware tokenizer; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
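Both sent_tokenize and word_tokenize rely on NLTK's pretrained Punkt models, which are not bundled with the package and must be downloaded once. A minimal sketch of that setup step (recent NLTK releases may ask for the 'punkt_tab' resource instead):
import nltk

# One-time download of the Punkt models used by sent_tokenize/word_tokenize.
nltk.download('punkt')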
Example:
import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Stemming
The part of speech and tense of a word have little impact on semantic analysis, so words can be reduced to their stems before further processing.
Stemming-related APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer()  # Porter stemmer: relatively lenient
stemmer = lc.LancasterStemmer()  # Lancaster stemmer: relatively aggressive
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer: a middle ground
r = stemmer.stem('playing')  # extract the stem of the word 'playing'
Example:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
   table     tabl     tabl     tabl
probably  probabl     prob  probabl
  wolves     wolv     wolv     wolv
 playing     play     play     play
      is       is       is       is
     dog      dog      dog      dog
     the      the      the      the
 beaches    beach    beach    beach
grounded   ground   ground   ground
  dreamt   dreamt   dreamt   dreamt
envision    envis    envid    envis
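In practice the stemmers are applied to the tokens produced by the tokenization step above. A minimal sketch chaining the two (the sample sentence is assumed for illustration):
import nltk.tokenize as tk
import nltk.stem.snowball as sb

# Assumed sample text for illustration.
text = 'The wolves were playing on the beaches.'
stemmer = sb.SnowballStemmer('english')
# Tokenize first, then stem each token.
stems = [stemmer.stem(token) for token in tk.word_tokenize(text)]
print(stems)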
Lemmatization
Lemmatization serves a similar purpose to stemming, but its output is better suited to later manual processing: some stems are not valid words and are awkward to read. Lemmatization restores plural nouns to their singular form and verb participles to their base form.
Lemmatization-related APIs:
import nltk.stem as ns
# Get a lemmatizer object
lemmatizer = ns.WordNetLemmatizer()
# Lemmatize the word as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# Lemmatize the word as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
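WordNetLemmatizer looks words up in the WordNet corpus, which also has to be downloaded once. A minimal sketch (newer NLTK versions may additionally request the 'omw-1.4' resource):
import nltk

# One-time download of the WordNet corpus used by WordNetLemmatizer.
nltk.download('wordnet')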
Example:
import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))
Bag-of-Words Model
The meaning of a sentence depends to a large extent on how many times each word appears in it. We can therefore take every word that may appear as a feature name, treat each sentence as a sample, and use the count of each word in that sentence as the feature value. The resulting representation is called the bag-of-words model. For example:
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden

| sentence | the | brown | dog | is | running | black | in | room | forbidden |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
Bag-of-words modeling APIs:
import sklearn.feature_extraction.text as ft
# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Fit the model: every word that may appear is a feature name, each sentence
# is a sample, and the word's count in the sentence is the feature value.
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get all feature names (renamed to get_feature_names_out() in scikit-learn 1.0;
# get_feature_names() was removed in 1.2)
words = cv.get_feature_names()
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
[[0 1 1 0 0 1 0 1 1]
[2 0 1 0 1 1 1 0 2]
[0 0 0 1 1 1 1 1 1]]
words = cv.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.2
print(words)
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
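Once fitted, the same vectorizer can encode new text against the vocabulary it learned; words it has never seen are silently dropped. A minimal sketch continuing the example above (the new sentence is assumed for illustration):
# Encode a new sentence with the already-fitted vocabulary;
# the unseen word 'cat' is ignored by transform().
new_bow = cv.transform(['The black cat is running.']).toarray()
print(new_bow)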