First, be clear about what nltk is: a toolkit for processing natural language, not for analyzing it. It processes raw natural-language text into data that machine-learning frameworks can consume.
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat cardboard."
First we need a rule for splitting sentences. Suppose the rule is "a period (.) followed by a word starting with a capital letter". Then "Hello Mr. Smith" also matches the rule and would be cut in two, which is clearly wrong.
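To see the problem concretely, here is that naive rule implemented with a regular expression (a throwaway sketch, not how NLTK actually works):

```python
import re

text = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace
# and then a capital letter.
naive_sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
print(naive_sentences)
# ['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great.']
# "Mr." triggers the rule, so the first sentence is wrongly split in two.
```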
No need to worry: nltk can split a paragraph into sentences or words for us. To use the corresponding tools, we first import the required packages.
from nltk.tokenize import sent_tokenize, word_tokenize
print(sent_tokenize(example_text))
The output is in units of sentences: the paragraph is split into sentences according to the tokenizer's rules.
['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue.', 'You should not eat cardboard.']
In the same way, the paragraph can be split into words:
print(word_tokenize(example_text))
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard', '.']
for i in word_tokenize(example_text):
    print(i)
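For contrast, plain Python's str.split (no NLTK involved) only cuts on whitespace, so punctuation stays glued to the neighbouring word; this is exactly what word_tokenize handles for us:

```python
sentence = "The sky is pinkish-blue. You should not eat cardboard."

# Whitespace splitting: the periods remain attached to the last words,
# whereas word_tokenize would emit '.' as its own token.
tokens = sentence.split()
print(tokens)
# ['The', 'sky', 'is', 'pinkish-blue.', 'You', 'should', 'not', 'eat', 'cardboard.']
```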
Stop words
First, what is a stop word? The term comes from English: words such as a, the, or or that occur with very high frequency.
Chinese text is full of stop words too, for example 在, 裏面, 也, 的, 它, 爲. Because these words occur so frequently that nearly every page contains them, search-engine developers simply ignore the whole class.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_text = "This is an example showing off stop word filtration."
stop_words = set(stopwords.words("english"))
print(stop_words)
{"it's", 'being', 'him', 'own', 'above', "you'll", 'yourself', 'again', 'because', 'a', 'i', 'yours', "didn't", 've', 'his', 'only', 'hasn', 'all', 'out', 'this', 'just', 'below', 'of', 'will', 'who', 'shan', 'or', 'should', 'here', 'be', 'against', 't', 'than', 'have', 'is', 'does', "wouldn't", 'hers', 'while', 'ours', 'there', 'when', 'himself', 'hadn', 'theirs', 'your', 'doing', 'before', "shouldn't", 'more', 'over', 'both', 'if', 'so', 'themselves', 'll', 'their', 'ma', 'now', 're', 'we', "won't", 'these', 'why', "she's", 'can', 'its', 'up', 'me', 'the', 'most', 'doesn', 'd', 'herself', "needn't", 'an', 'about', 'as', 'further', 'few', "haven't", 'other', 'aren', 'between', "couldn't", 'are', 'where', 'o', "doesn't", 'at', "you've", "wasn't", 'isn', 'each', "you'd", 'yourselves', 'has', 'did', 'off', 'couldn', 'y', "hasn't", 'very', 'not', "mustn't", 'my', 'then', 'myself', "don't", 'those', 'from', 'any', 'too', 'to', 'weren', 'am', "you're", 'them', 'down', "shan't", 'into', 'nor', 'ain', 'but', 'didn', 'mightn', 'on', 'and', "aren't", 'it', 'how', "that'll", 'wouldn', 'by', 'was', 'during', 'our', 'same', 'until', 'had', 'some', 'been', 'such', 'shouldn', 'do', 'having', "hadn't", 'that', 'mustn', 'don', 'were', 'what', 'ourselves', "mightn't", 'through', 'no', 'wasn', 'needn', 'he', "weren't", 'once', 'they', 'in', "isn't", 'won', 'after', 'you', 'itself', 'which', 'she', 'm', 'her', "should've", 'with', 'haven', 'under', 'for', 's', 'whom'}
words = word_tokenize(example_text)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
The explicit loop above can be written more compactly as a list comprehension:
stop_words = set(stopwords.words("english"))
words = word_tokenize(example_text)
# filtered_sentence = []
# for w in words:
#     if w not in stop_words:
#         filtered_sentence.append(w)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)
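Note that 'This' survived the filtering above: the membership test is case-sensitive, and the stop-word list is all lower-case. A common fix is to lower-case each token before testing. The sketch below hard-codes a few stop words (a small subset of NLTK's English list) so it runs without any corpus download:

```python
# A handful of stop words, hard-coded here instead of stopwords.words("english")
stop_words = {"this", "is", "an", "off"}

words = ["This", "is", "an", "example", "showing", "off",
         "stop", "word", "filtration", "."]

# Case-sensitive test: "This" slips through because "This" != "this".
print([w for w in words if w not in stop_words])
# ['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']

# Lower-casing each token before the membership test removes it as well.
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
# ['example', 'showing', 'stop', 'word', 'filtration', '.']
```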
Stemmers
When learning English we all studied verb tenses. Sometimes we want to strip away the inflection and look at the underlying form, and that is what a stemmer is for.
# I was taking a ride in the car.
# I was riding in the car.
In these two sentences ride appears in two different forms, yet the meaning is the same: ride.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
python
python
python
python
pythonli
From the output we can see that the stemmer strips the derived forms down to the root python.
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
Try printing the output yourself; you may notice some oddities (for example, pythonly becomes pythonli rather than python).
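To build an intuition for what a stemmer does, here is a deliberately naive suffix stripper of my own (nothing like the real multi-step Porter algorithm, which among other things rewrites a final y to i — that is where pythonli comes from):

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix, longest first.

    Far cruder than Porter, but it shows the basic idea of
    reducing inflected forms to a shared root.
    """
    for suffix in ("ing", "ed", "er", "ly", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["python", "pythoner", "pythoning", "pythoned", "pythonly"]:
    print(naive_stem(w))  # every form reduces to "python"
```

Real stemmers apply many ordered rewrite rules with conditions on the remaining stem, which is why their output sometimes looks odd but stays consistent across forms.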
Part-of-speech tagging
nltk can also label every word with its part of speech. The example below loads a State of the Union address from the state_union corpus, splits it into sentences, and tags each word:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
sample_text = state_union.raw("2006-GWBush.txt")
# PunktSentenceTokenizer is an unsupervised tokenizer; here it is trained
# on the speech text itself before being used to split that same text.
custom_sent_tokenizer = PunktSentenceTokenizer(sample_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
[('And', 'CC'), ('so', 'RB'), ('we', 'PRP'), ('move', 'VBP'), ('forward', 'RB'), ('--', ':'), ('optimistic', 'JJ'), ('about', 'IN'), ('our', 'PRP$'), ('country', 'NN'), (',', ','), ('faithful', 'JJ'), ('to', 'TO'), ('its', 'PRP$'), ('cause', 'NN'), (',', ','), ('and', 'CC'), ('confident', 'NN'), ('of', 'IN'), ('the', 'DT'), ('victories', 'NNS'), ('to', 'TO'), ('come', 'VB'), ('.', '.')]
CC    coordinating conjunction
CD    cardinal number
DT    determiner
EX    existential "there"
FW    foreign word
IN    preposition or subordinating conjunction
JJ    adjective
JJR   adjective, comparative
JJS   adjective, superlative
LS    list item marker
MD    modal verb
NN    noun, singular
NNS   noun, plural
NNP   proper noun, singular
NNPS  proper noun, plural
PDT   predeterminer
POS   possessive ending
PRP   personal pronoun
PRP$  possessive pronoun
RB    adverb
RBR   adverb, comparative
RBS   adverb, superlative
RP    particle
SYM   symbol
TO    "to"
UH    interjection
VB    verb, base form
VBD   verb, past tense
VBG   gerund or present participle
VBN   verb, past participle
VBP   verb, non-3rd person singular present
VBZ   verb, 3rd person singular present
WDT   wh-determiner
WP    wh-pronoun
WP$   possessive wh-pronoun
WRB   wh-adverb
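With this tag set in hand, the (word, tag) pairs are easy to work with in plain Python. Here the tagged sentence printed earlier is hard-coded so the example runs without the NLTK corpora:

```python
from collections import Counter

# The (word, tag) pairs from the tagged sentence above, hard-coded
# (including the tagger's imperfect NN tag for 'confident').
tagged = [('And', 'CC'), ('so', 'RB'), ('we', 'PRP'), ('move', 'VBP'),
          ('forward', 'RB'), ('--', ':'), ('optimistic', 'JJ'),
          ('about', 'IN'), ('our', 'PRP$'), ('country', 'NN'),
          (',', ','), ('faithful', 'JJ'), ('to', 'TO'), ('its', 'PRP$'),
          ('cause', 'NN'), (',', ','), ('and', 'CC'), ('confident', 'NN'),
          ('of', 'IN'), ('the', 'DT'), ('victories', 'NNS'),
          ('to', 'TO'), ('come', 'VB'), ('.', '.')]

# Count how often each tag appears.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))

# Pull out just the nouns (any tag starting with 'NN').
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['country', 'cause', 'confident', 'victories']
```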