The Natural Language Processing Toolkit nltk (1)

First, let's be clear: nltk is a toolkit for processing natural language, not for analyzing it. Its job is to process natural language into data that is ready for machine-learning frameworks to consume.

example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat carboard."

First we would need to define rules for sentence splitting. If our rule were to split wherever a period (.) is immediately followed by a capitalized word, then "Hello Mr. Smith" would also match the rule and be split incorrectly.
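
To see the problem concretely, here is a minimal sketch of such a naive rule, written with a plain regular expression (this snippet is illustrative and not part of nltk):

import re

example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat cardboard."

# Naive rule: split wherever a period is followed by whitespace and a capital letter.
# "Mr. Smith" gets split in the middle, and the question mark is missed entirely.
print(re.split(r'(?<=\.)\s+(?=[A-Z])', example_text))
# ['Hello Mr.', 'Smith, how are you doing today? The weather is great and Python is awesome.',
#  'The sky is pinkish-blue.', 'You should not eat cardboard.']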

Don't worry: nltk can split a paragraph into sentences or words for us. To use the corresponding tools we need to import the relevant packages.

from nltk.tokenize import sent_tokenize, word_tokenize
print(sent_tokenize(example_text))

The output unit is the sentence; the paragraph is divided into sentences according to a set of rules.

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue.', 'You should not eat cardboard.']

In a similar way, we can split the paragraph into words:

print(word_tokenize(example_text))
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard', '.']
for i in word_tokenize(example_text):
    print(i)

Stop Words

First, let's look at what stop words are. In English you constantly run into words such as a, the, and or that are used with very high frequency; these are called stop words.

Chinese web pages are also full of stop words. In the original Chinese version of the preceding sentence, words such as "在", "裏面", "也", "的", "它", and "爲" are all stop words. Because these words appear so frequently, on practically every web page, search-engine developers simply ignore them altogether.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_text = "This is an example showing off stop word filration."

stop_words = set(stopwords.words("english"))
print(stop_words)
{"it's", 'being', 'him', 'own', 'above', "you'll", 'yourself', 'again', 'because', 'a', 'i', 'yours', "didn't", 've', 'his', 'only', 'hasn', 'all', 'out', 'this', 'just', 'below', 'of', 'will', 'who', 'shan', 'or', 'should', 'here', 'be', 'against', 't', 'than', 'have', 'is', 'does', "wouldn't", 'hers', 'while', 'ours', 'there', 'when', 'himself', 'hadn', 'theirs', 'your', 'doing', 'before', "shouldn't", 'more', 'over', 'both', 'if', 'so', 'themselves', 'll', 'their', 'ma', 'now', 're', 'we', "won't", 'these', 'why', "she's", 'can', 'its', 'up', 'me', 'the', 'most', 'doesn', 'd', 'herself', "needn't", 'an', 'about', 'as', 'further', 'few', "haven't", 'other', 'aren', 'between', "couldn't", 'are', 'where', 'o', "doesn't", 'at', "you've", "wasn't", 'isn', 'each', "you'd", 'yourselves', 'has', 'did', 'off', 'couldn', 'y', "hasn't", 'very', 'not', "mustn't", 'my', 'then', 'myself', "don't", 'those', 'from','any', 'too', 'to', 'weren', 'am', "you're", 'them', 'down', "shan't", 'into', 'nor', 'ain', 'but', 'didn', 'mightn', 'on', 'and',"aren't", 'it', 'how', "that'll", 'wouldn', 'by', 'was', 'during', 'our', 'same', 'until', 'had', 'some', 'been', 'such', 'shouldn', 'do', 'having', "hadn't", 'that', 'mustn', 'don', 'were', 'what', 'ourselves', "mightn't", 'through', 'no', 'wasn', 'needn', 'he', "weren't", 'once', 'they', 'in', "isn't", 'won', 'after', 'you', 'itself', 'which', 'she', 'm', 'her', "should've", 'with', 'haven', 'under', 'for', 's', 'whom'}
words = word_tokenize(example_text)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
# The same filtering can be written as a one-line list comprehension:
stop_words = set(stopwords.words("english"))
# print(stop_words)
words = word_tokenize(example_text)
filtered_sentence = []
# for w in words:
#     if w not in stop_words:
#         filtered_sentence.append(w)

filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)
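
One detail worth noticing in the output above: 'This' survives the filter because every entry in stopwords.words("english") is lowercase. A common fix, sketched here (it is not in the original code), is to lowercase each token before the comparison:

# Compare lowercased tokens against the all-lowercase stopword list.
filtered_sentence = [w for w in words if w.lower() not in stop_words]
print(filtered_sentence)
# ['example', 'showing', 'stop', 'word', 'filtration', '.']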

The Stemmer

Back when we learned English, we all studied verb tenses. Sometimes we need to strip away those inflections and look at the underlying form, and that is exactly what a stem is for.

# I was taking a ride in the car.
# I was riding in the car.

In these two sentences, ride appears in two different forms, but in both cases the meaning is ride.
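
As a quick sketch (this snippet is not in the original), we can run nltk's Porter stemmer over those two sentences and watch both forms collapse to the same stem:

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for sentence in ["I was taking a ride in the car.",
                 "I was riding in the car."]:
    # Both 'ride' and 'riding' should reduce to the stem 'ride'.
    print([ps.stem(w) for w in word_tokenize(sentence)])
# Expected output (roughly; note the stemmer also turns 'was' into 'wa'):
# ['i', 'wa', 'take', 'a', 'ride', 'in', 'the', 'car', '.']
# ['i', 'wa', 'ride', 'in', 'the', 'car', '.']

The example below does the same with a family of made-up python words.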

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

for w in example_words:
    print(ps.stem(w))
python
python
python
python
pythonli

From the output we can see that the stemmer strips away the different endings and keeps the stem python (although note the last word: pythonly becomes pythonli rather than python).

new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

Run it yourself and inspect the output; there seem to be a few oddities in it, and you can try to spot them on your own.

Part-of-Speech Tagging

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
sample_text = state_union.raw("2006-GWBush.txt")

# PunktSentenceTokenizer is an unsupervised, trainable sentence tokenizer;
# here it is trained on the speech itself before splitting it into sentences.
custom_sent_tokenizer = PunktSentenceTokenizer(sample_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)


def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            # Tag each token with its part of speech.
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))


process_content()
[(u'And', 'CC'), (u'so', 'RB'), (u'we', 'PRP'), (u'move', 'VBP'), (u'forward', 'RB'), (u'--', ':'), (u'optimistic', 'JJ'), (u'about', 'IN'), (u'our', 'PRP$'), (u'country', 'NN'), (u',', ','), (u'faithful', 'JJ'), (u'to', 'TO'), (u'its', 'PRP$'), (u'cause', 'NN'), (u',', ','), (u'and', 'CC'), (u'confident', 'NN'), (u'of', 'IN'), (u'the', 'DT'), (u'victories', 'NNS'), (u'to', 'TO'), (u'come', 'VB'), (u'.', '.')]
CC   coordinating conjunction        NNS  noun, plural                  UH   interjection
CD   cardinal number                 NNP  proper noun, singular         VB   verb, base form
DT   determiner                      NNPS proper noun, plural           VBD  verb, past tense
EX   existential "there"             PDT  predeterminer                 VBG  gerund or present participle
FW   foreign word                    POS  possessive ending             VBN  verb, past participle
IN   preposition/subord. conj.       PRP  personal pronoun              VBP  verb, non-3rd person sing. present
JJ   adjective                       PRP$ possessive pronoun            VBZ  verb, 3rd person sing. present
JJR  adjective, comparative          RB   adverb                        WDT  wh-determiner
JJS  adjective, superlative          RBR  adverb, comparative           WP   wh-pronoun
LS   list item marker                RBS  adverb, superlative           WP$  possessive wh-pronoun
MD   modal                           RP   particle                      WRB  wh-adverb
NN   noun, singular                  SYM  symbol                        TO   "to"
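
Rather than keeping this table at hand, you can also ask nltk itself: nltk.help.upenn_tagset prints the documentation for any tag or tag pattern (on a fresh installation you may first need to download the tagsets data):

import nltk

# Downloads the tag documentation if it is missing:
# nltk.download('tagsets')
nltk.help.upenn_tagset('NN')    # definition and examples for the NN tag
nltk.help.upenn_tagset('VB.*')  # a regular expression matches a whole tag family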
