NLTK Usage Summary

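  1. nltk.tokenize.punkt
    The PunktSentenceTokenizer class in this module splits text into sentences. It does not strip punctuation: characters such as parentheses are kept in the resulting sentences.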
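import nltk.data

text = '''
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.  And sometimes sentences
can start with non-capitalized words.  i is a good variable
name.
'''

# Load the pre-trained English Punkt model.
# The Punkt data must be installed first, for example via nltk.download('punkt').
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Print each detected sentence, separated by a line of dashes.
print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
'''
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
'''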
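
To check the claim that punctuation such as parentheses is kept, here is a minimal sketch. It makes one assumption that differs from the example above: it uses an untrained PunktSentenceTokenizer() instead of the pre-trained English model, so it runs without downloading any data. An untrained tokenizer knows no abbreviations, so it would likely split after "Mr."; the sample sentence here is made up and avoids abbreviations on purpose.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical sample text containing parentheses and no abbreviations.
text = ("Punkt keeps punctuation (like these brackets) in its output. "
        "It only decides where sentences end.")

# Untrained tokenizer with default parameters (assumption: no model download).
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

# Show each sentence exactly as the tokenizer returned it.
for sentence in sentences:
    print(repr(sentence))
```

Each returned sentence is a slice of the input, so the parentheses should appear unchanged in the first one. For real text with abbreviations and initials, the pre-trained model loaded with nltk.data.load is the better choice.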