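The text readability formulas in the readability package were originally developed for English, so it currently supports English text only.
Documentation: https://pypi.org/project/readability/
Installation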
pip install readability
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting readability
Downloading https://mirrors.aliyun.com/pypi/packages/26/70/6f8750066255d4d2b82b813dd2550e0bd2bee99d026d14088a7b977cd0fc/readability-0.3.1.tar.gz (34 kB)
Building wheels for collected packages: readability
Building wheel for readability (setup.py) ... done
Created wheel for readability: filename=readability-0.3.1-py3-none-any.whl size=35459 sha256=e920a8d6510bd1211df79a944ff03c94f2fea220ae4e5f430e930a52d75595ee
Stored in directory: /Users/thunderhit/Library/Caches/pip/wheels/90/29/a7/726a69748065b8c306b4a935ac2c57e9bc492cb23f355c8e03
Successfully built readability
Installing collected packages: readability
Successfully installed readability-0.3.1
Quick start
import readability
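# Input must be pre-tokenized: tokens separated by spaces, one sentence per line.
text = ('This is an example sentence .\n'
        'Note that tokens are separated by spaces and sentences by newlines .\n')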
results = readability.getmeasures(text, lang='en')
results
OrderedDict([('readability grades',
OrderedDict([('Kincaid', 7.442500000000003),
('ARI', 5.825624999999999),
('Coleman-Liau', 9.532550312500003),
('FleschReadingEase', 55.95250000000002),
('GunningFogIndex', 10.700000000000001),
('LIX', 39.25),
('SMOGIndex', 9.70820393249937),
('RIX', 2.5),
('DaleChallIndex', 9.954550000000001)])),
('sentence info',
OrderedDict([('characters_per_word', 4.9375),
('syll_per_word', 1.6875),
('words_per_sentence', 8.0),
('sentences_per_paragraph', 2.0),
('type_token_ratio', 0.9375),
('characters', 79),
('syllables', 27),
('words', 16),
('wordtypes', 15),
('sentences', 2),
('paragraphs', 1),
('long_words', 5),
('complex_words', 3),
('complex_words_dc', 6)])),
('word usage',
OrderedDict([('tobeverb', 2),
('auxverb', 0),
('conjunction', 1),
('pronoun', 2),
('preposition', 2),
('nominalization', 1)])),
('sentence beginnings',
OrderedDict([('pronoun', 1),
('interrogative', 0),
('article', 0),
('subordination', 0),
('conjunction', 0),
('preposition', 0)]))])
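The returned result contains four groups:
readability grades: the readability scores
sentence info: sentence-level statistics
word usage: counts of word categories
sentence beginnings: counts of how sentences start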
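The sample string itself states the expected input format: tokens separated by spaces and sentences separated by newlines. Raw text usually does not arrive in that shape, so here is a minimal sketch of a preprocessing step. The helper name `to_readability_input` is hypothetical and not part of the readability package, the regex-based sentence splitting is deliberately naive (it will break on abbreviations such as "e.g."), and emitting punctuation as separate tokens is an assumption.

```python
import re


def to_readability_input(raw_text):
    """Return raw_text as one sentence per line with space-separated tokens.

    Hypothetical helper for illustration; a real tokenizer handles
    abbreviations, quotes, and numbers far better than these regexes.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    lines = []
    for sentence in sentences:
        # A token is a word (optionally with inner hyphens or apostrophes)
        # or a single punctuation character.
        tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)
        if tokens:
            lines.append(' '.join(tokens))
    return '\n'.join(lines)


raw = ("This is an example sentence. Note that tokens are separated "
       "by spaces and sentences by newlines.")
prepared = to_readability_input(raw)
print(prepared)
```

The resulting string can then be passed as the first argument to `readability.getmeasures(prepared, lang='en')`, the same call used in the quick start.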
Readability grades
results['readability grades']
OrderedDict([('Kincaid', 7.442500000000003),
('ARI', 5.825624999999999),
('Coleman-Liau', 9.532550312500003),
('FleschReadingEase', 55.95250000000002),
('GunningFogIndex', 10.700000000000001),
('LIX', 39.25),
('SMOGIndex', 9.70820393249937),
('RIX', 2.5),
('DaleChallIndex', 9.954550000000001)])
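Several of these grades are simple functions of the averages reported under 'sentence info'. As a rough check, the sketch below plugs the values from the output above into the standard published formulas for the Flesch-Kincaid grade, Flesch Reading Ease, and the Automated Readability Index. The function names are my own, and whether the package computes the scores in exactly this way is an assumption, although the numbers agree with the output shown.

```python
# Averages copied from results['sentence info'] in the output above.
words_per_sentence = 8.0
syll_per_word = 1.6875
characters_per_word = 4.9375


def flesch_kincaid_grade(wps, spw):
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * wps + 11.8 * spw - 15.59


def flesch_reading_ease(wps, spw):
    """Standard Flesch Reading Ease formula (higher means easier)."""
    return 206.835 - 1.015 * wps - 84.6 * spw


def automated_readability_index(cpw, wps):
    """Standard ARI formula, based on characters rather than syllables."""
    return 4.71 * cpw + 0.5 * wps - 21.43


kincaid = flesch_kincaid_grade(words_per_sentence, syll_per_word)
reading_ease = flesch_reading_ease(words_per_sentence, syll_per_word)
ari = automated_readability_index(characters_per_word, words_per_sentence)

print(f"Kincaid: {kincaid:.2f}")                  # about 7.44
print(f"FleschReadingEase: {reading_ease:.2f}")   # about 55.95
print(f"ARI: {ari:.2f}")                          # about 5.83
```

Because all three depend only on sentence length and word length, longer sentences or longer words push the grade levels up and the reading-ease score down.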
The Kincaid readability grade
results['readability grades']['Kincaid']
7.442500000000003
In the same way, every other measure can be retrieved by dictionary key.
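When many texts are scored, a flat mapping is often easier to work with than the nested one. The sketch below uses an abbreviated copy of the nested result shown earlier (only a few entries per group) and a hypothetical helper, `flatten_measures`, that is not part of the readability package. Keys are prefixed with the group name because names such as 'pronoun' and 'preposition' appear in both 'word usage' and 'sentence beginnings'.

```python
from collections import OrderedDict

# Abbreviated copy of the nested structure from the quick-start output.
results = OrderedDict([
    ('readability grades', OrderedDict([('Kincaid', 7.442500000000003),
                                        ('LIX', 39.25),
                                        ('RIX', 2.5)])),
    ('sentence info', OrderedDict([('words', 16),
                                   ('sentences', 2)])),
    ('word usage', OrderedDict([('pronoun', 2),
                                ('preposition', 2)])),
    ('sentence beginnings', OrderedDict([('pronoun', 1),
                                         ('preposition', 0)])),
])


def flatten_measures(nested):
    """Flatten {group: {name: value}} into {'group/name': value}."""
    flat = OrderedDict()
    for group, measures in nested.items():
        for name, value in measures.items():
            # Prefix with the group so duplicate names do not collide.
            flat[f'{group}/{name}'] = value
    return flat


flat = flatten_measures(results)

# Nested access and flat access return the same value.
print(results['sentence info']['words'])
print(flat['sentence info/words'])
print(flat['word usage/pronoun'], flat['sentence beginnings/pronoun'])
```

A flat dictionary like this maps directly onto one row of a table, with one column per measure.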