Knowledge points covered
1. Scraping the data
2. Paginated crawling
Pattern analysis
1. Scraping the data: inspecting the page shows that each result item is a <div> carrying a data-tools attribute (see the sketch after this list)
2. Pagination: results are paged through the pn query parameter, which advances in steps of 10 per page
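For reference, the value of each data-tools attribute is a JSON string holding the result's title (and, on real pages, its landing URL), which is exactly what the parser in the code below pulls out. A minimal sketch against hand-written markup; the sample HTML here is illustrative, not captured from Baidu:

import json
import re
from bs4 import BeautifulSoup

# Illustrative markup only; a real Baidu results page is far larger.
sample = """
<div class="result" data-tools='{"title": "Example result", "url": "http://example.com"}'>...</div>
"""

soup = BeautifulSoup(sample, "html.parser")
for div in soup.find_all('div', {'data-tools': re.compile('title')}):
    d = json.loads(div.attrs['data-tools'])  # the attribute value is a JSON string
    print(d['title'])                        # -> Example result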
Code
import requests
from bs4 import BeautifulSoup
import re
import json
import jieba

words_all = []  # titles collected across all pages

# Fetch the raw HTML of one Baidu results page.
def getKeywordResult(keyword, pagenum):
    url = 'http://www.baidu.com/s?wd=' + keyword + '&pn=' + pagenum + '0'
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception:
        return ""

# Parse the page and extract the title of each result item.
def parserLinks(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all('div', {'data-tools': re.compile('title')}):
        data = div.attrs['data-tools']  # the attribute value is a JSON string
        d = json.loads(data)
        links.append(d['title'])
        words_all.append(d['title'])
    return links, words_all

# Word-frequency statistics over all collected titles.
def words_ratio(words_all):
    words = []
    for title in words_all:
        words.extend(jieba.lcut(title))   # segment each title into words
    counts = {}
    for word in words:
        if len(word) == 1:                # skip single-character tokens
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(min(30, len(items))):  # print at most the top 30 words
        word, count = items[i]
        print("{0:<10}{1:>5} ratio: {2:.4f}".format(word, count, count / len(words)))

def main():
    for pagenum in range(0, 50):
        html = getKeywordResult('老張', str(pagenum))  # search keyword and page number
        ls, _ = parserLinks(html)
        count = pagenum + 1
        for title in ls:
            print("[{:^3}]{}".format(count, title))
    words_ratio(words_all)

if __name__ == '__main__':
    main()
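Two design choices above are worth noting: the pn query parameter is built by appending '0' to the page-number string, so page n maps to an offset of n*10, matching Baidu's ten results per page; and words_all is a module-level list that parserLinks appends to on every page, which is why words_ratio can be called once after the page loop.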
Results
Further thoughts
The code itself is simple; the real skill lies in knowing how to extend it. All the data has now been scraped, but it is very messy and still needs manual analysis and comparison. I call data in this state raw data; the ideal is data that is readable and interlinked, which I call gold data.
This transformation and analysis involves two questions:
1. How do we make the data readable?
Bad entries can be deleted from the counts dictionary with Python's del statement (see the first sketch after this list).
2. How do we make the data interlinked?
Run a second pass of analysis over the raw data, group related terms together, and then re-run the statistics (see the second sketch below).
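For the first question, a minimal sketch, assuming the frequencies live in a counts dictionary as in words_ratio above; the stop list here is hypothetical and would in practice be built by eyeballing the output:

# Hypothetical noise words observed in the output (illustrative only).
stopwords = ['百度', '視頻', '圖片']

for word in stopwords:
    if word in counts:
        del counts[word]  # the del statement drops a bad key from the dict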
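For the second question, one possible sketch: fold related terms into a canonical key with an alias table before re-ranking. The alias table below is purely illustrative, and merge_counts is a hypothetical helper, not part of the code above:

# Hypothetical alias table mapping related terms to one canonical key.
aliases = {'张三': '老張', '張先生': '老張'}

def merge_counts(counts, aliases):
    # Second pass over the raw counts: group related terms, then re-rank.
    merged = {}
    for word, n in counts.items():
        key = aliases.get(word, word)  # map each term to its canonical key
        merged[key] = merged.get(key, 0) + n
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)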