The target site is base_url = https://www.gushiwen.org/
The goal is to crawl by the categories listed in the page's right-hand sidebar, for example "spring", "summer", "love", and "patriotism".
Taking the "love" (愛情) category as an example, clicking it leads to https://so.gushiwen.org/gushi/aiqing.aspx
Press Ctrl+Shift+I to inspect the HTML structure: every poem link on the category page sits under a <div class="typecont"> element, so the links are easy to extract with XPath.
Prepending base_url to the value of each <a href> attribute yields the page for that poem.
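As a standalone sketch of that extraction step (the HTML fragment below is a hand-written assumption about the real markup, not a copy of the live page):

```python
from lxml import etree

# Simplified stand-in for the category page's markup (assumed structure).
sample_html = """
<html><body>
  <div class="typecont">
    <span><a href="/shiwenv_abc.aspx">Poem A</a></span>
    <span><a href="/shiwenv_def.aspx">Poem B</a></span>
  </div>
</body></html>
"""

selector = etree.HTML(sample_html)
# Grab every href attribute under the typecont container.
links = selector.xpath("//div[@class='typecont']//@href")
print(links)  # → ['/shiwenv_abc.aspx', '/shiwenv_def.aspx']
```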
The dynasty, poet, and body text are extracted the same way.
One caveat when extracting the body: some poems are laid out as <div>text</div>, while others are <div> <p>text</p> </div>.
To handle both, first select the parent element, then call .xpath("string(.)") on it to flatten all descendant text. The full code:
import csv
import os

import requests
from lxml import etree


class Spider:
    def __init__(self, start_url):
        self.start_url = start_url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'  # noqa
        }
        self.base_url = 'https://so.gushiwen.org'

    def crawl_title(self):
        html = requests.get(self.start_url, headers=self.headers).content
        selector = etree.HTML(html)
        # Every poem link on the category page sits under <div class="typecont">.
        poetry_links = selector.xpath("//div[@class='typecont']//@href")

        file_path = os.path.split(os.path.realpath(__file__))[0] + os.sep + "aiqing.csv"
        csvfile = open(file_path, "a+", encoding='utf-8', newline='')
        writer = csv.writer(csvfile)
        for link in poetry_links:
            url = self.base_url + link
            res = requests.get(url, headers=self.headers).content
            selector = etree.HTML(res)
            # Title
            title = selector.xpath("//div[@class='cont']//h1/text()")[0]
            # Dynasty and author are the two <a> tags inside p.source.
            dynasty_str = selector.xpath("//div[@class='cont']//p[@class='source']/a/text()")
            dynasty = dynasty_str[0]
            author = dynasty_str[1]
            # The poem body may be bare text or wrapped in <p> tags, so select
            # the containing div first, then flatten it with string(.).
            c = selector.xpath("//div[@class='sons'][1]//div[@class='contson']")[0]
            content = c.xpath("string(.)").strip()
            writer.writerow([author, dynasty, title, content, "愛情"])
        csvfile.close()

    def start(self):
        self.crawl_title()


if __name__ == '__main__':
    start_url = 'https://so.gushiwen.org/gushi/aiqing.aspx'
    pp = Spider(start_url)
    pp.start()
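The string(.) handling can be shown in isolation: whether the text sits directly inside the <div> or inside a nested <p>, string(.) returns the concatenated descendant text either way. The two fragments below are hand-written assumptions about the two layouts described above:

```python
from lxml import etree

# Layout 1: text directly inside the div.
flat = etree.HTML("<div class='contson'> plain text</div>")
# Layout 2: text wrapped in a <p> inside the div.
nested = etree.HTML("<div class='contson'> <p>plain text</p> </div>")

for root in (flat, nested):
    div = root.xpath("//div[@class='contson']")[0]
    # string(.) concatenates every text node under the element.
    print(div.xpath("string(.)").strip())  # → plain text
```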
Running the script generates the aiqing.csv file.
To crawl a different category, just change start_url and the output file name.
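One way to avoid editing the script by hand for every category is to map each CSV label to the URL slug seen in the category address and build start_url from it. In the sketch below, only the aiqing slug is confirmed by the page above; the other slug and the make_url helper are illustrative assumptions to verify against the site:

```python
# Hypothetical mapping from category label to its URL slug.
CATEGORIES = {
    "愛情": "aiqing",    # confirmed above
    "春天": "chuntian",  # assumed slug, check against the sidebar links
}

def make_url(slug):
    # Build the category page URL from its slug.
    return "https://so.gushiwen.org/gushi/%s.aspx" % slug

for label, slug in CATEGORIES.items():
    print(label, make_url(slug))
```

Each (label, start_url) pair could then drive one crawl run, with the label reused as both the CSV file name and the category column.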