Introduction to XPath
XPath (XML Path Language) is a language for locating information in XML and HTML documents.
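XPath development tools
1. Chrome extension: XPath Helper
Click the three-dot menu at the right of the browser, then More tools, then Extensions, and search the Chrome Web Store for the extension. (Access may require a VPN, and installation may not succeed on the first attempt; retry if it fails.)
2. Firefox add-on: Try XPath
Click the three-line menu at the right of the browser, then Add-ons, search for the add-on, and add it.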
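XPath syntax
- Predicates
/bookstore/book[1] selects the first book child element of bookstore
/bookstore/book[last()] selects the last book child element of bookstore
/bookstore/book[position()<3] selects the first two book child elements of bookstore
//book[@price=10] selects every book element whose price attribute equals 10
- Wildcards
/bookstore/* selects all child elements of bookstore
//book[@*] selects every book element that has at least one attribute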
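The expressions above can be tried out directly with lxml. The following is a minimal sketch; the bookstore document and the titles helper are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = """
<bookstore>
  <book price="10"><title>A</title></book>
  <book price="20"><title>B</title></book>
  <book><title>C</title></book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(expr):
    """Return the title text of every book element matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")             # first book
last = titles("/bookstore/book[last()]")         # last book
first_two = titles("/bookstore/book[position()<3]")
price_10 = titles("//book[@price=10]")           # price attribute equals 10
with_attrs = titles("//book[@*]")                # books with any attribute
children = [child.tag for child in root.xpath("/bookstore/*")]

print(first, last, first_two, price_10, with_attrs, children)
```

Note that XPath positions start at 1, not 0.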
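The lxml parsing library
lxml is implemented in C: it is a Python binding for the libxml2 and libxslt C libraries.
1. To parse an HTML string, use lxml.etree.HTML: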
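from lxml import etree

# Placeholder input; replace it with your own HTML source string
text = "<div><p>hello</p></div>"
htmlElement = etree.HTML(text)
# tostring() returns bytes, so decode them before printing
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))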
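As a small illustration of what etree.HTML does with incomplete markup, the sketch below feeds it a fragment with unclosed p tags. The fragment is a made-up example; the point is that the parser wraps it in html and body elements and closes the open tags.

```python
from lxml import etree

# Hypothetical, deliberately sloppy fragment: no html/body, unclosed <p> tags.
text = "<div><p>hello<p>world</div>"

root = etree.HTML(text)          # returns the root element of the repaired tree
root_tag = root.tag              # the parser adds the missing <html> wrapper
paragraphs = root.xpath("//p/text()")

print(root_tag, paragraphs)
print(etree.tostring(root, encoding="utf-8").decode("utf-8"))
```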
2. To parse an HTML file, use lxml.etree.parse:
from lxml import etree

htmlElement = etree.parse("tencent.html")
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
parse uses the XML parser by default, so it raises a parse error when the HTML is not well-formed. In that case, create an HTML parser yourself and pass it in:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html", parser=parser)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
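The difference between the two parsers can be seen without a file on disk. This is a minimal sketch that assumes a made-up byte string in place of tencent.html and reads it through io.BytesIO.

```python
import io
from lxml import etree

# Hypothetical input: valid enough for browsers, but not well-formed XML.
sloppy = b"<html><body><p>one<br><p>two</body></html>"

# The default XML parser rejects the unclosed <br> and <p> tags.
try:
    etree.parse(io.BytesIO(sloppy))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# An explicit HTML parser tolerates and repairs the same input.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(sloppy), parser=parser)
texts = tree.xpath("//p/text()")

print(xml_ok, texts)
```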
Using lxml together with XPath
1. xpath() returns a list
Get all tr tags:
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html", parser=parser)
trs = htmlElement.xpath("//tr")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
2. Get the second tr tag
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html", parser=parser)
# xpath() returns a list even for a single match; [0] takes the first element
tr = htmlElement.xpath("//tr[2]")[0]
print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
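One detail worth knowing about //tr[2]: the predicate is applied per parent, so it matches the second tr inside each parent element, not the second tr of the whole document. The sketch below uses a made-up two-table fragment to show the difference from (//tr)[2].

```python
from lxml import etree

# Hypothetical fragment with two tables of two rows each.
html = etree.HTML("""
<table><tr><td>a1</td></tr><tr><td>a2</td></tr></table>
<table><tr><td>b1</td></tr><tr><td>b2</td></tr></table>
""")

# Second tr of every parent: one match per table.
per_parent = html.xpath("//tr[2]/td/text()")

# Second tr in document order: the parentheses apply [2] to the whole result.
overall = html.xpath("(//tr)[2]/td/text()")

print(per_parent, overall)
```

With a single table the two expressions return the same row, which is why the [0] index in the snippet above is enough there.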
3. Get all tr tags whose class attribute equals even
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html", parser=parser)
trs = htmlElement.xpath("//tr[@class='even']")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
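Note that @class='even' compares the whole attribute value, so a row with class="even highlight" would not match. The sketch below, on a made-up table, contrasts the exact comparison with a common XPath 1.0 idiom for matching a single class token.

```python
from lxml import etree

# Hypothetical table; the third row carries two classes.
html = etree.HTML("""
<table>
  <tr class="even"><td>r1</td></tr>
  <tr class="odd"><td>r2</td></tr>
  <tr class="even highlight"><td>r3</td></tr>
</table>
""")

# Exact string comparison against the full attribute value.
exact = html.xpath("//tr[@class='even']/td/text()")

# Token match: pad the class list with spaces and look for ' even '.
token_expr = (
    "//tr[contains(concat(' ', normalize-space(@class), ' '), ' even ')]"
    "/td/text()"
)
token = html.xpath(token_expr)

print(exact, token)
```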
4. Get all job postings (plain text)
from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html", parser=parser)
# position()>1 skips the first tr (presumably the table header row)
trs = htmlElement.xpath("//tr[position()>1]")
positions = []
for tr in trs:
    # The leading . makes the path relative, so only a tags under this tr match
    href = tr.xpath(".//a/@href")[0]
    fullurl = 'http://hr.tencent.com/' + href
    title = tr.xpath("./td[1]//text()")[0]
    category = tr.xpath("./td[2]/text()")[0]
    nums = tr.xpath("./td[3]//text()")[0]
    address = tr.xpath("./td[4]/text()")[0]
    pubtime = tr.xpath("./td[5]/text()")[0]
    position = {
        'url': fullurl,
        'title': title,
        'category': category,
        'nums': nums,
        'address': address,
        'pubtime': pubtime,
    }
    positions.append(position)
print(positions)
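The leading dot in .//a matters when xpath() is called on an element rather than on the whole document. The following sketch, using a made-up two-row table, shows that //a still searches the entire document even when called on a single tr.

```python
from lxml import etree

# Hypothetical table with one link per row.
html = etree.HTML("""
<table>
  <tr><td><a href="a.html">A</a></td></tr>
  <tr><td><a href="b.html">B</a></td></tr>
</table>
""")

second_tr = html.xpath("//tr")[1]

# Relative path: only links inside this tr.
relative = second_tr.xpath(".//a/@href")

# Absolute path: every link in the document, regardless of the context node.
absolute = second_tr.xpath("//a/@href")

print(relative, absolute)
```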
Douban movie crawler (now playing)
import requests
from lxml import etree

# 1. Fetch the page from the target site
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    'Referer': 'https://www.google.com/_/chrome/newtab?ie=UTF-8'
}
url = 'https://movie.douban.com/cinema/nowplaying/changsha/'
response = requests.get(url, headers=headers)
# print(response.text)
text = response.text

# 2. Extract the data from the fetched page according to fixed rules
html = etree.HTML(text)
# Both the "now playing" and "coming soon" modules use a ul with class lists;
# the first one holds the "now playing" content
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li")  # all li children of the ul
movies = []
for li in lis:
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    photo = li.xpath(".//img/@src")[0]
    movie = {
        'title': title,
        'score': score,
        'photo': photo
    }
    movies.append(movie)
print(movies)
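Indexing with [0] raises IndexError whenever an expression matches nothing, for example if an li lacks a data-score attribute. One possible way to make the loop more tolerant is a small helper that returns a default instead. The sketch below is only an illustration: the first helper and the sample markup are hypothetical, and the markup merely imitates the attributes the crawler reads rather than the real page.

```python
from lxml import etree


def first(node, expr, default=None):
    """Return the first XPath match on node, or default when nothing matches."""
    matches = node.xpath(expr)
    return matches[0] if matches else default


# Hypothetical fragment shaped like the list items the crawler expects;
# the second item deliberately has no data-score attribute.
sample = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1"><img src="a.jpg"></li>
  <li data-title="Movie B"><img src="b.jpg"></li>
</ul>
"""

html = etree.HTML(sample)
movies = []
for li in html.xpath("//ul[@class='lists']/li"):
    movies.append({
        "title": first(li, "@data-title"),
        "score": first(li, "@data-score", default="n/a"),
        "photo": first(li, ".//img/@src"),
    })

print(movies)
```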