Python Study Notes 6: Data Parsing

Introduction to XPath

XPath (XML Path Language) is a language for locating information in XML and HTML documents.

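XPath development tools

1. Chrome extension XPath Helper
Click the three-dot menu at the top right of the browser > More tools > Extensions, then search for the extension in the Chrome Web Store (reaching the store may require a proxy or VPN, and adding the extension can fail the first time, so retry if needed)
2. Firefox add-on Try XPath
Click the three-line menu at the top right of the browser > Add-ons, search for the add-on, then add it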

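XPath syntax

  • Predicates
    /bookstore/book[1] selects the first book child element of bookstore
    /bookstore/book[last()] selects the last book child element of bookstore
    /bookstore/book[position()<3] selects the first two book child elements of bookstore
    //book[@price=10] selects every book element whose price attribute equals 10
  • Wildcards
    /bookstore/* selects all child elements of bookstore
    //book[@*] selects every book element that has at least one attribute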
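
The expressions above can be checked against a small document. The following is a minimal sketch using lxml; the bookstore XML and the titles helper are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample document, invented for this example.
xml = """
<bookstore>
  <book price="10"><title>A</title></book>
  <book price="20"><title>B</title></book>
  <book><title>C</title></book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(expr):
    """Return the <title> text of every book matched by the XPath expression."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")               # first book
last = titles("/bookstore/book[last()]")           # last book
first_two = titles("/bookstore/book[position()<3]")  # first two books
price_10 = titles("//book[@price=10]")             # books with price 10
with_attrs = titles("//book[@*]")                  # books with any attribute
children = [el.tag for el in root.xpath("/bookstore/*")]  # all children

print(first, last, first_two, price_10, with_attrs, children)
```

Note that XPath positions start at 1, not 0, so book[1] is the first element.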

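lxml parsing library

lxml is a Python binding for the C libraries libxml2 and libxslt, which is why it is fast.
1. To parse an HTML string, use lxml.etree.HTML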

from lxml import etree
text = "<html><body><p>hello</p></body></html>"  # placeholder; use the real page source here
htmlElement = etree.HTML(text)
print(etree.tostring(htmlElement,encoding='utf-8').decode('utf-8'))
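
One useful property of etree.HTML is that it repairs incomplete markup. The sketch below feeds it a deliberately broken fragment (invented for this example) and inspects the result; the exact serialized output may vary slightly between libxml2 versions, so only the structure is checked.

```python
from lxml import etree

# Deliberately incomplete HTML: no <html>/<body>, and the <p> tags are never closed.
text = "<div><p>hello<p>world</div>"

root = etree.HTML(text)

root_tag = root.tag                      # the parser wraps the fragment in <html>
has_body = root.find("body") is not None  # ... and adds a <body>
paragraphs = root.xpath("//p/text()")     # both paragraphs are recovered

print(etree.tostring(root, encoding="utf-8").decode("utf-8"))
```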

2. To parse an HTML file, use lxml.etree.parse

from lxml import etree
htmlElement = etree.parse("tencent.html")
print(etree.tostring(htmlElement,encoding='utf-8').decode('utf-8'))

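The parse function uses an XML parser by default, so it raises an error when it meets HTML that is not well-formed. In that case, create an HTML parser yourself and pass it in.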

from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html",parser=parser)
print(etree.tostring(htmlElement,encoding='utf-8').decode('utf-8'))
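
The difference between the two parsers can be seen without a file on disk. This is a small sketch that assumes an in-memory byte string (wrapped in io.BytesIO) in place of tencent.html; the broken markup is made up for the demonstration.

```python
import io
from lxml import etree

# HTML that is valid for browsers but not well-formed XML: <br> and <p> are never closed.
broken = b"<html><body><p>unclosed paragraph<br></body></html>"

# Default XML parser: expected to reject the document.
try:
    etree.parse(io.BytesIO(broken))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# Explicit HTML parser: tolerates and repairs the same input.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(broken), parser=parser)
texts = tree.xpath("//p/text()")

print(xml_ok, texts)
```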

Using lxml together with XPath

1. xpath() returns a list
Get all tr tags

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html",parser=parser)
trs=htmlElement.xpath("//tr")
for tr in trs:
    print(etree.tostring(tr,encoding='utf-8').decode('utf-8'))

2. Get the second tr tag

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html",parser=parser)
tr=htmlElement.xpath("//tr[2]")[0] # xpath() returns a list even for a single match; [0] takes the first element out of it
print(etree.tostring(tr,encoding='utf-8').decode('utf-8'))
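
One subtlety worth knowing: //tr[2] means "every tr that is the second tr child of its parent", so a page with several tables can return several matches. To get the second tr in the whole document, wrap the path in parentheses. The two-table HTML below is a made-up example, not the Tencent page.

```python
from lxml import etree

# Hypothetical page with two tables of two rows each.
html = etree.HTML("""
<table id="a"><tr><td>a1</td></tr><tr><td>a2</td></tr></table>
<table id="b"><tr><td>b1</td></tr><tr><td>b2</td></tr></table>
""")

# Second tr within each parent: one match per table.
per_parent = html.xpath("//tr[2]/td/text()")

# Second tr in document order: a single match.
overall = html.xpath("(//tr)[2]/td/text()")

print(per_parent, overall)
```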

3. Get all tr tags whose class attribute equals even

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html",parser=parser)
trs=htmlElement.xpath("//tr[@class='even']") # a list of every tr whose class attribute is exactly 'even'
for tr in trs:
    print(etree.tostring(tr,encoding='utf-8').decode('utf-8'))
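
Keep in mind that @class='even' compares the whole attribute string, so an element with class="even highlight" does not match. A common XPath 1.0 idiom for matching a single class token is sketched below; the table is invented for illustration.

```python
from lxml import etree

# Hypothetical rows: the second one carries two classes.
html = etree.HTML("""
<table>
  <tr class="even"><td>1</td></tr>
  <tr class="even highlight"><td>2</td></tr>
  <tr class="odd"><td>3</td></tr>
</table>
""")

# Exact string comparison: only class="even" matches.
exact = html.xpath("//tr[@class='even']/td/text()")

# Token match: pad the class list with spaces and look for ' even '.
token = html.xpath(
    "//tr[contains(concat(' ', normalize-space(@class), ' '), ' even ')]/td/text()"
)

print(exact, token)
```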

4. Get all job postings (plain text)

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("tencent.html",parser=parser)
positions = []  # collected job postings
# position()>1 skips the first tr under each parent
trs = htmlElement.xpath("//tr[position()>1]")
for tr in trs:
    # The leading "." makes the path relative to this tr, so only a tags inside it match
    href = tr.xpath(".//a/@href")[0]
    fullurl = 'http://hr.tencent.com/' + href
    title = tr.xpath("./td[1]//text()")[0]
    category = tr.xpath("./td[2]/text()")[0]
    nums = tr.xpath("./td[3]//text()")[0]
    address = tr.xpath("./td[4]/text()")[0]
    pubtime = tr.xpath("./td[5]/text()")[0]

    position = {
        'url': fullurl,
        'title': title,
        'category': category,
        'nums': nums,
        'address': address,
        'pubtime': pubtime,
    }
    positions.append(position)
print(positions)
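
Indexing with [0] raises IndexError as soon as one row lacks a cell or a link. A slightly more defensive variant is sketched below. The first_text helper, the sample table, and the href values are all hypothetical, chosen only to show the idea on data that can run offline.

```python
from lxml import etree


def first_text(node, expr, default=""):
    """Return the first string matched by expr (stripped), or default if nothing matches."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


# Hypothetical table: the last row has an empty category cell.
sample = """
<table>
  <tr><th>Title</th><th>Category</th></tr>
  <tr><td><a href="position_detail.php?id=1">Engineer</a></td><td>Tech</td></tr>
  <tr><td><a href="position_detail.php?id=2">Designer</a></td><td></td></tr>
</table>
"""
html = etree.HTML(sample)

positions = []
for tr in html.xpath("//tr[position()>1]"):
    positions.append({
        "url": first_text(tr, ".//a/@href"),
        "title": first_text(tr, "./td[1]//text()"),
        "category": first_text(tr, "./td[2]/text()", default="unknown"),
    })

print(positions)
```

The design choice here is to return a default instead of failing, which suits scraping jobs where a few incomplete rows should not stop the whole run.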

Douban movie scraper (now playing)

import requests
from lxml import etree
# 1. Fetch the page from the target site
headers={
'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
'Referer':'https://www.google.com/_/chrome/newtab?ie=UTF-8'
}
url = 'https://movie.douban.com/cinema/nowplaying/changsha/'
response= requests.get(url,headers=headers)
# print(response.text)
text =response.text
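# 2. Extract data from the fetched page according to a set of rules
html =etree.HTML(text)
# Both the "now playing" and "coming soon" sections use a ul with class 'lists'; the first one is "now playing"
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li") # all li children directly under the ul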
movies=[]
for li in lis:
    title=li.xpath("@data-title")[0]
    score=li.xpath("@data-score")[0]
    photo=li.xpath(".//img/@src")[0]
    movie = {
        'title': title,
        'score': score,
        'photo': photo
    }
    movies.append(movie)

print(movies)
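
Separating the parsing step into a function makes it testable without hitting the network. The sketch below is one way to do that. The parse_now_playing name and the sample markup are assumptions: the markup only imitates the attributes used above (class 'lists', data-title, data-score, img src) and is not copied from the real Douban page, whose structure may have changed.

```python
from lxml import etree


def parse_now_playing(text):
    """Extract title, score and poster URL from the first ul with class 'lists'."""
    html = etree.HTML(text)
    uls = html.xpath("//ul[@class='lists']")
    if not uls:
        return []  # page layout differs from what we expect
    movies = []
    for li in uls[0].xpath("./li"):
        movies.append({
            "title": li.get("data-title"),
            "score": li.get("data-score"),
            # None when the li has no poster image
            "photo": next(iter(li.xpath(".//img/@src")), None),
        })
    return movies


# Hypothetical markup with a "now playing" list followed by a "coming soon" list.
sample = """
<div id="nowplaying"><ul class="lists">
  <li data-title="Movie A" data-score="8.1"><img src="https://example.com/a.jpg"></li>
  <li data-title="Movie B" data-score="0"></li>
</ul></div>
<div id="upcoming"><ul class="lists">
  <li data-title="Movie C" data-score="0"><img src="https://example.com/c.jpg"></li>
</ul></div>
"""
movies = parse_now_playing(sample)
print(movies)
```

In a real run, the text argument would be response.text from the requests.get call shown earlier.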