Python Web Scraping: Basics of the lxml Data-Parsing Module (with an Introduction to XPath and Parsers)

Original

insist_way

2019-09-09 13:09

Introduction:

I have recently been learning web scraping with Python, and these are my study notes on lxml, a data-parsing module.

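Overview of lxml, XPath, and parsers:

lxml is a Python parsing library built on the C libraries libxml2 and libxslt. It parses both HTML and XML, supports XPath queries, and is very fast. XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents.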
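As a minimal sketch of the workflow described above, the following parses a small HTML string and queries it with XPath. The HTML snippet and the variable names are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html_text = """
<html><body>
<div id="menu">
  <a href="/home">Home</a>
  <a href="/about">About</a>
</div>
</body></html>
"""

tree = etree.HTML(html_text)  # parse the string with lxml's HTML parser

# An XPath ending in @href returns the attribute values as strings.
hrefs = tree.xpath('//div[@id="menu"]/a/@href')

# An XPath ending in text() returns the text nodes as strings.
labels = tree.xpath('//div[@id="menu"]/a/text()')

print(hrefs)
print(labels)
```

The same xpath() call works on any element of the tree, so a query can be narrowed step by step.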

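Node relationships in XML/HTML documents:

Parent: the node that directly contains the current node

Children: the nodes directly contained by the current node

Sibling: a node that shares the same parent as the current node

Ancestor: the parent, the parent's parent, and so on up to the root

Descendant: the children, the children's children, and so on down the tree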
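The five relationships above can be explored directly in lxml. This is a small sketch on an invented document; the library/shelf/book structure is hypothetical and is there only to give each relationship something to point at.

```python
from lxml import etree

# Hypothetical document, used only to illustrate the relationships.
doc = etree.fromstring(
    "<library>"
    "<shelf>"
    "<book><title>A</title><year>2001</year></book>"
    "<book><title>B</title><year>2002</year></book>"
    "</shelf>"
    "</library>"
)

title = doc.xpath('//title')[0]  # the first <title> element

# Parent: the element that directly contains <title>.
parent = title.getparent().tag

# Children: iterating over an element yields its direct children.
children = [child.tag for child in title.getparent()]

# Siblings: elements after <title> that share its parent.
siblings = [s.tag for s in title.xpath('following-sibling::*')]

# Ancestors: every element on the path from <title> up to the root.
ancestors = [a.tag for a in title.xpath('ancestor::*')]

# Descendants: every element nested anywhere below <shelf>.
descendants = [d.tag for d in doc.xpath('//shelf/descendant::*')]

print(parent, children, siblings, ancestors, descendants)
```

getparent() is an lxml method, while following-sibling::, ancestor:: and descendant:: are standard XPath axes.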

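XPath syntax:

nodename  selects the child elements of the current node that are named nodename

//  selects matching nodes at any depth, regardless of their position in the document

/  selects from the root node

.  selects the current node

..  selects the parent of the current node

@  selects attributes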
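To make the syntax list above concrete, here is a short sketch that uses each expression once. The catalog document is a hypothetical example and does not come from the original notes.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    '<catalog>'
    '<item id="a1"><name>pen</name></item>'
    '<group><item id="b2"><name>ink</name></item></group>'
    '</catalog>'
)

# nodename: only the <item> elements that are direct children of <catalog>.
direct_items = root.xpath('item')

# //: every <item> element, at any depth.
all_items = root.xpath('//item')

# /: an absolute path that starts at the root node.
from_root = root.xpath('/catalog/item/name/text()')

first = all_items[0]

# . selects the current node itself.
same_node = first.xpath('.')[0]

# .. selects the parent of the current node.
parent_tag = first.xpath('..')[0].tag

# @ selects attributes; the values come back as strings.
ids = root.xpath('//item/@id')

print(len(direct_items), len(all_items), from_root, parent_tag, ids)
```

Note the difference between the first two queries: item finds one element, while //item also finds the one nested inside <group>.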

Parser comparison:

Parser          Speed     Difficulty
re              fastest   hard
BeautifulSoup   slow      very easy
lxml            fast      easy
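To give a feel for the difference in difficulty, the sketch below extracts the same links once with re and once with lxml. The one-line page is a shortened, hypothetical variant of the sample HTML used in these notes. BeautifulSoup is left out only to keep the sketch to a single third-party dependency.

```python
import re
from lxml import etree

# Shortened, hypothetical page fragment.
page = (
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)

# re: the pattern is tied to the exact attribute order and quoting,
# so it stops matching as soon as the markup changes shape.
re_links = re.findall(r'<a\s+href="([^"]+)"', page)

# lxml: the query describes the structure, not the character layout.
lxml_links = etree.HTML(page).xpath('//a/@href')

print(re_links)
print(lxml_links)
```

Both approaches return the same two URLs here, but only the XPath version keeps working if, for example, the id attribute is moved in front of href.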

Study notes:

# -*- coding: utf-8 -*-
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

selector = etree.HTML(html_doc)  # parse the HTML string into an element tree

# extract the href of every link inside <p class="story">
links = selector.xpath('//p[@class="story"]/a/@href')
for link in links:
    print(link)

xml_test = """
<?xml version='1.0'?>
<?xml-stylesheet type="text/css" href="first.css"?>
<notebook>
<user>
<name>lizibin</name>
<concat>
<email>[email protected]</email>
</concat>
</user>
<user>
<address>shanghai</address>
<concat>
<email>[email protected]</email>
</concat>
</user>
<user>
<name>liqian</name>
<concat>
<email>[email protected]</email>
</concat>
</user>
<user>
<name>qiangli</name>
<concat>
<email>[email protected]</email>
</concat>
</user>
<user>
<name>buzhidao</name>
<concat>
<email>[email protected]</email>
</concat>
</user>
</notebook>
"""

# The XML could also be fetched from a remote server:
# r = requests.get('http://xxx.com/abc.xml')
# etree.HTML(r.text.encode('utf-8'))

xml_code = etree.HTML(xml_test)  # build an element tree (parsed with the lenient HTML parser)

# select every name node; this prints the Element objects (with their memory addresses), not their text
print(xml_code.xpath('//name'))

# select the text of every name node
print(xml_code.xpath('//name/text()'))
print('')

# select the notebook node and use it as the starting point for the next queries
notebook = xml_code.xpath('//notebook')

# text of the first name node under notebook
print(notebook[0].xpath('.//name/text()')[0])

first_name = notebook[0].xpath('.//name')[0]

# text of the address node that is a sibling of the first name node
print(first_name.xpath('../address/text()'))

# value of the lang attribute on that address node
print(first_name.xpath('../address/@lang'))

# name of the first user under notebook
print(xml_code.xpath('//notebook/user[1]/name/text()'))

# name of the last user under notebook
print(xml_code.xpath('//notebook/user[last()]/name/text()'))

# name of the second-to-last user under notebook
print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))

# addresses of the first two users under notebook
print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))

# names of all users whose category attribute is "cb"
print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))

# names of all users whose age is below 30
print(xml_code.xpath('//notebook/user[age<30]/name/text()'))

# class attribute, then name, of every user whose class attribute contains "dba"
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))
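The predicate queries above depend on age elements and on category, class and lang attributes. The sketch below runs the same kinds of query against a small, complete document so that each predicate has something to match. Every name, age and attribute value in it is invented for illustration. It also uses etree.fromstring (lxml's XML parser) rather than etree.HTML, since the sample is well-formed XML.

```python
from lxml import etree

# Hypothetical data: all names, ages and attribute values are made up.
xml_text = """<notebook>
  <user id="1" category="cb" class="dba python">
    <name>alice</name><age>25</age><address lang="en">london</address>
  </user>
  <user id="2" category="web" class="linux">
    <name>bob</name><age>34</age><address lang="zh">shanghai</address>
  </user>
  <user id="3" category="cb" class="dba">
    <name>carol</name><age>29</age><address lang="en">paris</address>
  </user>
</notebook>"""

root = etree.fromstring(xml_text)  # strict XML parser, no <html><body> wrapper

# Positional predicates: XPath positions start at 1, not 0.
first = root.xpath('/notebook/user[1]/name/text()')
last = root.xpath('/notebook/user[last()]/name/text()')
second_last = root.xpath('/notebook/user[last()-1]/name/text()')
first_two = root.xpath('/notebook/user[position()<3]/address/text()')

# Attribute predicate: users whose category attribute equals "cb".
cb_names = root.xpath('/notebook/user[@category="cb"]/name/text()')

# Child-value predicate: the text of <age> is compared as a number.
young = root.xpath('/notebook/user[age<30]/name/text()')

# contains() does a substring match on the attribute value.
dba_classes = root.xpath('/notebook/user[contains(@class,"dba")]/@class')

# Attribute of a nested element.
first_lang = root.xpath('/notebook/user[1]/address/@lang')

print(first, last, second_last, first_two)
print(cb_names, young, dba_classes, first_lang)
```

Because contains() is a plain substring test, it would also match a class value such as "notdba"; for whitespace-separated class lists a stricter test may be needed.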
