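Installing libxml2 on Windows and Using XPath in Python

I wanted to use XPath to extract data (titles, body text, and so on) from pages fetched by a crawler, so I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this post is a short record of what I learned.

When installing an extension module, Python can use a helper toolkit, Setuptools, to install new Python packages and to manage packages that are already installed. Installers for the different platforms are available at http://pypi.python.org/pypi/setuptools; the toolkit mainly consists of Python Eggs and Easy Install. From what I found online, Easy Install seems to be the more commonly used of the two, and http://peak.telecommunity.com/DevCenter/EasyInstall introduces it as follows: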

Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.

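In other words, Easy Install is a Python module that makes it easy to install extension modules for Python.

Below we go through preparation, installation, and configuration step by step.

Preparation

The required packages and their download locations are:

Python 2.6 (the official Python site seemed to be unreachable at the time and I have forgotten where I downloaded it from, so search for it online)
libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, directory listing: http://xmlsoft.org/sources/win32/python/). Note that this file name is tagged py2.7 while the Python used here is 2.6; pick the build from the directory listing that matches your installed Python version.
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, download page: http://pypi.python.org/pypi/setuptools#downloads)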

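Installation

Step 1: Install Python

On Windows, just run the installer's .exe file; I won't go into the details here.

Step 2: Install the Easy Install tool

Python must already be installed. Then run the setuptools-0.6c11.win32-py2.6.exe installer prepared above; it finds the Python installation directory automatically and installs the tools into the corresponding directory. For example, my scripts directory is E:\Program Files\Python26\Scripts. To verify: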

E:\>cd E:\Program Files\Python26\Scripts
E:\Program Files\Python26\Scripts>easy_install --help

Global options:
  --verbose (-v)  run verbosely (default)
  --quiet (-q)    run quietly (turns verbosity off)
  --dry-run (-n)  don't actually do anything
  --help (-h)     show detailed help message

Options for 'easy_install' command:
  --prefix                       installation prefix
  --zip-ok (-z)                  install package as a zipfile
  --multi-version (-m)           make apps have to require() a version
  --upgrade (-U)                 force upgrade (searches PyPI for latest
                                 versions)
  --install-dir (-d)             install package to DIR
  --script-dir (-s)              install scripts to DIR
  --exclude-scripts (-x)         Don't install scripts
  --always-copy (-a)             Copy all needed packages to install dir
  --index-url (-i)               base URL of Python Package Index
  --find-links (-f)              additional URL(s) to search for packages
  --delete-conflicting (-D)      no longer needed; don't use this
  --ignore-conflicts-at-my-risk  no longer needed; don't use this
  --build-directory (-b)         download/extract/build in DIR; keep the
                                 results
  --optimize (-O)                also compile with optimization: -O1 for
                                 "python -O", -O2 for "python -OO", and -O0 to
                                 disable [default: -O0]
  --record                       filename in which to record list of installed
                                 files
  --always-unzip (-Z)            don't install as a zipfile, no matter what
  --site-dirs (-S)               list of directories where .pth files work
  --editable (-e)                Install specified packages in editable form
  --no-deps (-N)                 don't install dependencies
  --allow-hosts (-H)             pattern(s) that hostnames must match
  --local-snapshots-ok (-l)      allow building eggs from local checkouts

usage: easy_install-script.py [options] requirement_or_url ...
   or: easy_install-script.py --help

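If you see the easy_install command options shown above, the installation succeeded.

Step 3: Install libxml2

To install libxml2, run libxml2-python-2.7.7.win32-py2.7.exe. In my setup this only unpacked the module into the corresponding directory and was not enough to work with from Python code; I also needed to install the lxml library through Easy Install. lxml is a Python binding for the C libraries libxml2 and libxslt, which makes HTML and XML parsing fast; see http://lxml.de/index.html for details. Installing lxml uses the Easy Install script. For example, my scripts directory is E:\Program Files\Python26\Scripts, so I ran: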

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2

The installation output looks like this:

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
Searching for lxml==2.2.2
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.2.2
Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
Processing lxml-2.2.2-py2.6-win32.egg
creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
Adding lxml 2.2.2 to easy-install.pth file

Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg

Processing dependencies for lxml==2.2.2
Finished processing dependencies for lxml==2.2.2
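
As a quick check after the installers finish, the sketch below tries to import each module and reports whether it is available. It assumes the module names are `libxml2` (from the libxml2-python installer) and `lxml` (from the egg installed above); the helper name `is_importable` is made up for this illustration.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    # Assumed module names for the two packages installed in the steps above.
    for name in ('libxml2', 'lxml'):
        status = 'ok' if is_importable(name) else 'missing'
        print('%s: %s' % (name, status))
```

If `lxml` is reported as missing, the usual suspects are an egg built for a different Python version or a second Python installation picking up the command.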

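Extracting with XPath

Next, we use XPath to extract data from a web page. I used a Python IDE, EasyEclipse for Python (Version: 1.3.1), which can create Pydev projects directly; consult its documentation for usage details.

The following Python code verifies that XPath can be used for targeted extraction of web page data: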

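import io
from lxml import etree

def readFile(file, decoding):
    # Read the whole file as text in the given encoding.
    # Returns '' if the file cannot be opened or decoded.
    html = ''
    try:
        with io.open(file, encoding=decoding) as f:
            html = f.read()
    except (IOError, UnicodeDecodeError):
        pass
    return html

def extract(file, decoding, xpath):
    # Parse the HTML and return the nodes matched by the XPath expression.
    html = readFile(file, decoding)
    if not html:
        # Nothing to parse: the file was missing, unreadable, or empty.
        return []
    tree = etree.HTML(html)
    return tree.xpath(xpath)

if __name__ == '__main__':
    # Select every table-of-contents back-reference link inside an <h3> heading.
    sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
    for title in sections:
        print(title.text)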

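First, download the source of the page http://peak.telecommunity.com/DevCenter/EasyInstall and save it to the file peak.txt, encoded as UTF-8.

Then the Python code reads the file, uses XPath to extract the title of each section on the page, and prints the titles to the console. The result is: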

Troubleshooting
Windows Notes
Multiple Python Versions
Restricting Downloads with 
Installing on Un-networked Machines
Packaging Others' Projects As Eggs
Creating your own Package Index
Password-Protected Sites
Controlling Build Options
Editing and Viewing Source Packages
Dealing with Installation Conflicts
Compressed Installation
Administrator Installation
Mac OS X "User" Installation
Creating a "Virtual" Python
"Traditional" 
Backward Compatibility
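
Two of the titles above look cut off ("Restricting Downloads with " and "\"Traditional\" "). A likely cause is that an element's `.text` holds only the text before its first child element, so a heading link that contains nested markup loses everything from that child onward. The sketch below reproduces the effect with the standard library's `xml.etree.ElementTree`, used here instead of lxml only so that it runs without extra packages on a current Python; the sample markup is invented for illustration and is not copied from the page. lxml elements offer `itertext()` as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical heading markup: the link text is followed by a nested <tt> element.
SAMPLE = (
    "<h3><a class='toc-backref' href='#id1'>"
    "Restricting Downloads with <tt>--allow-hosts</tt>"
    "</a></h3>"
)


def full_text(element):
    """Join the text of an element and all of its descendants."""
    return ''.join(element.itertext())


link = ET.fromstring(SAMPLE).find('a')
print(repr(link.text))        # only the text before the nested <tt> element
print(repr(full_text(link)))  # the complete title
```

Replacing `title.text` with a helper like `full_text(title)` in the loop should then print the complete section titles.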

If you are familiar enough with XPath, then with the help of libxml2 you can extract any content you want from a web page.
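
To give a feel for what other expressions look like, here is a minimal sketch of a few common selection patterns. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, rather than lxml, so that it runs without extra packages; the page fragment and variable names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment used only for this illustration.
PAGE = """
<html><body>
  <div id="content">
    <h3><a class="toc-backref" href="#s1">First Section</a></h3>
    <p>Intro text.</p>
    <h3><a class="toc-backref" href="#s2">Second Section</a></h3>
    <p>More text.</p>
  </div>
  <div id="footer"><p>Footer text.</p></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Section titles: the same idea as //h3//a[@class='toc-backref'] in the post.
titles = [a.text for a in root.findall(".//h3//a[@class='toc-backref']")]

# Link targets: read the href attribute from each matched element.
targets = [a.get('href') for a in root.findall(".//a[@class='toc-backref']")]

# Paragraphs that are direct children of one particular container.
body_paragraphs = [p.text for p in root.findall(".//div[@id='content']/p")]

print(titles)
print(targets)
print(body_paragraphs)
```

With lxml the same selections can be written as full XPath, including forms ElementTree does not accept, such as selecting attribute values directly.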

