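I wanted to use XPath to extract data (titles, body text, and so on) from web pages fetched by a crawler, so I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this post is a short record of what I learned.
When installing an extension module for Python, you can use the helper toolkit Setuptools to install new Python packages and to manage the packages already installed. Installers for different platforms are available at http://pypi.python.org/pypi/setuptools. Setuptools mainly provides the Python Eggs package format and the Easy Install tool. From what I found online, Easy Install is the part people use most often, and http://peak.telecommunity.com/DevCenter/EasyInstall introduces it as follows: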
Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.
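In other words, Easy Install is a Python module that makes it easy to download and install additional Python packages.
The sections below walk through preparation, installation, and configuration step by step.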
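Preparation
The required packages and their download locations are listed below:
Python 2.6 (the official Python site seemed to be unreachable at the time, and I no longer remember where I downloaded it; a web search should turn it up)
libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, directory listing: http://xmlsoft.org/sources/win32/python/). Note that this file name is tagged py2.7 while the rest of this walkthrough uses Python 2.6, so check the directory listing for a build that matches your Python version.
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, download page: http://pypi.python.org/pypi/setuptools#downloads)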
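The setuptools download link above carries an `#md5=...` fragment, which gives a checksum for the file. As a rough sketch (the helper names `md5_of_file` and `md5_matches` and the local file location are my own assumptions, not part of the original setup), a downloaded installer could be checked against that value before running it:

```python
import hashlib
import os

def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    handle = open(path, 'rb')
    try:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    finally:
        handle.close()
    return digest.hexdigest()

def md5_matches(path, expected_hex):
    """Compare the file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected_hex.lower()

if __name__ == '__main__':
    # Assumed location: the installer saved in the current directory.
    installer = 'setuptools-0.6c11.win32-py2.6.exe'
    expected = '1509752c3c2e64b5d0f9589aafe053dc'  # from the URL fragment above
    if os.path.exists(installer):
        print('checksum ok: %s' % md5_matches(installer, expected))
    else:
        print('installer not found: %s' % installer)
```

An MD5 match only shows that the download was not corrupted in transit; it is not a strong security guarantee.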
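Installation
Step 1: Install Python
On Windows you only need to run the installer executable, so I won't go into detail here.
Step 2: Install the Easy Install tool
Python must already be installed. Then run the setuptools-0.6c11.win32-py2.6.exe installer prepared above; it locates the Python installation directory automatically and installs the tools into the matching directories. My scripts directory, for example, is E:\Program Files\Python26\Scripts. To verify: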
E:\>cd E:\Program Files\Python26\Scripts
E:\Program Files\Python26\Scripts>easy_install --help
Global options:
  --verbose (-v)  run verbosely (default)
  --quiet (-q)    run quietly (turns verbosity off)
  --dry-run (-n)  don't actually do anything
  --help (-h)     show detailed help message

Options for 'easy_install' command:
  --prefix                       installation prefix
  --zip-ok (-z)                  install package as a zipfile
  --multi-version (-m)           make apps have to require() a version
  --upgrade (-U)                 force upgrade (searches PyPI for latest
                                 versions)
  --install-dir (-d)             install package to DIR
  --script-dir (-s)              install scripts to DIR
  --exclude-scripts (-x)         Don't install scripts
  --always-copy (-a)             Copy all needed packages to install dir
  --index-url (-i)               base URL of Python Package Index
  --find-links (-f)              additional URL(s) to search for packages
  --delete-conflicting (-D)      no longer needed; don't use this
  --ignore-conflicts-at-my-risk  no longer needed; don't use this
  --build-directory (-b)         download/extract/build in DIR; keep the
                                 results
  --optimize (-O)                also compile with optimization: -O1 for
                                 "python -O", -O2 for "python -OO", and -O0 to
                                 disable [default: -O0]
  --record                       filename in which to record list of installed
                                 files
  --always-unzip (-Z)            don't install as a zipfile, no matter what
  --site-dirs (-S)               list of directories where .pth files work
  --editable (-e)                Install specified packages in editable form
  --no-deps (-N)                 don't install dependencies
  --allow-hosts (-H)             pattern(s) that hostnames must match
  --local-snapshots-ok (-l)      allow building eggs from local checkouts

usage: easy_install-script.py [options] requirement_or_url ...
   or: easy_install-script.py --help
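If you see the easy_install command options listed above, the installation succeeded.
Step 3: Install libxml2
Install libxml2 by running libxml2-python-2.7.7.win32-py2.7.exe. In my case this only unpacked the module into the corresponding directory, and it still could not be used from Python code. I also had to install the lxml library through Easy Install. lxml is a Python binding built on the C libraries libxml2 and libxslt, which makes parsing and processing HTML and XML fast; see http://lxml.de/index.html for details. Installing lxml uses the Easy Install script. My scripts directory is E:\Program Files\Python26\Scripts, so I ran: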
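Besides running `easy_install --help`, the installation can also be checked from inside Python. The snippet below is only a sketch: `is_importable` is a hypothetical helper name, and it merely tests whether the `setuptools` package can be imported by the interpreter that runs it.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True

if __name__ == '__main__':
    # setuptools is the package that provides the easy_install command.
    print('setuptools importable: %s' % is_importable('setuptools'))
```

If this prints False while `easy_install --help` works, the script is probably running under a different Python installation than the one setuptools was installed into.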
E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
The installation output looks like this:
E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
Searching for lxml==2.2.2
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.2.2
Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
Processing lxml-2.2.2-py2.6-win32.egg
creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
Adding lxml 2.2.2 to easy-install.pth file
Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Processing dependencies for lxml==2.2.2
Finished processing dependencies for lxml==2.2.2
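To confirm that the egg installed above is actually usable, one option is to import lxml and print the version tuples it exposes. This is a small sketch; `lxml_versions` is a name I made up, while `LXML_VERSION` and `LIBXML_VERSION` are attributes of `lxml.etree`.

```python
def lxml_versions():
    """Return (lxml version, libxml2 version) tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LXML_VERSION, etree.LIBXML_VERSION

if __name__ == '__main__':
    versions = lxml_versions()
    if versions is None:
        print('lxml is not importable from this interpreter')
    else:
        # Each entry is a tuple of integers, e.g. major, minor, patch.
        print('lxml %s, libxml2 %s' % versions)
```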
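Extracting with XPath
Next, we use XPath to extract data from a web page. I used a Python IDE, EasyEclipse for Python (Version: 1.3.1), which can create a Pydev Project directly; see its documentation for usage details.
The following Python code verifies that XPath can be used for targeted extraction of web page data: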
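from lxml import etree

def readFile(file, decoding):
    # Read the whole file and decode it; return '' if it cannot be read or decoded.
    try:
        f = open(file, 'rb')
        try:
            return f.read().decode(decoding)
        finally:
            f.close()
    except (IOError, UnicodeDecodeError):
        return ''

def extract(file, decoding, xpath):
    # Parse the HTML and return the nodes matched by the XPath expression.
    html = readFile(file, decoding)
    if not html:
        return []  # nothing to parse; etree.HTML would fail on an empty document
    tree = etree.HTML(html)
    return tree.xpath(xpath)

if __name__ == '__main__':
    # Select the <a class="toc-backref"> links inside every <h3> heading.
    sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
    for title in sections:
        print(title.text)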
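To run it, first download the source of the page http://peak.telecommunity.com/DevCenter/EasyInstall and save it to the file peak.txt, encoded as UTF-8.
The script then reads that file, uses XPath to extract the title of each section on the page, and prints the titles to the console. The result looks like this: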
Troubleshooting
Windows Notes
Multiple Python Versions
Restricting Downloads with
Installing on Un-networked Machines
Packaging Others' Projects As Eggs
Creating your own Package Index
Password-Protected Sites
Controlling Build Options
Editing and Viewing Source Packages
Dealing with Installation Conflicts
Compressed Installation
Administrator Installation
Mac OS X "User" Installation
Creating a "Virtual" Python
"Traditional"
Backward Compatibility
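Two entries in the output above look cut off ("Restricting Downloads with" and "Traditional"). A likely explanation, assuming those headings contain nested inline markup on the real page, is that an element's `.text` only holds the text before its first child element. The sketch below uses a made-up HTML fragment, not the actual page, to show the difference between `.text` and joining `itertext()`; `full_text` is a hypothetical helper name.

```python
from lxml import etree

# Made-up fragment, loosely modeled on a heading that contains inline markup.
SAMPLE = """
<html><body>
<h3><a class="toc-backref" href="#id1">Restricting Downloads with <tt>--allow-hosts</tt></a></h3>
<h3><a class="toc-backref" href="#id2">Windows Notes</a></h3>
</body></html>
"""

def full_text(element):
    """Concatenate all text inside `element`, including text in nested children."""
    return ''.join(element.itertext())

tree = etree.HTML(SAMPLE)
links = tree.xpath("//h3//a[@class='toc-backref']")

for link in links:
    # Left: text before the first child only. Right: all nested text joined.
    print('%r -> %r' % (link.text, full_text(link)))
```

For the first heading, `.text` stops just before the nested `<tt>` element, while the joined text also includes the option name inside it.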
If you know XPath well enough, then with the help of libxml2 you can extract any content you want from a web page.
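As an illustration of that last point, here are a few other common XPath patterns applied with lxml: text nodes, the string value of an element, and attribute selection. The HTML fragment and the `id` values in it are invented for this sketch and do not come from any real page.

```python
from lxml import etree

# Invented sample page with a title, a content block, and a footer.
SAMPLE = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="content">
      <p>First paragraph.</p>
      <p>Second paragraph with a <a href="http://example.com/">link</a>.</p>
    </div>
    <div id="footer"><p>Footer text.</p></div>
  </body>
</html>
"""

tree = etree.HTML(SAMPLE)

# text() selects text nodes; the result is a list of strings.
title = tree.xpath('//title/text()')

# string() returns the full text of an element, including nested children.
paragraphs = [p.xpath('string()') for p in tree.xpath("//div[@id='content']/p")]

# @href selects attribute values instead of elements.
hrefs = tree.xpath("//div[@id='content']//a/@href")

print(title)
print(paragraphs)
print(hrefs)
```

Restricting the paragraph query to the `content` block is what keeps the footer text out of the result; the same idea applies when separating an article body from navigation or boilerplate.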