Strictly speaking, this article on form interaction and the next one on CAPTCHA handling are not about web crawlers per se, but about web bots in the broader sense. Using a web bot removes the hurdle that form interaction poses when extracting data.
1. Manually submitting a login form with a POST request
First, manually register an account on the example website. Registration requires a CAPTCHA; the next article covers handling CAPTCHAs.
1.1 Analyzing the form
At the login URL http://127.0.0.1:8000/places/default/user/login we get the form shown below. The login form contains several important parts:
- The action attribute of the form tag sets the URL to which the form data is submitted. Here it is #, i.e. the same URL as the login form itself.
- The enctype attribute of the form tag sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, meaning all non-alphanumeric characters are converted to their hexadecimal ASCII values. For uploading binary files, multipart/form-data is the better choice: it does not encode the input (so efficiency is unaffected) but sends it as multiple parts using the MIME protocol, the same standard used for email transport. Documentation: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- The method attribute of the form tag: here post means the form data is sent to the server in the request body.
- The name attribute of each input tag sets the field name under which that value is submitted to the server.
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
<table>
<tr id="auth_user_email__row">
<td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
<td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_password__row">
<td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
<td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_remember_me__row">
<td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
<td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="submit_record__row">
<td class="w2p_fl"></td><td class="w2p_fw">
<input type="submit" value="Log In" />
<button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
</td>
<td class="w2p_fc"></td>
</tr>
</table>
<div style="display:none;">
<input name="_next" type="hidden" value="/places/default/index" />
<input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
<input name="_formname" type="hidden" value="login" />
</div>
</form>
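As an aside, the name/value pairs of a form like this (including its hidden fields) can also be collected without any third-party library, using only Python's standard library HTML parser; a minimal sketch in Python 3 syntax, fed a trimmed-down copy of the form above:

```python
from html.parser import HTMLParser

class FormInputParser(HTMLParser):
    """Collect the name/value attributes of every <input> tag in the page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if attrs.get('name'):
                self.fields[attrs['name']] = attrs.get('value', '')

html = '''<form action="#" method="post">
<input name="email" type="text" value="" />
<input name="_formkey" type="hidden" value="7b1add4b" />
<input name="_formname" type="hidden" value="login" />
</form>'''

parser = FormInputParser()
parser.feed(html)
print(parser.fields['_formname'])   # → login
```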
1.2 Manually testing form submission with a POST request
On a successful login the site redirects to the home page; otherwise it returns to the login page. Below is a first attempt at automated login. Clearly, the login fails!
>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='[email protected]'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
The login fails because the submission also needs the hidden _formkey field. This unique ID is used to prevent duplicate form submissions: a different ID is generated every time the page loads, and the server uses the submitted ID to decide whether that form instance has already been submitted. The following code retrieves that value:
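To see why a one-time form key blocks duplicate submissions, here is a toy server-side sketch (purely illustrative; the function names and the set-based store are assumptions, not the example site's actual implementation):

```python
import uuid

issued_keys = set()

def render_login_form():
    """Server side: issue a fresh key each time the form page is rendered."""
    key = str(uuid.uuid4())
    issued_keys.add(key)
    return key

def handle_submit(formkey):
    """Server side: each key is valid exactly once; replays are rejected."""
    if formkey in issued_keys:
        issued_keys.discard(formkey)
        return 'accepted'
    return 'rejected'

key = render_login_form()
print(handle_submit(key))   # → accepted
print(handle_submit(key))   # → rejected (same key submitted twice)
```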
>>>
>>> import lxml.html
>>> def parse_form(html):
...     tree = lxml.html.fromstring(html)
...     data = {}
...     for e in tree.cssselect('form input'):
...         if e.get('name'):
...             data[e.get('name')] = e.get('value')
...     return data
...
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
'_formname': 'login',
'_next': '/places/default/index',
'email': '',
'password': '',
'remember_me': 'on'}
>>>
Below is a new version of the automated login that includes _formkey and the other hidden fields. It still fails!
>>>
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
That is because we are missing one vital component: cookies. When a regular user loads the login form, the _formkey value is saved in a cookie, and that value is compared against the _formkey value submitted with the login form data. Below is the code with cookie support added via the urllib2.HTTPCookieProcessor class. Finally, the login succeeds!
>>>
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>>
>>> html=opener.open(LOGIN_URL).read() #opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request) #opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>>
1.3 Complete source code for manual POST-request login
# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '[email protected]'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'

def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener

def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

def main():
    #login_basic()
    #login_formkey()
    login_cookies()

if __name__ == '__main__':
    main()
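Note that the code above targets Python 2, where urllib, urllib2 and cookielib are separate modules. Under Python 3 these became urllib.parse, urllib.request and http.cookiejar; a rough equivalent of the working login_cookies flow could be sketched as follows (untested against the example site; the stdlib html.parser is used in place of lxml here purely so the sketch has no third-party dependencies):

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser

LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'

def parse_form(html):
    """Collect the name/value pairs of all <input> elements in the page."""
    fields = {}
    class _InputCollector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == 'input':
                d = dict(attrs)
                if d.get('name'):
                    fields[d['name']] = d.get('value', '')
    _InputCollector().feed(html)
    return fields

def login_cookies(email, password):
    # The CookieJar keeps the session cookie issued with the form page,
    # so the submitted _formkey matches the value stored server-side.
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read().decode('utf-8')
    data = parse_form(html)
    data['email'] = email
    data['password'] = password
    # In Python 3 the request body must be bytes, not str
    body = urllib.parse.urlencode(data).encode('utf-8')
    response = opener.open(LOGIN_URL, body)
    return opener, response.geturl()
```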
2. Logging in with cookies loaded from the Firefox browser
First we log in manually in the Firefox browser, then close Firefox, and then reuse the cookies it saved from a Python script, achieving an automated login.
2.1 Location of the session file
Firefox stores cookies in a SQLite database and sessions in a JSON file, and both can be read directly from Python. For logging in, we only need the session data. The location of Firefox's session file depends on the operating system:
- Linux: ~/.mozilla/firefox/*.default/sessionstore.js
- OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista and later: %APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js
Below is a helper function that returns the path to the session file:
def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]
Note: the glob module returns all files matching the given path pattern.
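A quick self-contained demonstration of the pattern matching glob performs, using a throwaway temporary directory instead of a real Firefox profile:

```python
import glob
import os
import tempfile

# Create a fake profile directory matching the '*.default' pattern
root = tempfile.mkdtemp()
profile = os.path.join(root, '78n340f7.default')
os.makedirs(profile)
open(os.path.join(profile, 'sessionstore.js'), 'w').close()

# glob expands the '*' wildcard against the filesystem
matches = glob.glob(os.path.join(root, '*.default', 'sessionstore.js'))
print(len(matches))   # → 1
```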
2.2 Cookie contents in the Firefox browser
Below are the contents of the Firefox session file on a Linux system:
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json datareporting key3.db prefs.js storage
blocklist.xml extensions logins.json revocations.txt storage.sqlite
bookmarkbackups extensions.ini mimeTypes.rdf saved-telemetry-pings times.json
cert8.db extensions.json minidumps search.json.mozlz4 webapps
compatibility.ini features permissions.sqlite secmod.db webappsstore.sqlite
containers.json formhistory.sqlite places.sqlite sessionCheckpoints.json xulstore.json
content-prefs.sqlite gmp places.sqlite-shm sessionstore-backups
cookies.sqlite gmp-gmpopenh264 places.sqlite-wal sessionstore.js
crashes healthreport pluginreg.dat SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js
{"version":["sessionrestore",1],
"windows":[{
...
"cookies":[
{"host":"127.0.0.1",
"value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
"path":"/",
"name":"session_id_welcome",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"True",
"path":"/",
"name":"session_id_places",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
"path":"/",
"name":"session_data_places",
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
],
"title":"Example web scraping website",
"_shouldRestore":true,
"closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$
Based on this session storage structure, we use the following code to parse the session into a CookieJar object.
def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj
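The Cookie constructor above takes a long run of positional arguments, which is easy to get wrong. Here is a small annotated sketch of the same call (written against Python 3's http.cookiejar, the renamed cookielib; the name and value are made-up sample data):

```python
import http.cookiejar
import time

cj = http.cookiejar.CookieJar()
c = http.cookiejar.Cookie(
    0, 'session_id_welcome', '127.0.0.1-abc',  # version, name, value
    None, False,                               # port, port_specified
    '127.0.0.1', False, False,                 # domain, domain_specified, domain_initial_dot
    '/', False,                                # path, path_specified
    False,                                     # secure
    int(time.time()) + 3600 * 24 * 7,          # expires (one week from now)
    False, None, None, {})                     # discard, comment, comment_url, rest
cj.set_cookie(c)
print(len(cj))   # → 1
```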
2.3 Testing login with the loaded cookies
session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()
tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()
If the result is Login, the cookies were not loaded correctly. In that case, check whether you have actually logged in to the example website in Firefox. If instead you get the output below, showing Welcome followed by the user's first name, the login succeeded.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_welcome',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_welcome',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_places',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'True'}
{u'host': u'127.0.0.1',
u'name': u'session_data_places',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
If you want to load cookies from other browsers, you can use the browsercookie module. Install it with pip install browsercookie; documentation: https://pypi.python.org/pypi/browsercookie
2.4 Source code for cookie-based login
# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def login_firefox():
    """load cookies from firefox
    """
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()
    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

def main():
    login_firefox()

if __name__ == '__main__':
    main()
3. Automating form submission with the high-level Mechanize module
The Mechanize module simplifies form submission. Install it first with: pip install mechanize
3.1 Using the high-level Mechanize module to automate form submission and update page content after login
# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def mechanize_edit():
    """Use mechanize to increment population
    """
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()
    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()
    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']

if __name__ == '__main__':
    mechanize_edit()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3mechanize_edit.py
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
<TextControl(email=)>
<PasswordControl(password=)>
<CheckboxControl(remember_me=[on])>
<SubmitControl(<None>=Log In) (readonly)>
<SubmitButtonControl(<None>=) (readonly)>
<HiddenControl(_next=/places/default/index) (readonly)>
<HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
<HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
Documentation: http://wwwsearch.sourceforge.net/mechanize/
3.2 Updating page content after login with the plain approach
# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']

if __name__ == '__main__':
    edit_country()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3edit_country.py
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
'_formname': 'places/5087',
'area': '9596960.00',
'capital': 'Beijing',
'continent': 'AS',
'country': 'China',
'currency_code': 'CNY',
'currency_name': 'Yuan Renminbi',
'id': '5087',
'iso': 'CN',
'languages': 'zh-CN,yue,wuu,dta,ug,za',
'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
'phone': '86',
'population': '1330044001',
'postal_code_format': '######',
'postal_code_regex': '^(\\d{6})$',
'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
Wu_Being blog notice: reposts of this blog are welcome; please credit the original post and link back to it. Thanks!
Python crawler series: "[Python Crawler 6] Form Interaction" http://blog.csdn.net/u014134180/article/details/55507020
GitHub code for the Python crawler series: https://github.com/1040003585/WebScrapingWithPython