【Python Web Scraping 6】Form Interaction


Strictly speaking, the form interaction in this post and the CAPTCHA handling in the next are not web crawling in the narrow sense, but web bots in the broader sense. Using a web bot removes the form-interaction hurdle that otherwise stands between us and the data we want to extract.

1. Submitting a login form manually with a POST request

We first register an account on the example site by hand. Registration requires a CAPTCHA; handling CAPTCHAs is covered in the next post.

1.1 Analyzing the form

Loading the login page at http://127.0.0.1:8000/places/default/user/login returns the form shown below. The login form has several important components:

  • The form tag's action attribute sets the address the form data is submitted to. Here it is #, i.e. the same URL as the login form itself.
  • The form tag's enctype attribute sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, which means every non-alphanumeric character is converted to its hexadecimal ASCII value (see the short sketch after the form). For uploading binary files, multipart/form-data is the better choice: instead of encoding the input, which would hurt efficiency, it sends the data unmodified as multiple parts using the MIME protocol, the same standard used for email. Docs: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
  • The form tag's method attribute: post here means the form data is submitted to the server in the request body.
  • Each input tag's name attribute sets the name of the corresponding field when it is submitted to the server.
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
	<table>
		<tr id="auth_user_email__row">
			<td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
			<td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="auth_user_password__row">
			<td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
			<td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="auth_user_remember_me__row">
			<td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
			<td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
			<td class="w2p_fc"></td>
		</tr>
		<tr id="submit_record__row">
			<td class="w2p_fl"></td><td class="w2p_fw">
				<input type="submit" value="Log In" />
				<button class="btn w2p-form-button" onclick="window.location=&#x27;/places/default/user/register&#x27;;return false">Register</button>
			</td>
			<td class="w2p_fc"></td>
		</tr>
	</table>
	<div style="display:none;">
		<input name="_next" type="hidden" value="/places/default/index" />
		<input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
		<input name="_formname" type="hidden" value="login" />
	</div>
</form>
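
As a quick illustration of the urlencoded encoding described above, urllib converts every non-alphanumeric character into a %XX hexadecimal escape. A minimal sketch (the values are just the ones used in this post; a list of tuples keeps the output order deterministic):

>>> import urllib
>>> # '@' becomes %40 and '/' becomes %2F; letters, digits and '.' pass through
>>> pairs = [('email', '[email protected]'), ('_next', '/places/default/index')]
>>> urllib.urlencode(pairs)
'email=example%40webscraping.com&_next=%2Fplaces%2Fdefault%2Findex'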

1.2 Testing a POST form submission by hand

On a successful login the site redirects to the home page; otherwise it returns to the login page. Below is an initial version of code that attempts an automated login. The login clearly fails!

>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='[email protected]'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

This is because the login also requires the hidden _formkey field. This unique ID is used to prevent the form from being submitted more than once: every page load generates a different ID, and the server then uses the given ID to determine whether the form has already been submitted. The following retrieves that field's value:

>>> 
>>> import lxml.html
>>> def parse_form(html):
...     tree=lxml.html.fromstring(html)
...     data={}
...     for e in tree.cssselect('form input'):
...             if e.get('name'):
...                     data[e.get('name')]=e.get('value')
...     return data
... 
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
 '_formname': 'login',
 '_next': '/places/default/index',
 'email': '',
 'password': '',
 'remember_me': 'on'}
>>> 

Below is a new version of the automated login code that submits _formkey and the other hidden fields. It turns out this still fails!

>>> 
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

This is because we are missing a crucial component: cookies. When a regular user loads the login form, the _formkey value is stored in a cookie, and that value is then compared with the _formkey value in the submitted login form data. Below is the code after adding cookie support via the urllib2.HTTPCookieProcessor class. The login finally succeeds!

>>> 
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> 
>>> html=opener.open(LOGIN_URL).read()		# use the cookie-aware opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request)		# the opener submits the saved cookies
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>> 

1.3 Complete source code for the manual POST login:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '[email protected]'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener

def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

def main():
    #login_basic()
    #login_formkey()
    login_cookies()

if __name__ == '__main__':
    main()
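
For comparison (not part of the original workflow), the same login can be written far more compactly with the third-party requests library, whose Session object stores cookies across requests automatically. A minimal sketch, assuming requests is installed (pip install requests) and reusing parse_form and the constants from above:

import requests

def login_requests():
    """sketch: cookie-aware login via requests.Session"""
    session = requests.Session()           # keeps cookies between requests
    html = session.get(LOGIN_URL).text     # this GET sets the _formkey cookie
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    response = session.post(LOGIN_URL, data=data)  # requests urlencodes for us
    print response.url                     # should end in /places/default/index
    return session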

2. Logging in to the site by loading cookies from the Firefox browser

We first log in by hand in the Firefox browser, then close Firefox, and then reuse the cookies it saved from a Python script, achieving an automatic login.

2.1 Location of the session file

Firefox stores cookies in a SQLite database and sessions in a JSON file, and both can be read directly from Python. For the login we only need to obtain the session. The Firefox session file is stored in a different location depending on the operating system:

  • Linux: ~/.mozilla/firefox/*.default/sessionstore.js
  • OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
  • Windows Vista and later: %APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js

Below is a helper function that returns the path of the session file:

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

Note: the glob module returns all files matching the given path pattern, which is how the *.default wildcard resolves to the randomly named profile directory; a quick sketch follows.
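
For instance (illustrative only; the exact profile name differs per machine, the one shown here is taken from the listing in the next section):

>>> import glob, os
>>> # expanduser resolves '~'; glob then matches the random profile name
>>> glob.glob(os.path.expanduser('~/.mozilla/firefox/*.default'))
['/home/wu_being/.mozilla/firefox/78n340f7.default']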

2.2 Cookies in the Firefox session file

Below are the contents of the Firefox session file on a Linux system:

wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json           datareporting       key3.db             prefs.js                      storage
blocklist.xml         extensions          logins.json         revocations.txt               storage.sqlite
bookmarkbackups       extensions.ini      mimeTypes.rdf       saved-telemetry-pings         times.json
cert8.db              extensions.json     minidumps           search.json.mozlz4            webapps
compatibility.ini     features            permissions.sqlite  secmod.db                     webappsstore.sqlite
containers.json       formhistory.sqlite  places.sqlite       sessionCheckpoints.json       xulstore.json
content-prefs.sqlite  gmp                 places.sqlite-shm   sessionstore-backups
cookies.sqlite        gmp-gmpopenh264     places.sqlite-wal   sessionstore.js
crashes               healthreport        pluginreg.dat       SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js 
{"version":["sessionrestore",1],
"windows":[{
	...
	"cookies":[
		{"host":"127.0.0.1",
		"value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
		"path":"/",
		"name":"session_id_welcome",
		"httponly":true,
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
		{"host":"127.0.0.1",
		"value":"True",
		"path":"/",
		"name":"session_id_places",
		"httponly":true,
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
		{"host":"127.0.0.1",
		"value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
		"path":"/",
		"name":"session_data_places",
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
	],
	"title":"Example web scraping website",
	"_shouldRestore":true,
	"closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}

wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ 

Following this session storage structure, we use the code below to parse the session into a CookieJar object.

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):  
        try: 
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    # cookielib.Cookie takes positional arguments:
                    # (version, name, value, port, port_specified, domain,
                    #  domain_specified, domain_initial_dot, path, path_specified,
                    #  secure, expires, discard, comment, comment_url, rest)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, int(time.time()) + 3600 * 24 * 7,  # expire a week from now
                        False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

2.3 Testing login with the loaded cookies

session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()

tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()

If the result is Log In, the session was not loaded correctly. In that case, check whether you are actually still logged in to the example site in Firefox. If instead the output ends with Welcome followed by the user's first name, as in the second run below, the login succeeded.

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ 
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'True'}
{u'host': u'127.0.0.1',
 u'name': u'session_data_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ 

If you want to load cookies from other browsers, you can use the browsercookie module, installed via pip install browsercookie; docs: https://pypi.python.org/pypi/browsercookie. A brief sketch follows.
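
A minimal sketch of loading live Firefox cookies through browsercookie instead of parsing sessionstore.js by hand (this assumes the module's documented firefox() loader, which returns a cookielib-compatible CookieJar; a chrome() loader also exists):

import urllib2
import browsercookie

cj = browsercookie.firefox()   # read cookies straight from the Firefox profile
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()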

2.4 Source code for the cookie-based login

# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def login_firefox():
    """load cookies from firefox
    """
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()

    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):  
        try: 
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    # cookielib.Cookie takes positional arguments:
                    # (version, name, value, port, port_specified, domain,
                    #  domain_specified, domain_initial_dot, path, path_specified,
                    #  secure, expires, discard, comment, comment_url, rest)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, int(time.time()) + 3600 * 24 * 7,  # expire a week from now
                        False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

def main():
    login_firefox()

if __name__ == '__main__':
    main()

3. Automating form submission with the high-level Mechanize module

The Mechanize module can simplify form submission. Install it first with: pip install mechanize

3.1 Using the high-level Mechanize module to automate form submission and update page content after login

# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def mechanize_edit():
    """Use mechanize to increment population
    """
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()

    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()

    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']

if __name__ == '__main__':
    mechanize_edit()

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3mechanize_edit.py 
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
  <TextControl(email=)>
  <PasswordControl(password=)>
  <CheckboxControl(remember_me=[on])>
  <SubmitControl(<None>=Log In) (readonly)>
  <SubmitButtonControl(<None>=) (readonly)>
  <HiddenControl(_next=/places/default/index) (readonly)>
  <HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
  <HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ 

Docs: http://wwwsearch.sourceforge.net/mechanize/

3.2 Updating page content after login with the plain urllib2 approach

# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'
	
def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)

    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']

if __name__ == '__main__':
    edit_country()

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3edit_country.py 
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
 '_formname': 'places/5087',
 'area': '9596960.00',
 'capital': 'Beijing',
 'continent': 'AS',
 'country': 'China',
 'currency_code': 'CNY',
 'currency_name': 'Yuan Renminbi',
 'id': '5087',
 'iso': 'CN',
 'languages': 'zh-CN,yue,wuu,dta,ug,za',
 'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
 'phone': '86',
 'population': '1330044001',
 'postal_code_format': '######',
 'postal_code_regex': '^(\\d{6})$',
 'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ 

Wu_Being blog notice: reposting this blog is welcome; please credit the original post and link back. Thank you!
【Python Web Scraping series】《【Python Web Scraping 6】Form Interaction》http://blog.csdn.net/u014134180/article/details/55507020
GitHub code files for the Python Web Scraping series: https://github.com/1040003585/WebScrapingWithPython
