Strictly speaking, this article on form interaction and the next one on CAPTCHA handling are not about web crawlers per se, but about web bots in the broader sense. Using a web bot removes the hurdle that form interaction poses when extracting data.
1. Manually submitting a login form with a POST request
First, manually register an account on the example website. Registration requires a CAPTCHA; the next article covers handling CAPTCHAs.
1.1 Analyzing the form
At the login URL http://127.0.0.1:8000/places/default/user/login we get the form shown below. The login form contains several important parts:
- The action attribute of the form tag sets the URL to which the form data is submitted. Here it is #, i.e. the same URL as the login form itself.
- The enctype attribute of the form tag sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, meaning all non-alphanumeric characters are converted to their hexadecimal ASCII values. For uploading binary files, multipart/form-data is the better choice: it does not encode the input (so efficiency is unaffected) but sends it as multiple parts using the MIME protocol, the same standard used for email transport. Documentation: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- The method attribute of the form tag: here post means the form data is sent to the server in the request body.
- The name attribute of each input tag sets the field name under which that value is submitted to the server.
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
<table>
<tr id="auth_user_email__row">
<td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
<td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_password__row">
<td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
<td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_remember_me__row">
<td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
<td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="submit_record__row">
<td class="w2p_fl"></td><td class="w2p_fw">
<input type="submit" value="Log In" />
<button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
</td>
<td class="w2p_fc"></td>
</tr>
</table>
<div style="display:none;">
<input name="_next" type="hidden" value="/places/default/index" />
<input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
<input name="_formname" type="hidden" value="login" />
</div>
</form>
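As an aside, the name/value pairs of a form like this (including its hidden fields) can also be collected without any third-party library, using only Python's standard library HTML parser; a minimal sketch in Python 3 syntax, fed a trimmed-down copy of the form above:

```python
from html.parser import HTMLParser

class FormInputParser(HTMLParser):
    """Collect the name/value attributes of every <input> tag in the page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if attrs.get('name'):
                self.fields[attrs['name']] = attrs.get('value', '')

html = '''<form action="#" method="post">
<input name="email" type="text" value="" />
<input name="_formkey" type="hidden" value="7b1add4b" />
<input name="_formname" type="hidden" value="login" />
</form>'''

parser = FormInputParser()
parser.feed(html)
print(parser.fields['_formname'])   # → login
```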
1.2 Manually testing form submission with a POST request
On a successful login the site redirects to the home page; otherwise it returns to the login page. Below is a first attempt at automated login. Clearly, the login fails!
>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='[email protected]'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
The login fails because the submission also needs the hidden _formkey field. This unique ID is used to prevent duplicate form submissions: a different ID is generated every time the page loads, and the server uses the submitted ID to decide whether that form instance has already been submitted. The following code retrieves that value:
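To see why a one-time form key blocks duplicate submissions, here is a toy server-side sketch (purely illustrative; the function names and the set-based store are assumptions, not the example site's actual implementation):

```python
import uuid

issued_keys = set()

def render_login_form():
    """Server side: issue a fresh key each time the form page is rendered."""
    key = str(uuid.uuid4())
    issued_keys.add(key)
    return key

def handle_submit(formkey):
    """Server side: each key is valid exactly once; replays are rejected."""
    if formkey in issued_keys:
        issued_keys.discard(formkey)
        return 'accepted'
    return 'rejected'

key = render_login_form()
print(handle_submit(key))   # → accepted
print(handle_submit(key))   # → rejected (same key submitted twice)
```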
>>>
>>> import lxml.html
>>> def parse_form(html):
...     tree = lxml.html.fromstring(html)
...     data = {}
...     for e in tree.cssselect('form input'):
...         if e.get('name'):
...             data[e.get('name')] = e.get('value')
...     return data
...
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
'_formname': 'login',
'_next': '/places/default/index',
'email': '',
'password': '',
'remember_me': 'on'}
>>>
Below is a new version of the automated login that includes _formkey and the other hidden fields. It still fails!
>>>
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
That is because we are missing one vital component: cookies. When a regular user loads the login form, the _formkey value is saved in a cookie, and that value is compared against the _formkey value submitted with the login form data. Below is the code with cookie support added via the urllib2.HTTPCookieProcessor class. Finally, the login succeeds!
>>>
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>>
>>> html=opener.open(LOGIN_URL).read() #opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request) #opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>>
1.3 Complete source code for manual POST-request login
# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '[email protected]'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'

def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener

def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

def main():
    #login_basic()
    #login_formkey()
    login_cookies()

if __name__ == '__main__':
    main()
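Note that the code above targets Python 2, where urllib, urllib2 and cookielib are separate modules. Under Python 3 these became urllib.parse, urllib.request and http.cookiejar; a rough equivalent of the working login_cookies flow could be sketched as follows (untested against the example site; the stdlib html.parser is used in place of lxml here purely so the sketch has no third-party dependencies):

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser

LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'

def parse_form(html):
    """Collect the name/value pairs of all <input> elements in the page."""
    fields = {}
    class _InputCollector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == 'input':
                d = dict(attrs)
                if d.get('name'):
                    fields[d['name']] = d.get('value', '')
    _InputCollector().feed(html)
    return fields

def login_cookies(email, password):
    # The CookieJar keeps the session cookie issued with the form page,
    # so the submitted _formkey matches the value stored server-side.
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read().decode('utf-8')
    data = parse_form(html)
    data['email'] = email
    data['password'] = password
    # In Python 3 the request body must be bytes, not str
    body = urllib.parse.urlencode(data).encode('utf-8')
    response = opener.open(LOGIN_URL, body)
    return opener, response.geturl()
```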
2. Logging in with cookies loaded from the Firefox browser
First we log in manually in the Firefox browser, then close Firefox, and then reuse the cookies it saved from a Python script, achieving an automated login.
2.1 Location of the session file
Firefox stores cookies in a SQLite database and sessions in a JSON file, and both can be read directly from Python. For logging in, we only need the session data. The location of Firefox's session file depends on the operating system:
- Linux: ~/.mozilla/firefox/*.default/sessionstore.js
- OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista and later: %APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js
Below is a helper function that returns the path to the session file:
def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]
Note: the glob module returns all files matching the given path pattern.
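A quick self-contained demonstration of the pattern matching glob performs, using a throwaway temporary directory instead of a real Firefox profile:

```python
import glob
import os
import tempfile

# Create a fake profile directory matching the '*.default' pattern
root = tempfile.mkdtemp()
profile = os.path.join(root, '78n340f7.default')
os.makedirs(profile)
open(os.path.join(profile, 'sessionstore.js'), 'w').close()

# glob expands the '*' wildcard against the filesystem
matches = glob.glob(os.path.join(root, '*.default', 'sessionstore.js'))
print(len(matches))   # → 1
```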
2.2 Cookie contents in the Firefox browser
Below are the contents of the Firefox session file on a Linux system:
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json datareporting key3.db prefs.js storage
blocklist.xml extensions logins.json revocations.txt storage.sqlite
bookmarkbackups extensions.ini mimeTypes.rdf saved-telemetry-pings times.json
cert8.db extensions.json minidumps search.json.mozlz4 webapps
compatibility.ini features permissions.sqlite secmod.db webappsstore.sqlite
containers.json formhistory.sqlite places.sqlite sessionCheckpoints.json xulstore.json
content-prefs.sqlite gmp places.sqlite-shm sessionstore-backups
cookies.sqlite gmp-gmpopenh264 places.sqlite-wal sessionstore.js
crashes healthreport pluginreg.dat SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js
{"version":["sessionrestore",1],
"windows":[{
...
"cookies":[
{"host":"127.0.0.1",
"value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
"path":"/",
"name":"session_id_welcome",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"True",
"path":"/",
"name":"session_id_places",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
"path":"/",
"name":"session_data_places",
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
],
"title":"Example web scraping website",
"_shouldRestore":true,
"closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$
Based on this session storage structure, we use the following code to parse the session into a CookieJar object.
def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj
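The Cookie constructor above takes a long run of positional arguments, which is easy to get wrong. Here is a small annotated sketch of the same call (written against Python 3's http.cookiejar, the renamed cookielib; the name and value are made-up sample data):

```python
import http.cookiejar
import time

cj = http.cookiejar.CookieJar()
c = http.cookiejar.Cookie(
    0, 'session_id_welcome', '127.0.0.1-abc',  # version, name, value
    None, False,                               # port, port_specified
    '127.0.0.1', False, False,                 # domain, domain_specified, domain_initial_dot
    '/', False,                                # path, path_specified
    False,                                     # secure
    int(time.time()) + 3600 * 24 * 7,          # expires (one week from now)
    False, None, None, {})                     # discard, comment, comment_url, rest
cj.set_cookie(c)
print(len(cj))   # → 1
```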
2.3 Testing login with the loaded cookies
session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()
tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()
If the result is Login, the cookies were not loaded correctly. In that case, check whether you have actually logged in to the example website in Firefox. If instead you get the output below, showing Welcome followed by the user's first name, the login succeeded.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_welcome',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 2login_firefox.py
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_welcome',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_places',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'True'}
{u'host': u'127.0.0.1',
u'name': u'session_data_places',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
If you want to load cookies from other browsers, you can use the browsercookie module. Install it with pip install browsercookie; documentation: https://pypi.python.org/pypi/browsercookie
2.4 Source code for cookie-based login
# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def login_firefox():
    """load cookies from firefox
    """
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()
    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

def main():
    login_firefox()

if __name__ == '__main__':
    main()
3. Automating form submission with the high-level Mechanize module
The Mechanize module simplifies form submission. Install it first with: pip install mechanize
3.1 Using the high-level Mechanize module to automate form submission and update page content after login
# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def mechanize_edit():
    """Use mechanize to increment population
    """
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()
    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()
    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']

if __name__ == '__main__':
    mechanize_edit()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3mechanize_edit.py
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
<TextControl(email=)>
<PasswordControl(password=)>
<CheckboxControl(remember_me=[on])>
<SubmitControl(<None>=Log In) (readonly)>
<SubmitButtonControl(<None>=) (readonly)>
<HiddenControl(_next=/places/default/index) (readonly)>
<HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
<HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
Documentation: http://wwwsearch.sourceforge.net/mechanize/
3.2 Updating page content after login with the plain approach
# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']

if __name__ == '__main__':
    edit_country()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$ python 3edit_country.py
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
'_formname': 'places/5087',
'area': '9596960.00',
'capital': 'Beijing',
'continent': 'AS',
'country': 'China',
'currency_code': 'CNY',
'currency_name': 'Yuan Renminbi',
'id': '5087',
'iso': 'CN',
'languages': 'zh-CN,yue,wuu,dta,ug,za',
'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
'phone': '86',
'population': '1330044001',
'postal_code_format': '######',
'postal_code_regex': '^(\\d{6})$',
'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表單交互$
Wu_Being blog notice: reposts of this blog are welcome; please credit the original post and link back to it. Thanks!
Python crawler series: "[Python Crawler 6] Form Interaction" http://blog.csdn.net/u014134180/article/details/55507020
GitHub code for the Python crawler series: https://github.com/1040003585/WebScrapingWithPython