Why simulate a login?

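Python web crawlers are widely used, but some pages only show their content to logged-in users. A crawler therefore has to imitate the user's login and keep the login state afterwards, so that it can go on to fetch the other pages of the site that require it.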

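Saving the login state

After a simulated login, the login state can be kept in one of two ways: with a Session object from requests, or with a cookie jar from the standard library.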

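1. Using a Session

Import the requests module:

import requests

Request pages through a requests Session:

s = requests.Session()
# url and headers are the login address and request headers built in the walkthrough further down
r = s.post(url, headers=headers)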
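As a minimal offline sketch of why a Session keeps you logged in: cookies stored on the session are attached to every later request made through it. The cookie name `sessionid`, its value, and the example.com URL are made up for illustration; in real use the server sets the cookie in its login response.

```python
import requests

s = requests.Session()

# Hypothetical cookie; normally the server sets it via Set-Cookie after login.
s.cookies.set('sessionid', 'abc123')

# Build (but do not send) a request to see what the session would attach.
req = requests.Request('GET', 'https://example.com/profile')
prepared = s.prepare_request(req)

cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

Nothing is sent over the network here; `prepare_request` only shows the headers the session would use.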

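2. Using cookies

import urllib.request, http.cookiejar

Initialize the cookie jar:

cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

Install the opener globally so that urllib.request.urlopen uses it. This is optional: you can also skip it and send requests through the opener directly with opener.open.

urllib.request.install_opener(opener)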
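To see the cookie jar at work without touching a real site, here is a small sketch that talks to a throwaway local server. The server, its `/login` and `/profile` paths, and the `sessionid` cookie are all invented for the demonstration; the empty `ProxyHandler` is an extra assumption added only so that proxy settings in the environment cannot interfere with the local request.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: /login sets a cookie, any other path echoes the Cookie header."""

    def do_GET(self):
        self.send_response(200)
        if self.path == '/login':
            self.send_header('Set-Cookie', 'sessionid=abc123; Path=/')
            body = b'logged in'
        else:
            body = self.headers.get('Cookie', '').encode()
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),  # assumption: bypass any configured proxy
    urllib.request.HTTPCookieProcessor(cookie),
)

opener.open(base + '/login').read()                      # the jar stores the cookie
echoed = opener.open(base + '/profile').read().decode()  # the jar sends it back
stored = {c.name: c.value for c in cookie}

server.shutdown()
server.server_close()
print(stored, echoed)
```

The second request carries the cookie even though the code never sets it by hand, which is the behaviour a login session relies on.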

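Simulated login in practice

We use Douban as the example: simulate a user login, then crawl the page shown to the logged-in user.

(1) Find the login form

A login is usually a POST request whose payload is a form. To log in successfully we need to see which fields that form sends and then build the same form for our own POST request. To find them, open the browser's developer tools (right-click the page and choose Inspect), enter the account and password, click the login button, and look at the login request in the Network tab (Chrome is recommended). The form data is listed there, along with the request URL.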

[Screenshot: form data of the login request]

[Screenshot: request URL]

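(2) Build the form

The form's keys, and the values that are static, can be read from the page source (right-click the page and view its source). Some fields are dynamic and have to be fetched at run time, see step (4).

formdata = {
    'redir': 'https://www.douban.com',
    'form_email': 'your account',      # placeholder: replace with your own account
    'form_password': 'your password',  # placeholder: replace with your own password
    'login': u'登陸'                    # value of the page's submit button
}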
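For a sense of what this dictionary turns into on the wire, the sketch below encodes it as an `application/x-www-form-urlencoded` body with the standard library. requests does an equivalent encoding itself when the dictionary is passed as `data=`, so this step is for inspection only. The account and password values are placeholders.

```python
from urllib.parse import parse_qs, urlencode

formdata = {
    'redir': 'https://www.douban.com',
    'form_email': 'user@example.com',  # placeholder account
    'form_password': 'secret',         # placeholder password
    'login': u'登陸',
}

# Percent-encode every key and value and join the pairs with '&'.
body = urlencode(formdata)
print(body)

# Decoding the body gives back the original values (each wrapped in a list).
decoded = parse_qs(body)
print(decoded['login'])
```

Comparing the printed body with the form data shown in the browser's Network tab is a quick way to spot a missing or misspelled field.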

(3) Pose as a browser

All this takes is a browser-like User-Agent header on the request:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/55.0.2883.87 Safari/537.36'}
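The reason the header matters is that HTTP libraries announce themselves by default, which makes a script easy to tell apart from a browser. The sketch below uses urllib from the standard library to show its default User-Agent next to the browser-like one; requests likewise identifies itself with its own name and version unless told otherwise. No request is actually sent.

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/55.0.2883.87 Safari/537.36'}

# What urllib announces when no User-Agent is supplied.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']

# Build (but do not send) a request carrying the browser-like header.
req = urllib.request.Request('https://www.douban.com', headers=headers)
custom_ua = req.get_header('User-agent')  # urllib stores the key as 'User-agent'

print(default_ua)
print(custom_ua)
```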

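(4) Get the captcha

The form from step (2) is still incomplete: two captcha-related fields are missing. Their values change from request to request, so they have to be fetched at run time.

import re
from bs4 import BeautifulSoup

r = s.post(url_login, headers=headers)
content = r.text
soup = BeautifulSoup(content, 'html.parser')
captcha = soup.find('img', id='captcha_image')  # present only when the login requires a captcha
if captcha:
    captcha_url = captcha['src']
    re_captcha_id = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
    captcha_id = re.findall(re_captcha_id, content)  # a list; requests sends a one-element list as a single field
    print(captcha_id)
    print(captcha_url)  # print the captcha URL so the image can be opened in a browser
    captcha_text = input('Please input the captcha:')  # type in the captcha by hand
    formdata['captcha-solution'] = captcha_text  # add both captcha fields to the form
    formdata['captcha-id'] = captcha_id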
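The extraction step can be tried offline on a small piece of HTML. The fragment below is a made-up stand-in for the login page (the real markup may differ), and it uses only `re` so that it runs without extra packages. It also shows that `re.findall` returns a list, so take the first element if you want the captcha id as a plain string.

```python
import re

# Hypothetical fragment of a login page that asks for a captcha.
sample_html = '''
<img id="captcha_image" src="https://example.com/captcha?id=abc123:en&size=s" alt="captcha"/>
<input type="hidden" name="captcha-id" value="abc123:en"/>
'''

re_captcha_id = r'<input type="hidden" name="captcha-id" value="(.*?)"/'

matches = re.findall(re_captcha_id, sample_html)  # list of every match
captcha_id = matches[0] if matches else None      # plain string, or None if absent

# Same idea for the image address that has to be shown to the user.
src_match = re.search(r'<img id="captcha_image" src="(.*?)"', sample_html)
captcha_url = src_match.group(1) if src_match else None

print(matches, captcha_id, captcha_url)
```

Regular expressions like these depend on the exact attribute order in the page, so they need checking against the live markup before being relied on.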

(5) Log in

r = s.post(url_login, data=formdata, headers=headers)  # pass the form as data to log in

Complete code

# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

s = requests.Session()
url_login = 'https://accounts.douban.com/login'

formdata = {
    'redir': 'https://www.douban.com',
    'form_email': 'your account',      # placeholder: replace with your own account
    'form_password': 'your password',  # placeholder: replace with your own password
    'login': u'登陸'                    # value of the page's submit button
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/55.0.2883.87 Safari/537.36'}

# First attempt: succeeds directly if the site does not ask for a captcha.
r = s.post(url_login, data=formdata, headers=headers)
content = r.text
soup = BeautifulSoup(content, 'html.parser')
captcha = soup.find('img', id='captcha_image')  # present only when the login requires a captcha
if captcha:
    captcha_url = captcha['src']
    re_captcha_id = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
    captcha_id = re.findall(re_captcha_id, content)
    print(captcha_id)
    print(captcha_url)
    captcha_text = input('Please input the captcha:')
    formdata['captcha-solution'] = captcha_text
    formdata['captcha-id'] = captcha_id
    # Second attempt, this time with the captcha fields filled in.
    r = s.post(url_login, data=formdata, headers=headers)

# Save the page returned after login so it can be inspected.
with open('contacts.html', 'w+', encoding='utf-8') as f:
    f.write(r.text)

Result

The login succeeds.

