Part 1 Basic definitions
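Stateless HTTP protocol: HTTP keeps no state between requests, so the server handles each request independently.
session: a server-side mechanism for keeping state. The client has to carry the session ID with every request it sends.
cookies: a client-side mechanism for keeping state. The information is stored locally, so it can be hijacked, and both the number and the size of cookies are limited. Some state is not suitable for keeping on the server.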
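The relationship between the two can be sketched with the standard library alone. This is a minimal illustration, assuming the server hands out the session ID in a cookie; the header value and the cookie name `sessionid` are hypothetical.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header a server might send after creating a session.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# The client parses the header and stores the cookie locally.
jar = SimpleCookie()
jar.load(set_cookie_header)

# On the next request the client sends the stored value back in a Cookie header,
# which lets the server look up the matching session.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)  # sessionid=abc123
```

Only the ID travels with the request; the state it refers to stays on the server.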
Part 2 Use in web crawlers
# Simulating a Zhihu login with a session
import requests
import http.cookiejar
from bs4 import BeautifulSoup
session = requests.Session()
session.cookies = http.cookiejar.LWPCookieJar("cookie")
agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/5.1.2.3000 Chrome/55.0.2883.75 Safari/537.36'
headers = {
    "Host": "www.zhihu.com",
    "Origin": "https://www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "User-Agent": agent,
}
postdata = {
    'password': '*******',  # fill in your password
    'account': '********',  # fill in your account
}
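# Fetch the home page first to read the _xsrf token from its hidden input
response = session.get("https://www.zhihu.com", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
postdata['_xsrf'] = xsrf
# Use HTTPS here as well, matching the GET above
result = session.post('https://www.zhihu.com/login/email', data=postdata, headers=headers)
# Write the session cookies to the "cookie" file so they can be reused later
session.cookies.save(ignore_discard=True, ignore_expires=True)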
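Saving the cookies only pays off if a later run loads them again. The sketch below shows the save and load round trip with `http.cookiejar.LWPCookieJar` alone; the cookie name, value, temporary file path, and the `make_cookie` helper are hypothetical, and a real run would get its cookies from the login response instead.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a cookie by hand; a real login would receive it from the server."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_file = os.path.join(tempfile.mkdtemp(), "cookie")

# First run: store a session cookie and write it to disk.
saved = http.cookiejar.LWPCookieJar(cookie_file)
saved.set_cookie(make_cookie("session_id", "hypothetical-token", "www.zhihu.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Later run: load the file into a fresh jar instead of logging in again.
restored = http.cookiejar.LWPCookieJar(cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)

names = [c.name for c in restored]
print(names)  # ['session_id']
```

With requests, the same `load(...)` call would presumably be made on the jar assigned to `session.cookies` before sending any request. The `ignore_discard=True` flag matters because session cookies are marked as discardable and are skipped otherwise.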
# Simulating a GitHub login
import requests
from lxml import html
LOGIN_URL = 'https://github.com/login'
SESSION_URL = 'https://github.com/session'
s = requests.session()
r = s.get(LOGIN_URL)
tree = html.fromstring(r.text)
el = tree.xpath('//input[@name="authenticity_token"]')[0]
authenticity_token = el.attrib['value']
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'orzrd',
    'password': 'qwe123!@#'
}
r = s.post(SESSION_URL, data=data)
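Both login examples follow the same pattern: fetch the login page, read a hidden token from the form, and post it back together with the credentials. As a rough sketch of that extraction step without lxml or BeautifulSoup, the standard library's `html.parser` can collect every hidden input at once. The `HiddenInputParser` class and the sample HTML are hypothetical and do not reflect GitHub's real markup.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Hypothetical login page fragment used only for illustration.
sample_html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok-123">
  <input type="hidden" name="utf8" value="✓">
  <input type="text" name="login">
</form>
"""

parser = HiddenInputParser()
parser.feed(sample_html)

# Merge the hidden fields with placeholder credentials to form the POST body.
data = dict(parser.fields, login="your-login", password="your-password")
print(data["authenticity_token"])  # tok-123
```

Collecting all hidden fields, rather than hard-coding each name, keeps the form data in step with the page if the site adds another hidden input.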
# Maintaining the session
# This part of the code comes from https://blog.csdn.net/zinczhang/article/details/80234217
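from selenium import webdriver
browser = webdriver.Chrome()
browser.get('url')  # 'url' is a placeholder for the page that sets the cookies
cookie_list = browser.get_cookies()
# Collect the browser cookies into a plain dict of name -> value
cookie_dict = {}
for cookie in cookie_list:
    cookie_dict[cookie['name']] = cookie['value']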
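# Sending requests with bare requests calls makes cookies hard to maintain, and cookies may expire over time, so use requests.session() to get a session instance that maintains the cookies.
# However, requests can only keep cookies of the cookiejar type, while the cookies we built by hand are a dict, so the dict has to be converted to a cookiejar.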
cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
session = requests.session()
session.cookies = cookies
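Converting through a plain dict keeps only each cookie's name and value, so domain, path, and expiry are dropped. As an alternative sketch (not the method used in the code above), the cookie dicts can be turned into a standard `http.cookiejar.CookieJar` that preserves those fields. The `selenium_cookies_to_jar` helper and the sample data are hypothetical; the keys are assumed to match what `browser.get_cookies()` returns.

```python
from http.cookiejar import Cookie, CookieJar


def selenium_cookies_to_jar(cookie_list):
    """Convert Selenium-style cookie dicts into a CookieJar, keeping domain and path."""
    jar = CookieJar()
    for item in cookie_list:
        domain = item.get("domain", "")
        expiry = item.get("expiry")  # absent for session cookies
        jar.set_cookie(Cookie(
            version=0, name=item["name"], value=item["value"],
            port=None, port_specified=False,
            domain=domain, domain_specified=bool(domain),
            domain_initial_dot=domain.startswith("."),
            path=item.get("path", "/"), path_specified=True,
            secure=item.get("secure", False),
            expires=expiry, discard=expiry is None,
            comment=None, comment_url=None, rest={},
        ))
    return jar


# Hypothetical data shaped like the output of browser.get_cookies().
cookie_list = [
    {"name": "sid", "value": "abc", "domain": ".example.com", "path": "/", "secure": True},
    {"name": "theme", "value": "dark", "domain": "example.com", "path": "/", "expiry": 4102444800},
]

jar = selenium_cookies_to_jar(cookie_list)
print(sorted(c.name for c in jar))  # ['sid', 'theme']
```

The resulting jar could then be assigned to `session.cookies` in the same way as above. That assignment is an assumption worth checking against the installed requests version.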