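This article introduces the main functions of the urllib library.

urllib can simulate a browser to send network requests, and it can save the data that the server returns. It consists of the following modules:

Module	Description
urllib.request	Opens and reads URLs
urllib.error	Contains the exceptions raised by urllib.request
urllib.parse	Parses URLs
urllib.robotparser	Parses robots.txt files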
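
Of the modules above, urllib.robotparser is not used in the rest of this article, so here is a minimal offline sketch of it. The robots.txt rules and the example.com URLs are made up for illustration; a real crawler would normally point the parser at a site with set_url() and read() instead of passing lines to parse().

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied as lines so no network is needed.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed_public = parser.can_fetch("*", "http://example.com/index.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)  # True False
```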

Urllib

Common functions

The most frequently used functions in the urllib library are:

urlopen

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

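This function issues a URL request. Its main parameters are:

url: the URL to request
data: the data to send with the request, typically a bytes object; if this parameter is set, the request becomes a POST request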
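
The effect of data can be checked without touching the network. When urlopen receives a URL string it wraps it in a request.Request object, so the sketch below builds the Request directly and asks it which HTTP method it would use. The endpoint and form fields are hypothetical.

```python
from urllib import parse, request

# Hypothetical endpoint; nothing is sent because urlopen() is never called.
url = "http://example.com/login"

# urlencode() returns str, so encode it to bytes before using it as data.
form = parse.urlencode({"user": "zhangsan", "lang": "python"}).encode("utf-8")

get_req = request.Request(url)              # no data: GET
post_req = request.Request(url, data=form)  # data set: POST

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'user=zhangsan&lang=python'
```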

For an http/https URL, the function's docstring describes the return value as follows:

This function always returns an object which can work as a context
manager and has methods such as

* geturl() - return the URL of the resource retrieved, commonly used to
  determine if a redirect was followed

* info() - return the meta-information of the page, such as headers, in the
  form of an email.message_from_string() instance (see Quick Reference to
  HTTP Headers)

* getcode() - return the HTTP status code of the response.  Raises URLError
  on errors.

For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
object slightly modified. In addition to the three new methods above, the
msg attribute contains the same information as the reason attribute ---
the reason phrase returned by the server --- instead of the response
headers as it is specified in the documentation for HTTPResponse.

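In other words, the return value in this case is an http.client.HTTPResponse object. HTTPResponse is a class defined in the http.client module of Python's standard library, so its methods, such as read(), readline(), readlines(), and getcode(), can be called on the result.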

from urllib import request

response = request.urlopen('https://www.baidu.com/')
print(type(response))
print(response.read())

The result is:

<class 'http.client.HTTPResponse'>
b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

The b in front of the output of response.read() marks a bytes literal: read() returns bytes, not str.
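
To work with the body as text, the bytes have to be decoded. The sketch below assumes the body is UTF-8 encoded and uses a short byte string (the title text of a Baidu page) in place of a live response. In real code the charset could instead be read from the response headers, for example with response.headers.get_content_charset().

```python
# Fragment of a UTF-8 encoded response body, used here instead of a live response.
raw = b"\xe7\x99\xbe\xe5\xba\xa6\xe5\xae\x89\xe5\x85\xa8\xe9\xaa\x8c\xe8\xaf\x81"

text = raw.decode("utf-8")  # bytes -> str

print(type(raw).__name__, len(raw))    # bytes 18
print(type(text).__name__, len(text))  # str 6
print(text)
```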

request.Request

The constructor of this class is:

def __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None):

The headers parameter sets the request headers, which lets a crawler present itself as a browser.

from urllib import request

url = 'http://www.baidu.com/s?wd=python'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
req = request.Request(url,headers=headers)
response = request.urlopen(req)
print(response.read())

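The result is shown below; in this run Baidu answered with its security-verification page rather than search results: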

b'<!DOCTYPE html>\n<html lang="zh-CN">\n<head>\n    <meta charset="utf-8">\n    <title>\xe7\x99\xbe\xe5\xba\xa6\xe5\xae\x89\xe5\x85\xa8\xe9\xaa\x8c\xe8\xaf\x81</title>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-status-bar-style" content="black">\n    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">\n    <meta name="format-detection" content="telephone=no, email=no">\n    <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">\n    <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">\n    <link rel="stylesheet" href="https://wappass.bdimg.com/static/touch/css/api/mkdjump_8befa48.css" />\n</head>\n<body>\n    <div class="timeout hide">\n        <div class="timeout-img"></div>\n        <div class="timeout-title">\xe7\xbd\x91\xe7\xbb\x9c\xe4\xb8\x8d\xe7\xbb\x99\xe5\x8a\x9b\xef\xbc\x8c\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e\xe9\x87\x8d\xe8\xaf\x95</div>\n        <button type="button" class="timeout-button">\xe8\xbf\x94\xe5\x9b\x9e\xe9\xa6\x96\xe9\xa1\xb5</button>\n    </div>\n    <div class="timeout-feedback hide">\n        <div class="timeout-feedback-icon"></div>\n        <p class="timeout-feedback-title">\xe9\x97\xae\xe9\xa2\x98\xe5\x8f\x8d\xe9\xa6\x88</p>\n    </div>\n\n<script src="https://wappass.baidu.com/static/machine/js/api/mkd.js"></script>\n<script src="https://wappass.bdimg.com/static/touch/js/mkdjump_6003cf3.js"></script>\n</body>\n</html><!--25127207760471555082051323-->\n<script> var _trace_page_logid = 2512720776; </script>'
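
A Request object can be inspected before it is sent, which helps confirm that the headers are set as intended. This is a small offline sketch; the User-Agent string is a shortened, made-up value.

```python
from urllib import request

# Shortened, made-up User-Agent; any browser string works the same way.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"}
req = request.Request("http://www.baidu.com/s?wd=python", headers=headers)

# Headers can also be added after construction.
req.add_header("Referer", "http://www.baidu.com/")

# Request stores header names via str.capitalize(), e.g. "User-agent".
print(req.header_items())
print(req.get_header("User-agent"))
print(req.has_header("Referer"))  # True
```

Because of that capitalization, get_header("User-Agent") returns None while get_header("User-agent") returns the value.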

urlretrieve

def urlretrieve(url, filename=None, reporthook=None, data=None):

This function saves the resource at the requested URL to a local file named filename. It returns a tuple (filename, headers).

from urllib import request

request.urlretrieve('http://www.baidu.com/','saved.html')
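
The return value can be seen in a sketch that runs without network access by retrieving a file:// URL. The file names and temporary directory are only for illustration. The Python documentation lists urlretrieve under its legacy interface, so treat this as a demonstration of the call rather than a recommendation.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local "page" and address it with a file:// URL.
    source = Path(tmp) / "page.html"
    source.write_text("<html><body>hello</body></html>", encoding="utf-8")
    target = Path(tmp) / "saved.html"

    # urlretrieve returns the local file name and the response headers.
    saved_name, headers = request.urlretrieve(source.as_uri(), str(target))

    copied = target.read_text(encoding="utf-8")
    print(saved_name == str(target))  # True
    print(copied)
```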

urlencode

def urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus):

The official docstring describes the function as follows:

Encode a dict or sequence of two-element tuples into a URL query string.

In other words, the function encodes a dict, or a sequence of two-element tuples, into a URL query string.

from urllib import request,parse

di = {'名字':'zhangsan',
      '性別':'男'}
di_encode = parse.urlencode(di)
print(di_encode)

The result is:

%E5%90%8D%E5%AD%97=zhangsan&%E6%80%A7%E5%88%A5=%E7%94%B7
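
The remaining parameters in the signature change how the query string is built. The following sketch shows a few of them on made-up keys and values.

```python
from urllib import parse

# A sequence of two-element tuples keeps its order; values are converted with str().
from_pairs = parse.urlencode([("wd", "python"), ("page", 2)])
print(from_pairs)  # wd=python&page=2

# By default a space becomes "+", because quote_via defaults to quote_plus.
plus_style = parse.urlencode({"q": "hello world"})
print(plus_style)  # q=hello+world

# quote_via=parse.quote produces "%20" instead.
percent_style = parse.urlencode({"q": "hello world"}, quote_via=parse.quote)
print(percent_style)  # q=hello%20world

# doseq=True expands a list value into one key=value pair per element.
repeated = parse.urlencode({"tag": ["a", "b"]}, doseq=True)
print(repeated)  # tag=a&tag=b
```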

parse_qs

def parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):

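Where there is encoding there is also decoding: this function can be viewed as the inverse of urlencode, except that every value in the returned dict is a list. The encoding parameter of parse_qs defaults to utf-8.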

from urllib import request,parse

di = {'名字':'zhangsan',
      '性別':'男'}
di_encode = parse.urlencode(di)
print(di_encode)
di_qs = parse.parse_qs(di_encode)
print(di_qs)

The result is:

%E5%90%8D%E5%AD%97=zhangsan&%E6%80%A7%E5%88%A5=%E7%94%B7
{'名字': ['zhangsan'], '性別': ['男']}
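
The values are lists because a key may appear more than once in a query string. A short sketch with a made-up query string shows this, together with keep_blank_values and the related parse_qsl function, which returns a list of pairs instead of a dict.

```python
from urllib import parse

query = "tag=a&tag=b&q=hello+world&empty="

# Repeated keys are collected into one list; blank values are dropped by default.
as_dict = parse.parse_qs(query)
print(as_dict)  # {'tag': ['a', 'b'], 'q': ['hello world']}

# keep_blank_values=True keeps keys whose value is empty.
with_blank = parse.parse_qs(query, keep_blank_values=True)
print(with_blank["empty"])  # ['']

# parse_qsl preserves the original order as (key, value) pairs.
as_pairs = parse.parse_qsl(query)
print(as_pairs)  # [('tag', 'a'), ('tag', 'b'), ('q', 'hello world')]
```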

urlparse

def urlparse(url, scheme='', allow_fragments=True):

This function parses a URL into the following six components:

<scheme>://<netloc>/<path>;<params>?<query>#<fragment>

The return value is a named tuple (ParseResult) holding those six components.

from urllib import request,parse

url = 'http://www.baidu.com/s?wd=python'
url_parse = parse.urlparse(url)
print(url_parse)

The result is:

ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='', query='wd=python', fragment='')
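
Since the result is a named tuple, each component can be read by name. The sketch below uses a hypothetical URL that fills all six components and also shows the derived hostname and port attributes.

```python
from urllib import parse

# Hypothetical URL that exercises all six components.
url = "http://www.example.com:8080/docs/page;type=a?wd=python#top"
parts = parse.urlparse(url)

print(parts.scheme, parts.netloc, parts.path)     # http www.example.com:8080 /docs/page
print(parts.params, parts.query, parts.fragment)  # type=a wd=python top
print(parts.hostname, parts.port)                 # www.example.com 8080

# geturl() reassembles the components into a URL string.
print(parts.geturl() == url)  # True
```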

urlsplit

def urlsplit(url, scheme='', allow_fragments=True):

This function parses a URL into the following five components:

<scheme>://<netloc>/<path>?<query>#<fragment>

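The return value is a named tuple (SplitResult) holding those five components. Compared with urlparse, this function does not split params out of the path.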

from urllib import request,parse

url = 'http://www.baidu.com/s?wd=python'
url_parse = parse.urlsplit(url)
print(url_parse)

The result is:

SplitResult(scheme='http', netloc='www.baidu.com', path='/s', query='wd=python', fragment='')
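
The example URL above has no params part, so both functions appear to agree. The difference becomes visible with a URL that does contain one; the URL below is hypothetical.

```python
from urllib import parse

url = "http://www.example.com/docs/page;type=a?wd=python#top"

six = parse.urlparse(url)
five = parse.urlsplit(url)

print(six.path, six.params)  # /docs/page type=a
print(five.path)             # /docs/page;type=a
print(len(six), len(five))   # 6 5

# urlunsplit is the inverse of urlsplit.
print(parse.urlunsplit(five) == url)  # True
```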

ProxyHandler

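Some websites use anti-crawler mechanisms that monitor the requests coming from each IP address. If the traffic from an address looks abnormal, access from that IP is restricted. When building a crawler, a proxy can be configured to avoid this problem.

urllib uses ProxyHandler to configure a proxy server: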

from urllib import request

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

# no proxy
response = request.urlopen('http://httpbin.org/ip')
print(response.read())

# using a proxy
proxy = request.ProxyHandler({"http" : "125.126.120.169:60004"})
opener = request.build_opener(proxy)
req = request.Request('http://httpbin.org/ip',headers=headers)
response = opener.open(req)
print(response.read())

Sending a GET request to httpbin.org/ip returns the public IP address of the requesting host, so the output of the code above is:

b'{\n  "origin": "223.90.237.229"\n}\n'
b'{\n  "origin": "125.126.120.169"\n}\n'

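ProxyHandler is a class; constructing an instance requires a dict that maps each protocol to a proxy address.

One interesting observation: when the code above is run while connected to a VPN, both requests print the same IP address, namely the external one.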
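
A proxy address may be unreachable, so it can help to wrap the request in error handling. The following is a minimal sketch, not a definitive implementation: PROXY is a placeholder address and fetch is a hypothetical helper name. Nothing is sent unless fetch is called.

```python
from urllib import error, request

# Placeholder proxy address (assumption); substitute a proxy you are allowed to use.
PROXY = "http://127.0.0.1:8080"

proxy_handler = request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = request.build_opener(proxy_handler)


def fetch(url, timeout=5):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except error.URLError as exc:  # also covers HTTPError, a subclass
        print("request failed:", exc.reason)
        return None


# The handler keeps the mapping it was given.
print(proxy_handler.proxies)
```

A call such as fetch('http://httpbin.org/ip') would then go through the configured proxy.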

Cookie

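In Chrome, the cookies saved by the browser can be viewed under Settings -> Advanced -> Site Settings -> Cookies.
HTTP/HTTPS requests sent to a server are normally stateless, so a logged-in user would otherwise have to supply their login credentials again with every request. That would seriously hurt the user experience, and cookies exist to solve this problem.
After the first login, the server sends some data (the cookie) to the browser, which stores it locally. When the user sends another request to the same server, the browser includes the stored cookie, so the login information does not have to be entered again.
Not all information can be stored as a cookie, though. The amount of data a cookie can hold is limited; the limit differs between browsers but generally does not exceed 4 KB.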

Cookie format

Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

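NAME: the name of the cookie.
VALUE: the value of the cookie.
Expires/Max-age: when the cookie expires.
Path: the path the cookie applies to.
Domain: the domain the cookie applies to.
SECURE: whether the cookie is sent only over HTTPS.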
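
These fields can be read programmatically with the standard library's http.cookies module, which this article does not otherwise use. The sketch below parses a made-up header value in the format above.

```python
from http.cookies import SimpleCookie

# Made-up header value in the format shown above.
header_value = "sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure"

cookie = SimpleCookie()
cookie.load(header_value)

morsel = cookie["sessionid"]  # one Morsel object per cookie NAME
print(morsel.value)                                         # abc123
print(morsel["max-age"], morsel["path"], morsel["domain"])  # 3600 / example.com
print(bool(morsel["secure"]))                               # True
```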

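Accessing pages while logged in

Sending an http/https request in a logged-in state requires cookie information. There are two ways to handle this:

Use the browser's developer tools (F12) to copy the cookie of a logged-in session and put it into headers
Use a cookie-handling library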

from urllib import request

url = 'https://www.douban.com/'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
           'Referer':'https://www.douban.com/',
           'Cookie':'paste the Cookie string copied from your browser here'}

req = request.Request(url,headers=headers)
response = request.urlopen(req)
with open('saved.html','w',encoding='utf-8') as fp:
    fp.write(response.read().decode('utf-8'))

This approach saves the HTML of the logged-in page directly to a local file. Alternatively, a cookie library can be used to manage the cookies.

http.cookiejar

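The main cookie classes in this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

CookieJar	Manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept in memory, so they are lost once the CookieJar instance is destroyed.
FileCookieJar	Derived from CookieJar. Retrieves cookie information and stores it in a file; it can also read cookies back from a file.
MozillaCookieJar	Derived from FileCookieJar. Creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
LWPCookieJar	Derived from FileCookieJar. Creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format.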
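
The inheritance relationships in the table can be checked directly. This small sketch only inspects the classes and creates an empty in-memory jar.

```python
from http.cookiejar import CookieJar, FileCookieJar, LWPCookieJar, MozillaCookieJar

# Both file formats derive from FileCookieJar, which derives from CookieJar.
for cls in (MozillaCookieJar, LWPCookieJar):
    print(cls.__name__, [base.__name__ for base in cls.__mro__[1:-1]])

jar = CookieJar()  # in-memory only
print(len(jar))    # 0: a new jar holds no cookies
```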

from urllib import request,parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

login_url = "http://www.renren.com/ajaxLogin/login"
target_url = 'http://www.renren.com/880151247/profile'
user_info = {"email": "your-username", "password": "your-password"}

cookie = CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

data = parse.urlencode(user_info).encode('utf-8')
req = request.Request(login_url,data=data,headers=headers)
response = opener.open(req)

req = request.Request(target_url,headers=headers)
response = opener.open(req)
with open('saved.html','w',encoding='utf-8') as fp:
    fp.write(response.read().decode('utf-8'))

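The program above saves the profile page of a particular user. That page cannot be reached without being logged in, which is why the program logs in first and then reuses the same opener, with its cookies, for the second request.

I originally wanted to use the same approach to access my personal page on Douban, but the request fails with a "missing parameter" error, and I have not figured out what went wrong.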

Saving cookies to a local file

The cookies of a web page can also be saved locally with a MozillaCookieJar:

from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

url = "https://www.baidu.com"

cookie = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

req = request.Request(url,headers=headers)
response = opener.open(req)

cookie.save(ignore_discard=True,ignore_expires=True)

This saves the cookies of the visited page to a local file named cookie.txt.

Loading cookies from a local file

Since cookies can be saved to a local file, they can also be loaded back from it:

from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

url = "https://www.baidu.com"

cookie = MozillaCookieJar('cookie.txt')
cookie.load(ignore_discard=True,ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

req = request.Request(url,headers=headers)
response = opener.open(req)

This loads the locally stored cookies into the newly created MozillaCookieJar object.
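
Saving and loading can be tried together without contacting any website. In the sketch below the cookie is built by hand with illustrative values, make_cookie is a hypothetical helper, and the file lives in a temporary directory.

```python
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar
from pathlib import Path


def make_cookie(name, value, domain="example.com"):
    """Build a session cookie by hand (all values are illustrative)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    cookie_file = str(Path(tmp) / "cookie.txt")

    saved = MozillaCookieJar(cookie_file)
    saved.set_cookie(make_cookie("sessionid", "abc123"))
    # A session cookie has discard=True, so ignore_discard=True is needed to write it.
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = MozillaCookieJar(cookie_file)
    loaded.load(ignore_discard=True, ignore_expires=True)
    restored = {c.name: c.value for c in loaded}

print(restored)  # {'sessionid': 'abc123'}
```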

Steps for using a cookieJar object

Create the appropriate CookieJar object
Pass that object to an HTTPCookieProcessor object
Build an opener
Send http/https requests with the opener's open method
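
The steps above can be wrapped in a small helper. This is one possible sketch; build_cookie_opener is a hypothetical name, and the final step is left out so that the code runs offline.

```python
from http.cookiejar import CookieJar
from urllib import request


def build_cookie_opener(jar=None):
    """Return (jar, opener); the opener stores and resends cookies via jar."""
    # Compare against None: an empty CookieJar is falsy because its length is 0.
    jar = jar if jar is not None else CookieJar()  # step 1
    handler = request.HTTPCookieProcessor(jar)     # step 2
    opener = request.build_opener(handler)         # step 3
    return jar, opener


jar, opener = build_cookie_opener()
# Step 4 would be opener.open(url_or_request); it is omitted here to stay offline.
processors = [h for h in opener.handlers
              if isinstance(h, request.HTTPCookieProcessor)]
print(len(processors), processors[0].cookiejar is jar)  # 1 True
```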
