urllib.request

I. Definition

urllib is Python's built-in HTTP request library.

Official documentation: https://docs.python.org/3/library/urllib.html

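II. The four urllib modules

1 request module

request is the most basic HTTP request module and is used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library's functions to reproduce that process.

2 error module

The exception-handling module. If a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.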
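
As a rough illustration of catching request errors, the sketch below tells HTTPError (the server replied with an error status) apart from the more general URLError (no usable reply). The local 404 server, the describe() helper, and the empty ProxyHandler are assumptions made only so the example runs without internet access.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Throwaway local server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404, 'Not Found')

    def log_message(self, *args):
        pass  # keep the demo output quiet


def describe(url, opener):
    """Return a short description of how the request ended."""
    try:
        with opener.open(url, timeout=5) as response:
            return f'ok {response.status}'
    except error.HTTPError as e:
        # The server replied, but with an error status code
        return f'HTTPError {e.code}'
    except error.URLError as e:
        # No usable reply at all (DNS failure, refused connection, ...)
        return f'URLError {e.reason}'


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# An opener with an empty ProxyHandler ignores any system proxy settings
opener = request.build_opener(request.ProxyHandler({}))
result = describe(f'http://127.0.0.1:{server.server_port}/missing', opener)
server.shutdown()
server.server_close()
print(result)
```

HTTPError is a subclass of URLError, so its except clause has to come first; otherwise the URLError clause would catch both.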

3 parse module

A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
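
A small sketch of the splitting, parsing, and joining mentioned above. The example.com URL is made up for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlparse, urlunparse, urljoin, parse_qs

url = 'https://example.com/search/index.html?q=python&page=2#results'

# Split the URL into its six components
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)
print(parts.query, parts.fragment)

# Parse the query string into a dict that maps each key to a list of values
query = parse_qs(parts.query)

# Reassemble the components into a URL string
rebuilt = urlunparse(parts)

# Resolve a relative link against the base URL
joined = urljoin(url, 'about.html')
print(query, rebuilt, joined)
```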

4 robotparser module

Mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. In practice it is used relatively rarely.
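
A minimal sketch of how the crawl check might look. The robots.txt rules here are hypothetical and supplied inline, so the example needs no network access; against a real site you would normally call set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a crawler with the given user agent may fetch each URL
can_public = parser.can_fetch('*', 'https://example.com/index.html')
can_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(can_public, can_private)
```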

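III. Sending requests

1. Sending a request with urlopen()

response.read()

Returns the page source as bytes.

type(response)

Returns the type of the response object.

response.getheaders()

Returns the response headers.

response.getheader("Server")

Given the argument Server, returns only the value of the Server header.

bytes(urllib.parse.urlencode({'key': 'value'}), encoding='utf-8')

Converts data from string form to bytes form.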
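
To see the response methods listed above in action without depending on an external site, the sketch below starts a throwaway local server and queries it with urlopen(). The HelloHandler class and the install_opener() call that bypasses system proxies are assumptions made for the demo only.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Throwaway local server that returns a tiny HTML page."""

    def do_GET(self):
        body = b'<html>hello</html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore any system proxy so urlopen() talks to the local server directly
request.install_opener(request.build_opener(request.ProxyHandler({})))

with request.urlopen(f'http://127.0.0.1:{server.server_port}/', timeout=5) as response:
    response_type = type(response).__name__        # class of the response object
    status = response.status                       # HTTP status code
    headers = response.getheaders()                # list of (name, value) tuples
    content_type = response.getheader('Content-Type')
    html = response.read().decode('utf-8')         # body bytes decoded to str

server.shutdown()
server.server_close()
print(response_type, status, content_type, html)
```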

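Parameters

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

1 data parameter

This parameter is optional. If you pass it, it must be bytes; content that is not already bytes has to be converted, for example with bytes(). In addition, once this parameter is passed, the request method is no longer GET but POST.
Here is an example: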

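import urllib.parse
import urllib.request

# Encode the form fields as a str, then convert the str to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Passing data makes urlopen() send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())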

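Here we pass one parameter, word, with the value hello. It has to be converted to bytes. The conversion uses bytes(), whose first argument must be a str, so the parameter dictionary is first turned into a string with urlencode() from the urllib.parse module. The chain is: dict -> str -> bytes, via urlencode() and then bytes().

The second argument specifies the encoding, here utf-8.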
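
The two conversion steps can be inspected separately. The sketch below also builds Request objects (without sending them) to check which HTTP method urllib would pick; the extra lang field is made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'word': 'hello', 'lang': 'zh-TW'}

# Step 1: dict -> str (percent-encoded pairs joined with '&')
encoded_str = urlencode(params)

# Step 2: str -> bytes (equivalent to encoded_str.encode('utf-8'))
encoded_bytes = bytes(encoded_str, encoding='utf-8')

# A request that carries data defaults to POST; without data it defaults to GET
post_request = Request('http://httpbin.org/post', data=encoded_bytes)
get_request = Request('http://httpbin.org/get')
print(encoded_str, encoded_bytes, post_request.get_method(), get_request.get_method())
```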

2 timeout

The timeout is given in seconds.

Code for handling a timeout exception:

import socket
from urllib import request, error

url = 'http://www.baidu.com'

try:
    response = request.urlopen(url, timeout=1)
    print(response.read().decode('utf-8'))
except error.URLError as e:
    # e.reason holds the underlying exception; check whether it was a timeout
    if isinstance(e.reason, socket.timeout):
        print('timeout')
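
One caveat worth sketching: in CPython, a timeout that occurs while waiting for or reading the response is typically raised as socket.timeout directly rather than wrapped in URLError, so a more defensive handler catches both. The fetch_or_timeout() helper and the silent local socket below are assumptions used only to trigger a timeout without internet access.

```python
import socket
from urllib import error, request


def fetch_or_timeout(url, timeout=1.0):
    """Return the decoded body, or the string 'timeout' if the request timed out."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except error.URLError as e:
        # Timeouts while connecting or sending arrive wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return 'timeout'
        raise
    except socket.timeout:
        # Timeouts while waiting for or reading the response are raised directly
        return 'timeout'


# Demo target: a local socket that accepts connections but never answers
silent = socket.socket()
silent.bind(('127.0.0.1', 0))
silent.listen(1)
port = silent.getsockname()[1]

# Ignore any system proxy so the request really goes to the local socket
request.install_opener(request.build_opener(request.ProxyHandler({})))
result = fetch_or_timeout(f'http://127.0.0.1:{port}/', timeout=0.5)
silent.close()
print(result)
```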

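3 cafile and capath

These two parameters specify a CA certificate file and a directory of CA certificates, respectively. They are useful when requesting HTTPS URLs. Both have been deprecated since Python 3.6 in favour of the context parameter and were removed in Python 3.13.

Official documentation: https://docs.python.org/3/library/urllib.request.html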
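
On current Python versions the same effect can be sketched with an ssl.SSLContext passed through the context parameter. The open_https() helper is hypothetical; only the context is built here, so the example itself makes no network request.

```python
import ssl
from urllib import request


def open_https(url, cafile=None, capath=None, timeout=10):
    """Open an HTTPS URL, verifying the server against the given CA material.

    With cafile and capath both None, the system's default CA store is used.
    """
    context = ssl.create_default_context(cafile=cafile, capath=capath)
    return request.urlopen(url, timeout=timeout, context=context)


# Building a context needs no network access; inspect its defaults
default_context = ssl.create_default_context()
print(default_context.verify_mode == ssl.CERT_REQUIRED, default_context.check_hostname)
```

A call such as open_https('https://www.python.org/') would then verify the server certificate and host name before returning the response.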

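2. Another way to send a request: Request()

1 Request lets you customize headers

urlopen() can issue the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers or other information, use the more powerful Request class to build it.

In effect, Request() restructures the request data into a Request object; sending the request is still done by urllib (through urlopen()).

Building this data structure lets us treat the request as an independent object and configure its parameters more richly and flexibly.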

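2. Parameters

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

(1) url: required

(2) data

Optional. If passed, it must be bytes. If it is a dictionary, encode it first with urlencode() from the urllib.parse module.

(3) headers

A dictionary. There are two ways to add request headers:

Pass them directly through the headers parameter when constructing the request

Call the add_header() method on the request instance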
Code example:

# Set headers through the Request() constructor

import urllib.request
import urllib.parse

url='http://httpbin.org/post'

headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}

data={'key':'value'}

data=bytes(urllib.parse.urlencode(data),encoding='utf-8')

request=urllib.request.Request(url=url,headers=headers,data=data)

response=urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

Adding a header afterwards with the Request object's add_header() method:

# Using add_header
# request.add_header(field_name, field_value)
import urllib.request
import urllib.parse
url='http://www.baidu.com'

request=urllib.request.Request(url=url)

request.add_header('user-agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36')

r=urllib.request.urlopen(request)

print(r.read().decode('utf-8'))
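
The two approaches produce the same stored header, which can be checked without sending anything. One detail the sketch below highlights is that Request stores header names capitalized, so a later lookup has to use 'User-agent'. The shortened User-Agent string is a placeholder.

```python
from urllib.request import Request

ua = 'Mozilla/5.0 (X11; Linux x86_64)'  # shortened placeholder User-Agent string

# Way 1: pass headers to the constructor
req1 = Request('http://httpbin.org/get', headers={'user-agent': ua})

# Way 2: add the header afterwards
req2 = Request('http://httpbin.org/get')
req2.add_header('user-agent', ua)

# Header names are stored capitalized, so look them up as 'User-agent'
same = req1.get_header('User-agent') == req2.get_header('User-agent') == ua
has_it = req2.has_header('User-agent')
items = req2.header_items()
print(same, has_it, items)
```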

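(4) origin_req_host

The host name or IP address of the requester.

If the request headers are poorly disguised, the original IP address may be discoverable through this field.

(5) unverifiable

Indicates whether the request is unverifiable. The default is False. An unverifiable request is one that the user had no opportunity to approve. For example, when we request an image embedded in an HTML document and we had no option to approve the automatic fetching of that image, unverifiable is True.

(6) method

A string naming the HTTP method to use, such as GET, POST, or PUT.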
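
A short sketch of how the method parameter interacts with the defaults. The httpbin.org URL and the payload are placeholders, and no request is actually sent.

```python
from urllib.request import Request

url = 'http://httpbin.org/anything'
payload = b'name=value'

default_get = Request(url)                                # no data, no method
explicit_put = Request(url, data=payload, method='PUT')   # override the POST default
explicit_head = Request(url, method='HEAD')               # ask for headers only

methods = [r.get_method() for r in (default_get, explicit_put, explicit_head)]
print(methods)
```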

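3 Authentication, cookies, and proxies: building requests with handlers

Introduction to Handler

In short, handlers can be thought of as processors of various kinds: some handle login authentication, some handle cookies, and some handle proxy settings. With them we can do almost anything an HTTP request involves.

(1) Authentication (building the Authorization header)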

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'
# Create the password manager
p = HTTPPasswordMgrWithDefaultRealm()
# Register the credentials for this URL (realm None matches any realm)
p.add_password(None, url, username, password)
# Instantiate HTTPBasicAuthHandler with the password manager
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
# Summary: HTTPPasswordMgrWithDefaultRealm() -> add_password()
# -> HTTPBasicAuthHandler(p) -> build_opener(auth_handler) -> opener.open(url)
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
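
To make the password manager's role a little more concrete, the sketch below queries it directly and then builds the Authorization value that HTTP Basic authentication uses (base64 of "user:password"). It runs offline; the realm name and the example.com URL are made up.

```python
import base64
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, 'http://localhost:5000/', 'username', 'password')

# Realm None acts as a catch-all, so any realm name finds the credentials
found = manager.find_user_password('Any Realm', 'http://localhost:5000/admin')

# URLs outside the registered prefix get no credentials
missing = manager.find_user_password('Any Realm', 'http://example.com/')

# What Basic authentication puts on the wire: base64 of "user:password"
user, pw = found
token = base64.b64encode(f'{user}:{pw}'.encode('utf-8')).decode('ascii')
authorization = f'Basic {token}'
print(found, missing, authorization)
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS outside of local testing.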

(2) Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Map each URL scheme to the proxy that should handle it
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    url = 'http://www.baidu.com'
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

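If you see "[WinError 10061] No connection could be made because the target machine actively refused it", the cause is:
no HTTP proxy service is running on local port 9743, that is, there is no proxy listening at 127.0.0.1:9743, so the request fails.

Solution:
Find a working free proxy IP on Xici and use it instead.

Xici proxy list: https://www.xicidaili.com/nn/

Original link: https://blog.csdn.net/qq_42908549/article/details/86706161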
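
The refused-connection case can be reproduced deliberately. The sketch below points a ProxyHandler at a local port where nothing is listening and reports the failure instead of crashing; fetch_via_proxy() is a hypothetical helper and example.com is a placeholder target.

```python
import socket
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener


def fetch_via_proxy(url, proxy_address, timeout=5):
    """Fetch url through an HTTP proxy at proxy_address ('host:port')."""
    proxy = f'http://{proxy_address}'
    opener = build_opener(ProxyHandler({'http': proxy, 'https': proxy}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except URLError as e:
        # Raised, for example, when nothing is listening on the proxy port
        return f'proxy request failed: {e.reason}'


# Find a local port that is currently unused: bind to port 0, note it, release it
with socket.socket() as probe:
    probe.bind(('127.0.0.1', 0))
    unused_port = probe.getsockname()[1]

result = fetch_via_proxy('http://example.com/', f'127.0.0.1:{unused_port}')
print(result)
```

On Windows the reason shown would be the WinError 10061 message quoted above; on Linux and macOS it is usually "Connection refused".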

(3) Cookies

Cookie handling in urllib involves too many classes; the requests library provides more convenient functions for it.
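
For reference, a minimal sketch of the urllib route, assuming the standard http.cookiejar.CookieJar plus HTTPCookieProcessor combination. The local CookieEchoHandler server is hypothetical: it sets one cookie and echoes back whatever Cookie header it receives, so the round trip can be observed offline.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Demo server: sets a cookie and echoes back any cookie it receives."""

    def do_GET(self):
        body = self.headers.get('Cookie', '').encode('utf-8')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), CookieEchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

# The jar stores cookies from responses and attaches them to later requests
jar = CookieJar()
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))

with opener.open(url, timeout=5) as first:
    first_body = first.read().decode('utf-8')    # no cookie sent yet
with opener.open(url, timeout=5) as second:
    second_body = second.read().decode('utf-8')  # the jar sends the cookie back

stored = {cookie.name: cookie.value for cookie in jar}
server.shutdown()
server.server_close()
print(stored, repr(first_body), repr(second_body))
```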
