Web Crawler Basics

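The previous two articles covered the basics of the urllib module and how to use it. This article covers three topics: sending requests, handling exceptions in a crawler, and disguising the crawler as a browser.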

1. Sending requests

Take a Baidu search as an example: https://www.baidu.com/s?wd=python&ie=UTF-8

Here the wd parameter holds the search term (wd=search term).

import urllib.request
keywd = 'python'            # suppose we want to search for "python"
url = 'http://www.baidu.com/s?wd=' + keywd
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()   # read() returns the page as bytes
# Write the fetched page to a local file; 'wb' because data is bytes
with open('D:/1python/http模擬.html', 'wb') as file:
    file.write(data)

If the search term is Chinese, it must be percent-encoded before it goes into the URL:

import urllib.parse
key = '編程'
key = urllib.parse.quote(key)   # quote() percent-encodes non-ASCII text such as Chinese
url = 'http://www.baidu.com/s?wd=' + key
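
When a query has more than one parameter, encoding each value by hand gets tedious. As a rough sketch, the standard library's urllib.parse.urlencode can build the whole query string in one call. The helper name build_search_url is made up for illustration; the parameter names wd and ie are taken from the Baidu example URL above.

```python
import urllib.parse

def build_search_url(keyword, base="http://www.baidu.com/s"):
    """Return a search URL with the keyword percent-encoded (hypothetical helper)."""
    # urlencode percent-encodes each value (UTF-8 by default) and joins the pairs with '&'
    query = urllib.parse.urlencode({"wd": keyword, "ie": "UTF-8"})
    return base + "?" + query

url = build_search_url("編程")
print(url)
```

The resulting URL contains only ASCII characters, so it can be passed straight to urllib.request.Request.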

2. Exception handling in crawlers

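Crawlers run into errors often. Without exception handling, a single failure interrupts the program and the code after it never runs. Here is a brief introduction to HTTPError and URLError: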

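HTTPError is a subclass of URLError, so catching URLError alone is enough to catch both. However, a plain URLError has no status code and only carries a reason, so on its own it cannot tell you the specific HTTP status. The code below shows how to handle URLError so that it prints whatever error details are available.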

import urllib.error
import urllib.request
try:
    urllib.request.urlopen('http://blog.csdn.net')
except urllib.error.URLError as e:
    if hasattr(e, 'code'):     # only HTTPError has a status code
        print(e.code)
    if hasattr(e, 'reason'):   # print the reason if there is one
        print(e.reason)
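
Because HTTPError is a subclass of URLError, another way to write this is to check for the more specific HTTPError before the general URLError, instead of probing with hasattr. The following is only a minimal sketch: describe_error and fetch are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request

def describe_error(exc):
    """Turn a urllib exception into a short readable message (hypothetical helper)."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status such as 403 or 404
        return "HTTP error %s: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        # No HTTP response at all: DNS failure, refused connection, and so on
        return "URL error: %s" % (exc.reason,)
    return "Unexpected error: %r" % (exc,)

def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:   # also catches HTTPError
        print(describe_error(e))
        return None
```

With this in place, a call such as fetch('http://blog.csdn.net') prints a one-line message and returns None on failure, so the rest of the crawler can keep running.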

3. Disguising the crawler as a browser

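If a website recognizes a request as coming from a crawler, it may refuse access. In that case we need to disguise the crawler as a browser when visiting the site.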

import urllib.request
url = 'http://blog.csdn.net/http://my.csdn.net/weiwei_pig'
# To find a real User-Agent: press F12 in the browser, open the Network tab,
# refresh the page, and look under Request Headers
header = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')
opener = urllib.request.build_opener()   # create an opener object
opener.addheaders = [header]             # attach the header to it
urllib.request.install_opener(opener)    # install globally, so later urlopen() calls send the header too
data = opener.open(url).read()
# Save the fetched page to a local file
with open('D:/1python/1.html', 'wb') as file:
    file.write(data)
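
If only some requests need the browser header, a lighter alternative to installing a global opener is to attach the header to a single Request object. A small sketch follows; the helper name make_browser_request is made up for illustration, and the User-Agent string is the same one used above.

```python
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')

def make_browser_request(url):
    """Build a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})

req = make_browser_request('http://blog.csdn.net')
# urllib stores header names capitalized, so the key reads 'User-agent'
print(req.get_header('User-agent'))
# To actually send the request: data = urllib.request.urlopen(req).read()
```

This leaves the global urlopen behavior untouched, which may be preferable when different sites need different headers.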

Source: class notes from instructor Wei Wei (韋瑋), plus my own takeaways.
