[Learning Python] Web Crawlers: A Basic Tutorial by Example

1. Fetching the data of an entire page

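The urllib module provides an interface for reading data from web pages, so we can read data over www and ftp much like reading a local file. The code in this post uses the Python 2 API; in Python 3 the same functions live in urllib.request. First, we define a getHtml() function, which relies on two calls:

urllib.urlopen() opens a URL.

read() reads the data at that URL. We pass a URL to getHtml(), which downloads the whole page. Running the program prints the entire page.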

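#coding=utf-8
# Note: this is Python 2 code (urllib.urlopen and the print statement).
import urllib

def getHtml(url):
    # Open the URL and get back a file-like object
    page = urllib.urlopen(url)
    # Read the entire response body
    html = page.read()
    return html

html = getHtml("http://tieba.baidu.com/p/2460150866")

# Print the raw HTML of the page
print html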
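The listing above is written for Python 2. As a rough sketch only, the same idea under Python 3 might look like the following. The name get_html and the encoding and timeout parameters are assumptions made for this illustration; they are not part of the original script.

```python
import urllib.request


def get_html(url, encoding="utf-8", timeout=10):
    """Fetch a URL and return its body as text (Python 3 sketch)."""
    # urllib.request.urlopen is the Python 3 counterpart of Python 2's urllib.urlopen.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # read() returns bytes in Python 3
    # Assumption: the page is encoded as `encoding`; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")
```

Calling get_html("http://tieba.baidu.com/p/2460150866") would then return the page as a str, which can be printed with print(). Decoding matters in Python 3 because a str regular expression cannot be applied to bytes.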

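2. Filtering the data you want from the page (images in this example; the previous post covered filtering forum posts)

The core step is to use a regular expression to pick out content in a given format, that is, the content you actually want.

The re module provides regular expression support:

re.compile() compiles a regular expression string into a regular expression object.

re.findall() returns every match of imgre (the compiled regular expression) in html. Because the pattern contains one capture group, each result is the text captured by that group.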

HTML of the image area:

<img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=0d3b26024bed2e73fce98624b700a16d/0faccbef76094b3697eed26ba2cc7cd98d109d3d.jpg" pic_ext="jpeg" height="336" width="560">

Running the script yields the URLs of all images on the page that match this format. The return value is a list.

import re
import urllib

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

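def getImg(html):
    # Capture everything between src=" and the .jpg that precedes pic_ext
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    # findall returns a list holding the captured group of every match
    imglist = re.findall(imgre,html)
    return imglist
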
html = getHtml("http://tieba.baidu.com/p/2460150866")
print getImg(html)
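To see what the pattern captures without touching the network, here is a small Python 3 sketch that applies the same regular expression to the sample img tag shown earlier in this section. The names IMG_RE, SAMPLE_HTML and get_img_urls are made up for this illustration.

```python
import re

# Same pattern as in the script above. The parentheses form a capture group,
# so findall() returns only the captured URL, not the whole match.
# The non-greedy .+? stops at the first '.jpg" pic_ext' it can reach.
IMG_RE = re.compile(r'src="(.+?\.jpg)" pic_ext')

# The sample <img> tag from this section, embedded so the sketch runs offline.
SAMPLE_HTML = (
    '<img pic_type="0" class="BDE_Image" '
    'src="http://imgsrc.baidu.com/forum/w%3D580/sign=0d3b26024bed2e73fce98624b700a16d/'
    '0faccbef76094b3697eed26ba2cc7cd98d109d3d.jpg" '
    'pic_ext="jpeg" height="336" width="560">'
)


def get_img_urls(html):
    """Return a list of the .jpg URLs captured by IMG_RE."""
    return IMG_RE.findall(html)


urls = get_img_urls(SAMPLE_HTML)
print(urls)  # a one-element list containing the src URL of the sample tag
```

A tag whose src does not end in .jpg, or that is not followed by a pic_ext attribute, produces no match, so the result is an empty list in that case.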

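3. Saving the images locally

Compared with the previous step, the key addition is urllib.urlretrieve(), which downloads remote data directly to a local file.

A for loop iterates over the image links collected earlier. To give the files tidier names, each image is renamed using the counter x, which is incremented by 1 on every iteration. The save location can be changed; here it is the root of drive D.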

#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

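def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    # x is the counter used to name the saved files: 0.jpg, 1.jpg, ...
    x = 0
    for imgurl in imglist:
        # Download the image and save it in the root of drive D
        urllib.urlretrieve(imgurl,'D:/%s.jpg' % x)
        x+=1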

html = getHtml("http://tieba.baidu.com/p/2460150866")
getImg(html)
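The download loop can be sketched in Python 3 as follows. This is only an illustration under some assumptions: save_images and its dest_dir parameter are hypothetical names, the target directory is passed in instead of being fixed to drive D, and enumerate() stands in for the manual x counter. urllib.request.urlretrieve is the Python 3 location of urlretrieve; the Python documentation lists it as a legacy interface.

```python
import os
import urllib.request


def save_images(img_urls, dest_dir):
    """Download each URL into dest_dir as 0.jpg, 1.jpg, ... and return the saved paths."""
    # Create the target directory if it does not exist yet.
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    # enumerate() supplies the running index that the original script keeps in x.
    for index, img_url in enumerate(img_urls):
        target = os.path.join(dest_dir, "%d.jpg" % index)
        # urlretrieve copies the remote data straight into the local file.
        urllib.request.urlretrieve(img_url, target)
        saved.append(target)
    return saved
```

For example, save_images(url_list, "D:/") would mirror the behaviour of the script above, where url_list is the list of image URLs extracted in the previous step.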
