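As an example, we scrape one image from a Baidu Tieba page. The relevant HTML source is included below. It does not matter if the URL has since expired, because the point is to analyze and learn.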
<div id="post_content_87286618651" class="d_post_content j_d_post_content clearfix">
<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=b2310eb7be389b5038ffe05ab534e5f1/680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" size="21406" changedsize="true" width="560" height="747" style="cursor: url('http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur'), pointer;">
</div>
First…
Open the page
# -*- coding: utf-8 -*-
import requests
url = 'http://tieba.baidu.com/p/4468445702'
html = requests.get(url)
# Specify the encoding (used when decoding html.text)
html.encoding = 'utf-8'
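The difference between html.content and html.text matters for the steps that follow: in requests, .content holds the raw bytes and .text holds those bytes decoded with .encoding. Below is a minimal offline sketch of the same idea using plain bytes. The sample string is made up for illustration and stands in for a real response body.

```python
# Stand-in for response.content: raw bytes as received from the server
raw = '百度貼吧'.encode('utf-8')

# Stand-in for response.text once the encoding is set to 'utf-8'
text = raw.decode('utf-8')

# Decoding with the wrong codec does not raise here, it just yields mojibake
garbled = raw.decode('latin-1')

print(type(raw).__name__, type(text).__name__)
print(text == '百度貼吧', garbled == '百度貼吧')
```

This is why the encoding is set before any step that works on decoded text, such as the regular expression approach.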
Then…
Get the image URL (3 ways)
Using the BeautifulSoup library
from bs4 import BeautifulSoup
bs = BeautifulSoup(html.content,'html.parser')
# find_all is the current name; findAll is a legacy alias
img_list = bs.find('div', {'id': 'post_content_87286618651'}).find_all('img')
img_src = img_list[0].attrs['src']
print(img_src)
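Since the live page may be gone, the same lookup can be tried offline. The sketch below is not BeautifulSoup: it swaps in the standard library's html.parser.HTMLParser so that it runs without extra packages. SAMPLE_HTML is a trimmed copy of the snippet at the top of this post, and ImgSrcCollector is a hypothetical helper name chosen for illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the snippet from the top of the post (style attribute dropped)
SAMPLE_HTML = (
    '<div id="post_content_87286618651" class="d_post_content j_d_post_content clearfix">'
    '<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/'
    'sign=b2310eb7be389b5038ffe05ab534e5f1/'
    '680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" '
    'size="21406" width="560" height="747">'
    '</div>'
)


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> inside the <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # 0 means we are outside the target div
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth > 0:
                self.depth += 1          # nested div inside the target
            elif attrs.get('id') == self.div_id:
                self.depth = 1           # entered the target div
        elif tag == 'img' and self.depth > 0 and 'src' in attrs:
            self.sources.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


collector = ImgSrcCollector('post_content_87286618651')
collector.feed(SAMPLE_HTML)
collector.close()
offline_src = collector.sources[0]
print(offline_src)
```

For real pages BeautifulSoup stays the more convenient choice. This version is only meant for checking the selection logic against a saved snippet.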
Using XPath
from lxml import etree
selector = etree.HTML(html.content)
images = selector.xpath('//*[@id="post_content_87286618651"]/img')
# xpath() returns a list of elements; take the first match
print(images[0].attrib.get('src'))
Using a regular expression
import re
# Use the decoded text: in Python 3 a str pattern cannot search bytes
text = html.text
pattern = re.compile(r'<img .*src="(.*?)" size="21406"', re.S)
match = pattern.search(text)
print(match.group(1))
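The pattern can be checked without network access by running it against the HTML snippet from the top of this post. In the sketch below, SAMPLE_HTML is a trimmed copy of that snippet, and the second pattern, named strict here, is only a suggested tightening (an assumption, not part of the original approach) that keeps the match inside a single tag.

```python
import re

# Trimmed copy of the snippet from the top of the post (style attribute dropped)
SAMPLE_HTML = (
    '<div id="post_content_87286618651" class="d_post_content j_d_post_content clearfix">\n'
    '<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/'
    'sign=b2310eb7be389b5038ffe05ab534e5f1/'
    '680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg" '
    'size="21406" changedsize="true" width="560" height="747">\n'
    '</div>'
)

# Pattern from the article: re.S lets "." match newlines as well
pattern = re.compile(r'<img .*src="(.*?)" size="21406"', re.S)
match = pattern.search(SAMPLE_HTML)
found_src = match.group(1) if match else None

# Suggested variant: [^>] and [^"] cannot cross a tag or attribute boundary
strict = re.compile(r'<img [^>]*?src="([^"]*)" size="21406"')
strict_match = strict.search(SAMPLE_HTML)
strict_src = strict_match.group(1) if strict_match else None

print(found_src)
print(strict_src == found_src)
```

Checking for None before calling group(1) avoids an AttributeError when the page layout changes and nothing matches.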
Finally…
Write the image to a file
img = requests.get(img_src)
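# 'wb' overwrites; 'ab' would append to an existing file and corrupt the image on a rerun
with open('baidu_tieba.jpg', 'wb') as f:
    f.write(img.content)
# No f.close() needed: the with block closes the file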
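When saving more than one image, a fixed file name no longer works. One option is to derive the name from the image URL. The sketch below uses only the standard library. The helper names filename_from_url and save_bytes are hypothetical, and the bytes written are a placeholder rather than a real download.

```python
import os
import tempfile
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='image.jpg'):
    """Return the last path segment of the URL, or a default if it is empty."""
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    return name or default


def save_bytes(data, directory, name):
    """Write bytes to directory/name in binary mode and return the full path."""
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:   # 'wb' replaces any previous file
        f.write(data)
    return path


img_src = ('http://imgsrc.baidu.com/forum/w%3D580/'
           'sign=b2310eb7be389b5038ffe05ab534e5f1/'
           '680c676d55fbb2fbc7f64cbb484a20a44423dc98.jpg')

# Placeholder for img.content; not a valid JPEG
fake_image = b'\xff\xd8\xff placeholder bytes'

with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(fake_image, tmp, filename_from_url(img_src))
    saved_name = os.path.basename(saved)
    with open(saved, 'rb') as f:
        round_trip = f.read()

print(saved_name)
```

In the real script, img.content from requests.get(img_src) would take the place of fake_image.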