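1. A Few Quick Words

I recently needed some high-resolution, copyright-free images, but after going through the various domestic sites...
Well, they all charge money.
So I decided to scrape an overseas site myself instead (save money where you can).
While I'm at it, let me recommend the site I picked this time: foodiesfeed, a stock photo site for food images.
https://www.foodiesfeed.com/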

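2. Page Source Analysis

First, search for a keyword on the site to see where the parameters go in the page URL. OK, found it: the keyword and the page number both appear in the URL path.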
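To make the URL pattern concrete, here is a minimal sketch of how a listing URL might be assembled. The `/tag/<keyword>/page/<n>` pattern is taken from the crawler code in this post; the helper name `build_tag_url` and the use of `urllib.parse.quote` to escape the keyword are my own assumptions, not something the site documents.

```python
from urllib.parse import quote

# Base of the tag listing pages, as used by the crawler code in this post
BASE_URL = "https://www.foodiesfeed.com/tag/"


def build_tag_url(keyword, page):
    """Return the listing URL for one page of results for a tag keyword."""
    # quote() percent-encodes spaces and non-ASCII characters; safe="" also
    # escapes "/" so the keyword cannot add extra path segments.
    return BASE_URL + quote(keyword.strip(), safe="") + "/page/" + str(int(page))


print(build_tag_url("coffee", 2))
```

Whether the site expects multi-word tags with escaped spaces or with hyphens is something to check in the browser first.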

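As usual, press F12 to open the browser's developer tools.

The real image addresses are there. Each image comes in several sizes, so pick whichever fits your needs.
Now let's write the code: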

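Getting the image addresses

import random
import time

import requests
from bs4 import BeautifulSoup

def download_images(keyword, page):
    # Build the listing URL; page arrives as an int, so convert it before concatenating
    url = 'https://www.foodiesfeed.com/tag/' + keyword + '/page/' + str(page)

    r = requests.get(url)
    # Pause for a random 1 to 5 seconds so the site is not hit too quickly
    rand = random.randint(1, 5)
    time.sleep(rand)
    # Parse the HTML string with BeautifulSoup, which turns the page source into a tree of objects
    soup = BeautifulSoup(r.content, "html.parser")

    # From the page source: each image is an img tag inside a div,
    # and every such img tag has the class "cover-img wp-post-image"
    all_a = soup.find_all('img', class_='cover-img wp-post-image')
    # print(all_a)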
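Since each image is offered in several sizes, one way to choose is to read the img tag's `srcset` attribute, if it has one, and take the widest candidate. This is a minimal sketch, assuming the attribute follows the standard comma-separated `URL WIDTHw` format. The sample value and the helper name `pick_largest` are hypothetical, and I have not verified which attributes foodiesfeed actually uses.

```python
def pick_largest(srcset):
    """Return the URL with the largest width descriptor in a srcset string.

    Returns None for an empty string. Candidates without a width
    descriptor are treated as width 0.
    """
    best_url, best_width = None, -1
    # Naive split: this assumes the URLs themselves contain no commas
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        width = 0
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        if width > best_width:
            best_url, best_width = url, width
    return best_url


# Made-up sample value, for illustration only
sample = ("https://example.com/a-300x200.jpg 300w, "
          "https://example.com/a-1024x683.jpg 1024w, "
          "https://example.com/a-768x512.jpg 768w")
print(pick_largest(sample))
```

With BeautifulSoup this could be applied to each tag in `all_a` as `pick_largest(img.get('srcset', ''))`, falling back to `img.get('src')` when it returns None.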

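Downloading the images

        # This fragment runs inside the loop over the image tags;
        # imgurl is the current image address and i is its index
        try:
            r = requests.get(imgurl, timeout=3)
        except requests.RequestException:
            # Skip this image if the request fails or times out
            continue

        path1 = path + '/%s.jpg' % i
        if not os.path.exists(path1):
            try:
                with open(path1, 'wb') as f:
                    f.write(r.content)
                print("success_download")
            except OSError:
                print("something wrong")
        else:
            print("Image already exists")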
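One thing to watch for in the snippet above: if `i` restarts on every page, images on later pages would collide with files already saved and be skipped as existing. A possible workaround is to name each file after the last path segment of its URL. This is only a sketch under that assumption; `filename_from_url` is a hypothetical helper and the example URL is made up.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image.jpg"):
    """Derive a local file name from the last path segment of a URL."""
    # urlsplit() separates the path from the query string and fragment
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to the default for empty or unsafe names
    if name in ("", ".", ".."):
        return default
    return name


print(filename_from_url(
    "https://example.com/wp-content/uploads/2020/05/pizza-1024x683.jpg?ver=2"))
```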

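Downloads often fail when fetching images. One such case is a urllib.error.ContentTooShortError, which urllib.request.urlretrieve raises when the file was only partially downloaded. To keep a failed download from stopping the crawl, I wrapped the request in exception handling:

        try:
            r = requests.get(imgurl, timeout=3)  # set timeout as you like; I use 3 seconds
        except requests.RequestException:
            continue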

Alternatively, you can call socket.setdefaulttimeout(30). Here is some code I wrote earlier as an example. I won't go into detail, so look it up if you're interested.


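    # Needs "import socket" and "import urllib.request" at the top of the file;
    # url and filename come from the enclosing function
    socket.setdefaulttimeout(30)  # default timeout, in seconds, for new sockets
    try:
        urllib.request.urlretrieve(url, filename)
    except socket.timeout:
        # First attempt timed out: retry up to 5 more times
        count = 1
        while count <= 5:
            try:
                urllib.request.urlretrieve(url, filename)
                break
            except socket.timeout:
                err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
                print(err_info)
                count += 1
        # count only exceeds 5 when every retry timed out
        if count > 5:
            print("downloading picture failed!")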
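The retry logic above can also be folded into a small helper that handles both timeouts and the truncated-download error mentioned earlier. This is a minimal sketch, assuming Python 3. The function name `download_with_retry` and its `retrieve` parameter (injected so the logic can be exercised without a network) are my own additions.

```python
import socket
import time
import urllib.error
import urllib.request


def download_with_retry(url, filename, attempts=5, delay=1.0,
                        retrieve=urllib.request.urlretrieve):
    """Try to download url to filename; return True on success, False otherwise."""
    for attempt in range(1, attempts + 1):
        try:
            retrieve(url, filename)
            return True
        except (socket.timeout, urllib.error.ContentTooShortError) as exc:
            print("attempt %d/%d failed: %s" % (attempt, attempts, exc))
            # Wait before the next attempt, but not after the last one
            if attempt < attempts:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    # urlretrieve has no timeout argument, so set a socket-level default
    socket.setdefaulttimeout(30)
    # Hypothetical URL, for illustration only:
    # download_with_retry("https://example.com/a.jpg", "a.jpg")
```

Other errors, such as HTTP errors or connection failures, are deliberately left to propagate so they are not retried blindly.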

Finally, run the code:

if __name__ == '__main__':
    # Keyword to search images for
    keyword = input('Enter a keyword: ')
    # Number of pages to crawl
    pagenumber = int(input('Enter the number of pages to crawl: '))
    # keyword = urllib.parse.quote(keyword)
    # Check whether the output folder exists
    is_file_exist(keyword)
    for page in range(1, pagenumber + 1):
        download_images(keyword, page)
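The `is_file_exist` helper is not shown in this post. Here is a minimal sketch of what such a helper could look like, assuming its job is simply to make sure a folder named after the keyword exists. The name `ensure_dir` and the `base` parameter are hypothetical, and the actual implementation in the repository may differ.

```python
import os


def ensure_dir(keyword, base="."):
    """Create the folder base/keyword if it is missing and return its path."""
    path = os.path.join(base, keyword)
    # exist_ok=True makes the call a no-op when the folder is already there
    os.makedirs(path, exist_ok=True)
    return path
```

Returning the path lets the caller reuse it as the download target instead of rebuilding the string.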

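The images I scraped:

If you need the code, you can download it from my gitee (https://gitee.com/ceasarxo/python-pic-crawler).
P.S. This is honestly the fastest overseas site I have ever scraped...
The others were painfully slow.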
