[Daily Crawler]: Scraping 20,000 Home-Renovation Photos with a Thread Pool

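I. Preface

Crawler practice for 2020-04-08.
One small crawler exercise every day. If you are learning web scraping, remember to follow!

Learning to program is like learning to ride a bicycle: for a beginner, the most important thing is persistent practice.
I came across this line in the chapter "汲取地下水" (roughly, "Drawing Groundwater"): "Don't worry that you lack talent or ability. Keep practicing, and your talent will grow." Looking back now, that is exactly right.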

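II. Requirements

For details, see yesterday's crawler post: [Daily Crawler]: Build Yourself a Cozy Home, Facing the Sea, with Spring Flowers in Bloom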

III. Technical Approach

import requests 
import random, os, sys
from bs4 import BeautifulSoup  # HTML parsing with the BeautifulSoup4 library
import re, time  # re: regular expressions
from concurrent.futures import ThreadPoolExecutor   # thread pool

For thread pools, see my free column: Python Multithreading and Multiprocessing Programming
For the requests and BeautifulSoup modules, see my free column: Crawler Study Notes
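
As a quick refresher on the thread-pool pattern used in the next section, here is a minimal sketch of `submit` combined with `add_done_callback`. The names `fetch`, `crawl`, and `on_done` are hypothetical, and `fetch` is only a stand-in that performs no network I/O, so the snippet can be run as-is.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Stand-in for a real network request (hypothetical helper, no I/O here).
    return "<html>{}</html>".format(url)


def crawl(urls, max_workers=4):
    """Fetch every URL in a thread pool and collect results via done-callbacks."""
    results = []

    def on_done(future):
        # The callback receives the finished Future, not the return value itself.
        results.append(future.result())

    # Leaving the `with` block calls shutdown(wait=True), so all tasks have finished.
    with ThreadPoolExecutor(max_workers) as pool:
        for url in urls:
            pool.submit(fetch, url).add_done_callback(on_done)
    return results


if __name__ == "__main__":
    pages = crawl(["p1", "p2", "p3"])
    print(sorted(pages))
```

The point to remember is that the callback is handed a `Future`, so it has to call `.result()` before it can work with the page; the crawler below does the same thing in its callbacks.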

IV. Scraping 20,000 Home-Renovation Photos with a Thread Pool

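'''
    Scrape to8to (土巴兔) home-renovation photos with a thread pool, category by category

    version: 02
    author: 金鞍少年
    Blog: https://jasn67.blog.csdn.net/
    date: 2020-04-08

    Following the same approach, every network request could be moved into the thread pool.
    That would be faster, but it is unfriendly to the target site, and high-frequency requests
    may get your IP banned.

'''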
import requests
import random, os, sys
from bs4 import BeautifulSoup
import re,time
from concurrent.futures import ThreadPoolExecutor


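class House_renderings():
    def __init__(self):
        self.pool = ThreadPoolExecutor(10)  # thread pool with 10 worker threads
        self.is_running = True   # True while the crawl is still running
        # Floor-plan menu
        self.house_lis = '''
                        ------- Choose a floor plan ---------
                        1: One bedroom
                        2: Two bedrooms
                        3: Three bedrooms
                        4: Four or more bedrooms
                        5: Duplex
                        6: Villa / luxury home
                        7: Other
                        8: Quit
                        '''

        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
            'Referer': 'https://xiaoguotu.to8to.com/'
        }

        # Proxy IPs. You need to find some free proxies yourself (see my other blog posts).
        # Note: requests picks a proxy by URL scheme, so entries with only an 'http' key
        # are not applied to https:// URLs; add an 'https' key if those should be proxied too.
        self.all_proxies = [
            {'http': '183.166.20.179:9999'}, {'http': '125.108.124.168:9000'},
            {'http': '182.92.113.148:8118'}, {'http': '163.204.243.51:9999'},
            {'http': '175.42.158.45:9999'}]

        self.path = './res/'  # local output directory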

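    # Fetch a page and parse its HTML
    def get_html(self, url):
        try:
            result = requests.get(url=url, headers=self.headers,
                                  proxies=random.choice(self.all_proxies), timeout=10)
            result.raise_for_status()  # raise an exception on 4xx/5xx responses
            html = BeautifulSoup(result.text, 'lxml')
            return html
        except requests.RequestException as e:
            print('Request failed:', url, e)
            return None

    # Generate the URL of every listing page
    def get_page_urls(self, url, html):
        try:
            # The code assumes the second-to-last pager link holds the total page count
            pages = list(html.find('div', class_="pages").find_all('a'))[-2].string
            for page in range(1, int(pages) + 1):
                page_url = url+'p{}'.format(page)
                yield page_url
        except (AttributeError, IndexError, TypeError, ValueError):
            # No usable pager (or the page failed to load): fall back to the category URL itself
            yield url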

    # Collect the detail-page URLs from one listing page
    def get_detail_urls(self, page_html):
        try:
            page_html = page_html.result()  # page_html arrives as a Future
            a_tag = page_html.find('div', class_="xmp_container").find_all('a', class_="item_img")
            for a in a_tag:
                detail_urls = 'https:' + a['href']
                self.pool.submit(self.get_html, detail_urls).add_done_callback(self.Save_detail_page)
        except Exception as e:
            print(e)

    # Parse a detail page and save its images locally
    def Save_detail_page(self, detail_html):
        try:
            detail_html = detail_html.result()  # detail_html arrives as a Future
            tags = detail_html.find('ul', class_="tag_list xg_tag").find_all('a')
            house_style = tags[0].string  # decoration style
            house_type = tags[1].string  # floor plan
            atlas_name = detail_html.find('strong', id="fine_n").get_text()  # gallery name
            atlas_name = re.sub(r'[\\/:*?"<>|]', "_", atlas_name)  # replace characters that are illegal in Windows file names

            file_path = self.path + house_type + '/' + house_style + '/' + atlas_name + '/'  # build the output path

            # Create the directory tree; exist_ok avoids a race between worker threads
            os.makedirs(file_path, exist_ok=True)

            imgs = detail_html.find('div', class_="display-none").find_all('img')
            for index, img in enumerate(imgs, start=1):
                jpg = requests.get(img['src'], headers=self.headers,
                                   proxies=random.choice(self.all_proxies), timeout=10)
                jpg.raise_for_status()
                with open(file_path + '%s.jpg' % index, 'wb') as f:
                    f.write(jpg.content)
                print('{}-{}.jpg saved!'.format(atlas_name, index))
        except Exception as e:
            print(f'Scrape failed: {e}')

    # Choose a floor plan
    def choice_house(self):
        # Menu number -> category listing URL
        urls = {
            "1": 'https://xiaoguotu.to8to.com/list-h2s7i0',
            "2": 'https://xiaoguotu.to8to.com/list-h2s2i0',
            "3": 'https://xiaoguotu.to8to.com/list-h2s3i0',
            "4": 'https://xiaoguotu.to8to.com/list-h2s4i0',
            "5": 'https://xiaoguotu.to8to.com/list-h2s5i0',
            "6": 'https://xiaoguotu.to8to.com/list-h2s6i0',
            "7": 'https://xiaoguotu.to8to.com/list-h2s8i0',
        }
        while True:
            print(self.house_lis)
            choice = input("Enter the number of a floor plan: ").strip()
            if choice in urls:
                return urls[choice]
            elif choice == "8":
                print('Exited.')
                sys.exit()
            else:
                print('Invalid input, please try again!')

    # Main workflow
    def func(self):
        house_classify_url = self.choice_house()
        html = self.get_html(house_classify_url)
        # Fetch every listing page concurrently
        page_futures = [self.pool.submit(self.get_html, url)
                        for url in self.get_page_urls(house_classify_url, html)]
        # Queue the detail pages from the main thread, so that every task has been
        # submitted before the pool is shut down
        for page_future in page_futures:
            self.get_detail_urls(page_future)

        # shutdown() stops accepting new tasks and blocks until the tasks already queued
        # have finished; afterwards all worker threads exit
        self.pool.shutdown(wait=True)
        self.is_running = False  # the crawl is complete

if __name__ == '__main__':
    # Start the crawler
    h = House_renderings()
    h.func()
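
The file-name clean-up step in `Save_detail_page` can be tried in isolation. The following is a minimal sketch, not the exact code above: `safe_name` and `gallery_dir` are hypothetical helpers, and unlike the crawler (which sanitizes only the gallery name) this variant sanitizes every path component and joins them with `os.path.join`.

```python
import os
import re

# Characters that Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_name(name):
    """Replace characters that are illegal in Windows file names with '_'."""
    return ILLEGAL_CHARS.sub("_", name.strip())


def gallery_dir(root, house_type, house_style, atlas_name):
    """Build root/house_type/house_style/atlas_name from sanitized parts."""
    return os.path.join(root, safe_name(house_type),
                        safe_name(house_style), safe_name(atlas_name))


if __name__ == "__main__":
    print(gallery_dir("./res", "duplex", "modern", 'sea view: "spring"?'))
```

Sanitizing the floor plan and style as well is a defensive choice: those strings also come from the scraped page, so they could contain a slash or colon just like the gallery name.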

V. Miscellaneous

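1. You could follow the same approach and move every network request into the thread pool. That would be faster, but it is unfriendly to the target site, and high-frequency requests may get your IP banned. I only put some of the network requests into the pool; modify it yourself if you need more.
2. Is anyone learning Python who would like to team up against procrastination? I feel mine acting up: I keep putting tasks off until the afternoon or evening and then rushing through them.
3. This is not an ad and I am not selling a course. I just want to team up against procrastination. If you'd like to study together, send me a private message or leave a comment, and let's add each other as friends and keep each other accountable while learning Python.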
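
On item 1: if you do move more requests into the pool, one way to stay polite to the target site is to enforce a minimum gap between request starts. Below is a minimal sketch under stated assumptions: `Throttle`, `polite_map`, and the `min_interval` parameter are hypothetical names that are not part of the crawler above, and `fetch` is whatever function performs the actual request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Allow at most one call to start every `min_interval` seconds, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_start = 0.0

    def wait(self):
        # Reserve the next free start slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_start)
            self._next_start = start + self.min_interval
        delay = start - now
        if delay > 0:
            time.sleep(delay)


def polite_map(fetch, urls, max_workers=4, min_interval=0.5):
    """Run fetch(url) for every URL in a thread pool, spacing out the request starts."""
    throttle = Throttle(min_interval)

    def task(url):
        throttle.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers) as pool:
        return list(pool.map(task, urls))  # results come back in input order


if __name__ == "__main__":
    # A fake fetch keeps the demo offline; swap in a real request function as needed.
    print(polite_map(lambda u: u.upper(), ["a", "b", "c"], min_interval=0.1))
```

The pool still overlaps slow responses, but the request rate is capped at roughly one per `min_interval` seconds, which lowers the chance of an IP ban compared with firing all requests at once.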
