Scraping Wallpaper Images with XPath

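This is a crawler that downloads wallpapers from http://www.win4000.com/wallpaper_205_0_10_1.html
Crawl process:

1. Open the URL. The page lists image albums, so first collect the link to each album and create a TXT file named after the album.
2. Follow each album link, collect the image links of that album, and write them to the corresponding TXT file.
3. Go through every TXT file in the folder and download the images listed in it, saving each album to its own folder.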
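
Step 1 names each TXT file after the album title, and a title taken from a web page may contain characters that a file system rejects. The following is a minimal sketch of one way to guard against that. `safe_file_name` is a hypothetical helper that is not part of the original script, and the set of replaced characters is an assumption based on the characters Windows does not allow in file names.

```python
import re


def safe_file_name(title, replacement="_"):
    """Return a version of `title` that is usable as a file or folder name.

    Hypothetical helper: replaces characters that Windows forbids in names
    (which also covers the POSIX path separator) and trims spaces and dots
    from both ends.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip(" .")
    return cleaned or "untitled"


print(safe_file_name('sea/sky: sunset?'))
```

A helper like this could be applied to the album title before building the TXT path and the image folder path, so that both steps agree on the name.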

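Techniques involved:

Custom headers copied from your own browser
Parsing web pages with XPath
A typical crawler workflow

For an overview of web page parsing options, see https://developer.51cto.com/art/201912/608581.htm, which covers regular expressions, BS4, XPath and requests-html.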

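1. Custom header

Press F12 to open the browser's developer tools and go to the Network tab (if it is empty, press F5 to reload the page).

Pick any request, right-click it and, as shown in the figure, click Copy as cURL (bash).

Open https://curl.trillworks.com/ and paste the copied command there to convert it into Python code, including the headers dictionary.

Reference: https://segmentfault.com/a/1190000019926385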
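
To make the conversion step less of a black box, here is a simplified sketch of the idea behind it: pulling the `-H` values out of a copied curl command into a dictionary that can be passed as `headers=` to `requests.get`. This is not the converter site's implementation. `headers_from_curl` and the sample command are hypothetical, and the sketch only handles the `-H`/`--header` flags.

```python
import shlex


def headers_from_curl(curl_command):
    """Collect the -H/--header values of a curl command into a dict.

    Hypothetical helper: header names are lower-cased, other curl flags
    are ignored.
    """
    tokens = shlex.split(curl_command)
    headers = {}
    # Look at each token together with the token that follows it.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip().lower()] = content.strip()
    return headers


# Shortened, made-up example of what "Copy as cURL (bash)" produces.
example = (
    "curl 'http://www.win4000.com/wallpaper_205_0_10_1.html' "
    "-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64)' "
    "-H 'Accept: text/html' --compressed"
)
headers = headers_from_curl(example)
print(headers)
```

In practice only a few of the copied headers are usually needed; the script below sends just `user-agent`.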

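2. Using XPath

XPath can be thought of as working like the Windows file system: you locate a tag by following a path to it. Recommended browser extension: XPath Helper (useful for checking the expressions you write).
Find the element you want and copy its XPath, as shown in the figure:

Using the extension, as shown in the figure:

For the XPath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp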
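
The path idea can be tried offline before pointing it at a real page. The sketch below runs a few XPath expressions with `lxml` against a small made-up HTML snippet; the markup, class name and URLs are assumptions for illustration and do not reflect the real structure of the wallpaper site.

```python
from lxml import etree

# Made-up markup standing in for an album list page.
SAMPLE_HTML = """
<html><body>
  <div class="list">
    <ul>
      <li><a href="/album_1.html" title="Album one"><img src="/thumb1.jpg"></a></li>
      <li><a href="/album_2.html" title="Album two"><img src="/thumb2.jpg"></a></li>
    </ul>
  </div>
</body></html>
"""

page = etree.HTML(SAMPLE_HTML)

# "/" walks down one level, [@class="..."] filters by attribute,
# and a trailing /@name returns attribute values instead of elements.
links = page.xpath('//div[@class="list"]/ul/li/a/@href')
titles = page.xpath('//div[@class="list"]/ul/li/a/@title')

# [1] selects the first matching child (XPath positions start at 1).
first_thumb = page.xpath('//li[1]/a/img/@src')

print(links, titles, first_thumb)
```

Anchoring on an attribute such as `class` tends to survive layout changes better than a long positional path like `div[4]/div/div[3]`, which breaks as soon as a wrapper element is added or removed.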

3. Source code (the workflow is explained in the comments)

# coding=utf-8
import os
import sys
import time
import requests
from lxml import etree

class Reptile:
    "Crawler class: collects the image URLs and saves the images"
    def __init__(self):
        self.base_url = "http://www.win4000.com/wallpaper_205_0_10_1.html"
        self.headers = { # custom request headers
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
        self.image_url_dir = os.path.join(sys.path[0], "image_url") # directory where the image-link files are saved
        self.image_dir = os.path.join(sys.path[0], "image") # directory where the downloaded images are saved

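    def make_dir(self, dir_name):
        # Create the directory if it does not already exist
        if not os.path.exists(dir_name):
            os.makedirs(dir_name)
    
    def get_html(self, URL):
        # Send the request and return the response; timeouts: connect=6.05, read=30
        # Returns None if the request fails or the status code is not 200
        try:
            resp = requests.get(url=URL, headers=self.headers, timeout=(6.05, 30)) # send the request
            if resp.status_code == 200:
                return resp
            print("URL ({0}) returned status code {1}".format(URL, resp.status_code))
        except Exception as e:
            print("URL ({0}) failed: {1}".format(URL, e))
        return None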

    
    def get_url(self):
        self.make_dir(self.image_url_dir) # create the directory
        # Collect the image URLs
        resp = self.get_html(self.base_url) # send the request
        if resp is None:
            print("Could not fetch the album list page")
            return
        resp_html = etree.HTML(resp.text) # parse the page
        jump_urls = resp_html.xpath('./body/div[4]/div/div[3]/div/div/div/div/div/ul/li/a/@href') # album links
        jump_titles = resp_html.xpath('./body/div[4]/div/div[3]/div/div/div/div/div/ul/li/a/@title') # album titles
        print("Albums on this page:", len(jump_urls), len(jump_titles))

        # Visit each album link in turn and save all of its image URLs
        for index, (jump_url, jump_title) in enumerate(zip(jump_urls, jump_titles)):
            url_exect = jump_url
            file_name = os.path.join(self.image_url_dir, jump_title) + ".txt" # link file for the current album
            print("index:", index + 1, " ", jump_url, jump_title)
            
            # Look for the image links
            num = 1
            sign = True
            with open(file_name, "a", encoding="utf-8") as output:
                while sign:
                    detail = self.get_html(url_exect) # the page we jumped to
                    if detail is None:
                        break # stop this album if a page cannot be fetched
                    detail_html = etree.HTML(detail.text)
                    # Title of the image on the current page
                    titles = detail_html.xpath('./body/div[4]/div/div[2]/div/div[2]/div/div[@class="pic-meinv"]/a/img/@title')
                    # print("image{}:{}".format(num, titles))

                    # Use the title to decide whether the page still belongs to the same album
                    if titles and titles[0] == jump_title:
                        url_exect = detail_html.xpath('./body/div[4]/div/div[2]/div/div[2]/div/div[@class="pic-meinv"]/a/@href')[0] # URL of the page holding the next image
                        image = detail_html.xpath('./body/div[4]/div/div[2]/div/div[2]/div/div[@class="pic-meinv"]/a/img/@src') # image resource
                        output.write(image[0])
                        output.write("\n")
                    else:
                        sign = False
                    num += 1

    def down_image(self):
        # Download the images
        if not os.path.exists(self.image_url_dir):
            print("The image_url directory does not exist")
            return
        
        files = os.listdir(self.image_url_dir)
        if len(files) == 0:
            print("The image_url directory contains no files")
            return
        
        self.make_dir(self.image_dir) # create the image directory that holds the downloads
        
        # Read each link file and download its images
        for file in files:
            image_dir = os.path.join(self.image_dir, file[:-4])
            self.make_dir(image_dir) # one directory per album
            with open(os.path.join(self.image_url_dir, file), "r", encoding="utf-8") as url_file:
                rows = url_file.readlines()
                # print(rows)
                index = 1
                for row in rows:
                    print("Downloading: {}, image {}...".format(file[:-4], index))
                    image_name = os.path.join(image_dir, str(index) + ".jpg")
                    image = self.get_html(row.strip()) # fetch the image
                    if image is None:
                        index += 1
                        continue # skip images that could not be downloaded
                    with open(image_name, "wb+") as output:
                        output.write(image.content)
                    index += 1


if __name__ == '__main__':
    reptile = Reptile()

    t = time.time()
    reptile.get_url() # collect the image links
    print(f"get_url used time:{time.time() - t}")
    # get_url used time:101.54531002044678

    t = time.time()
    reptile.down_image() # download the images
    print(f"down_image used time:{time.time() - t}")
    # down_image used time:604.7457594871521

Crawler reference: https://mp.weixin.qq.com
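
The core of `get_url` is a loop that keeps following the "next image" link until the image title stops matching the album title. The sketch below isolates that loop so it can be exercised without network access. `collect_album_images`, the `fetch_page` callable, the `max_pages` limit and the fake site data are all hypothetical; the page limit and the visited-set check are additions that the original script does not have.

```python
def collect_album_images(start_url, album_title, fetch_page, max_pages=200):
    """Follow "next image" links while the page title matches the album.

    fetch_page(url) is a hypothetical callable that returns a dict with the
    keys 'title', 'next_url' and 'image_url', or None if the page is
    unavailable.
    """
    image_urls = []
    seen = set()
    url = start_url
    # Stop on a missing link, a page we already visited, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch_page(url)
        if page is None or page["title"] != album_title:
            break  # fetch failed, or we have walked into the next album
        image_urls.append(page["image_url"])
        url = page["next_url"]
    return image_urls


# Made-up data standing in for the real site: two pages of album A,
# whose last "next" link leads into album B.
FAKE_SITE = {
    "/a_1.html": {"title": "Album A", "next_url": "/a_2.html", "image_url": "/img/a1.jpg"},
    "/a_2.html": {"title": "Album A", "next_url": "/b_1.html", "image_url": "/img/a2.jpg"},
    "/b_1.html": {"title": "Album B", "next_url": "/b_2.html", "image_url": "/img/b1.jpg"},
}

images = collect_album_images("/a_1.html", "Album A", FAKE_SITE.get)
print(images)
```

Passing the page fetcher in as a parameter keeps the stop condition testable on its own; in the real crawler it would wrap `get_html` and the three XPath queries.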
