Python Crawler: Spring Hiring Season Is Almost Here. Are You Ready to Rent an Apartment?

Original

Fantasy!

2020-06-21 10:02

1. Preface

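I have recently been working on a data analysis project that looks at the job market, the employment outlook for Python, and rental housing. The project is finished but cannot be made public yet. The rental analysis needed data, so today I am sharing the rental crawler code first.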

2. Source Code

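Target site: Fang.com (房天下)
Note: if you send the request directly with requests, you will not get the data you want, because the site redirects automatically. We first have to extract the real address from the returned page, then send a request to that address and parse the response.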
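The redirect step in the note above can be sketched as follows. This is a minimal illustration, not the article's implementation: it swaps in the standard library's html.parser for lxml so it runs without extra packages. The class name btn-redir comes from the XPath used in the crawler further down; the sample HTML and the helper names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class RedirectLinkParser(HTMLParser):
    """Collect the href of the first <a> tag whose class list contains btn-redir."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if "btn-redir" in classes:
            self.href = attr_map.get("href")


def extract_redirect_url(html: str) -> Optional[str]:
    """Return the redirect target found in an interstitial page, or None."""
    parser = RedirectLinkParser()
    parser.feed(html)
    return parser.href


# Hypothetical interstitial page, shaped like what the note describes.
sample = '<html><body><a class="btn-redir" href="https://example.com/real?x=1">go</a></body></html>'
print(extract_redirect_url(sample))
```

Returning None when no link is found lets the caller decide whether to retry or skip the page, instead of failing on an empty result list.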
Usage:

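First, go to the target site (Fang.com), choose rentals, and set filters as needed, such as a metro line or a price range.
Then click through to page 2, note the total number of pages (or however many pages you want to crawl), and copy the URL (not the URL of page 1). For example, this is the URL of page 2 of Shenzhen rentals under 6000 yuan: https://sz.zu.fang.com/house1-j079/i32/ Notice the trailing part, /i32/. Now click page 3: https://sz.zu.fang.com/house1-j079/i33/ and the URL ends in /i33/. The pattern is clear: the number after /i3 is the page number. We can use Python's %s string formatting to substitute it, for example: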
```
url = "https://sz.zu.fang.com/house1-j034/i3%s/"
page_num = 10
print([url % x for x in range(1, page_num+1)])
```
The output is:

['https://sz.zu.fang.com/house1-j034/i31/',
'https://sz.zu.fang.com/house1-j034/i32/',
'https://sz.zu.fang.com/house1-j034/i33/',
'https://sz.zu.fang.com/house1-j034/i34/',
'https://sz.zu.fang.com/house1-j034/i35/',
'https://sz.zu.fang.com/house1-j034/i36/',
'https://sz.zu.fang.com/house1-j034/i37/',
'https://sz.zu.fang.com/house1-j034/i38/',
'https://sz.zu.fang.com/house1-j034/i39/',
'https://sz.zu.fang.com/house1-j034/i310/']

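So we replace the page number in the /i3**/ part with %s and let a list comprehension build the URLs. Why go to this trouble? This site asks for a CAPTCHA as soon as a URL is wrong, so the URLs have to be exact. There are ways around that, but we do not need much data, so it is not worth changing the approach: just prepare the URL and the page count yourself. The URL list could also be replaced with a queue. Note that after roughly 3000 records crawled from the same IP address, the site asks for a CAPTCHA. For example: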

https://sz.zu.fang.com/house1-j079/i32/ # original URL of page 2
https://sz.zu.fang.com/house1-j079/i3%s/ # the same URL turned into a format string
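Because a malformed URL triggers a CAPTCHA, it can help to derive the format string from the copied URL instead of editing it by hand. The sketch below is one possible way to do that, assuming the URL always ends in /i3 followed by the page number as described above. The function names are made up for illustration.

```python
import re


def make_page_template(page_url: str) -> str:
    """Turn a copied listing URL such as .../i32/ into a template .../i3%s/."""
    template, count = re.subn(r"/i3\d+/$", "/i3%s/", page_url)
    if count != 1:
        # Refuse anything that does not match the expected shape.
        raise ValueError("URL does not end with /i3<page>/: " + page_url)
    return template


def build_page_urls(template: str, page_num: int) -> list:
    """Fill the template for pages 1..page_num with a list comprehension."""
    return [template % page for page in range(1, page_num + 1)]


template = make_page_template("https://sz.zu.fang.com/house1-j079/i32/")
print(template)
print(build_page_urls(template, 3))
```

Raising an error on an unexpected URL shape keeps a typo from turning into a batch of bad requests.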

Once the URL and the page count are set, the crawler code can be run:

import re
import requests
import pandas as pd
from lxml import etree


class SpiderApartment(object):
    def __init__(self, file_path, url, page_num):
        self.file_path = file_path
        self.url = [url % x for x in range(1, page_num+1)]

    @staticmethod
    def get_true_url(url):
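        """
        Fetch the interstitial page and return the real URL it redirects to
        :param url: listing page URL
        :return: href of the redirect link
        """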
        response = requests.get(url)
        result = etree.HTML(response.text)
        temp_url = result.xpath('//a[@class="btn-redir"]/@href')[0]
        return temp_url

    @staticmethod
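    def get_title(etree_obj):
        # A bare string predicate such as ['title'] is always true in XPath 1.0, so the
        # expressions in this class select by position (p[1], p[3]), not by class name.
        title = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][1]/a")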
        return [i.text for i in title]

    @staticmethod
    def get_address(etree_obj):
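        """
        Extract the detailed address (every three spans form one address)
        :param etree_obj: parsed listing page
        :return: list of addresses joined with "-"
        """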
        temp_list = list()
        result2 = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][3]//span")
        for x in range(0, len(result2), 3):
            temp_list.append("-".join([i.text for i in result2[x: x + 3]]))
        return temp_list

    @staticmethod
    def get_line(etree_obj):
        """
        Get the metro line name
        :param etree_obj: parsed listing page
        :return: list of metro line descriptions
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][1]/span[@class='note subInfor']")
        return [str(i.text).replace("距", "") for i in result]

    @staticmethod
    def get_room_type(etree_obj):
        """
        Get the rental type of each listing
        :param etree_obj: parsed listing page
        :return: list of rental types
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[1]")
        return [re.sub("\n\t", "", i).strip() for i in result]

    @staticmethod
    def get_room_scale(etree_obj):
        """
        Get the layout of each listing (number of bedrooms and living rooms)
        :param etree_obj: parsed listing page
        :return: list of layouts
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[2]")
        return [re.sub("\n\t", "", i).strip() for i in result]

    @staticmethod
    def get_room_size(etree_obj):
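        """
        Get the floor area of each listing
        :param etree_obj: parsed listing page
        :return: list of area values with the unit removed
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[3]")
        # Assumes the area text carries the unit "㎡"; keep only what precedes it.
        return [str(re.sub("\n\t", "", i).strip()).split("㎡")[0] for i in result]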

    @staticmethod
    def get_room_direction(etree_obj):
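        """
        Get the orientation of each listing
        :param etree_obj: parsed listing page
        :return: list of orientations
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[4]")
        # Same split as get_room_size; the "㎡" separator is an assumption.
        return [str(re.sub("\n\t", "", i).strip()).split("㎡")[0] for i in result]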

    @staticmethod
    def get_price(etree_obj):
        """
        Get the price of each listing
        :return: list of prices
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='mt5 alingC']/span")
        return [i.text for i in result]

    def run(self):
        for x in self.url:
            print(x)
            response = requests.get(self.get_true_url(x))
            result = etree.HTML(response.text)
            df = pd.DataFrame()
            df["title"] = self.get_title(result)  # title
            df["address"] = self.get_address(result)  # address
            df["line"] = self.get_line(result)  # metro line
            df["room_type"] = self.get_room_type(result)  # rental type
            df["room_scale"] = self.get_room_scale(result)  # layout
            df["room_size"] = self.get_room_size(result)  # floor area
            # df["room_direction"] = self.get_room_direction(result)  # orientation
            df["price"] = self.get_price(result)  # price
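            # Append mode creates a missing file, so check for the file first
            # and write the header row only when it does not exist yet.
            try:
                with open(self.file_path, encoding="utf_8_sig"):
                    write_header = False
            except FileNotFoundError:
                write_header = True
            df.to_csv(self.file_path, encoding="utf_8_sig", mode="a", header=write_header)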


if __name__ == '__main__':
    name = "./深圳.csv"
    url = "https://sz.zu.fang.com/house1-j034/i3%s/"
    page = 100
    spider = SpiderApartment(name, url, page)
    spider.run()
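The earlier note says the URL list could be replaced with a queue and that a CAPTCHA appears after roughly 3000 records from one IP address. A minimal sketch of that idea follows. It is not part of the crawler above: fetch_page is a hypothetical callable standing in for the download-and-parse step, and the record budget and delay are illustrative values, not tested thresholds.

```python
import time
from queue import Queue, Empty
from typing import Callable, List


def crawl_with_queue(
    urls: List[str],
    fetch_page: Callable[[str], List[dict]],
    max_records: int = 3000,
    delay_seconds: float = 1.0,
) -> List[dict]:
    """Drain a queue of page URLs, stopping once max_records rows are collected."""
    pending: Queue = Queue()
    for url in urls:
        pending.put(url)

    records: List[dict] = []
    while len(records) < max_records:
        try:
            url = pending.get_nowait()
        except Empty:
            break  # every page has been processed
        records.extend(fetch_page(url))
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests
    return records[:max_records]


# Stand-in fetcher so the sketch runs offline; a real one would download and parse.
def fake_fetch(url: str) -> List[dict]:
    return [{"url": url, "row": i} for i in range(30)]


pages = ["https://sz.zu.fang.com/house1-j034/i3%s/" % n for n in range(1, 6)]
rows = crawl_with_queue(pages, fake_fetch, max_records=100, delay_seconds=0)
print(len(rows))
```

Passing the fetcher in as a parameter keeps the queue logic testable without touching the network.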

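Usage: prepare the format-string URL and the page count as described above, then set url and page accordingly. Change the output file name as needed. Data is saved with pandas DataFrame.to_csv; mode="a" means the rows are appended to the file.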
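One detail of append mode is worth spelling out: opening a file with mode "a" creates it if it is missing, so the header row has to be handled explicitly. The sketch below shows the pattern with the standard library's csv module rather than pandas, so it runs without extra packages. The function name, the field names, and the sample rows are placeholders.

```python
import csv
import os
import tempfile


def append_rows(path: str, fieldnames: list, rows: list) -> None:
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Write the BOM only once, at the start of a new file.
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(path, "a", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "rentals.csv")
    fields = ["title", "price"]
    # Placeholder rows, written in two separate calls as two crawled pages would be.
    append_rows(path, fields, [{"title": "Room A", "price": "3500"}])
    append_rows(path, fields, [{"title": "Room B", "price": "4200"}])
    with open(path, newline="", encoding="utf-8-sig") as handle:
        saved = list(csv.reader(handle))
    print(saved)
```

The same check (does the file already exist?) can drive the header argument of to_csv when pandas does the writing.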

3. Crawler Run Demo
