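Preface:

This exercise deals with AJAX dynamic loading and form interaction (POST requests) to scrape job listings from Lagou (拉勾網), parse the JSON data the site returns, and store the scraped data in a MongoDB database.

This post tidies up the code, walks through the reasoning, and confirms that the code worked as of 2020.2.2.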

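Environment:
Python3 (Anaconda3)
PyCharm
Chrome browser

Main modules (the command in parentheses is the install command to run in a cmd window):
requests (pip install requests)
pymongo (pip install pymongo)
json (standard library)
time (standard library)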

1.

Target URL: https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=
Let's look at the Python-related positions.

2.

Looking through the page source, we find no job information in it at all, so we can infer that the page loads its data asynchronously (AJAX).
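
The check in this step can be sketched in a few lines: take some text that is visible in the rendered page and see whether it appears in the raw HTML. The HTML string, the container id, and the helper name `missing_from_static_html` below are made up for illustration; they are not taken from the real Lagou page.

```python
def missing_from_static_html(html: str, expected_texts: list[str]) -> list[str]:
    """Return the expected strings that do not appear in the raw HTML."""
    return [text for text in expected_texts if text not in html]


# Hypothetical static HTML: a page shell with an empty container and no job data.
static_html = "<html><body><div id='job_list'></div></body></html>"

# Strings that are visible in the browser after the page has rendered (examples).
visible_in_browser = ["15k-30k", "Python"]

missing = missing_from_static_html(static_html, visible_in_browser)
print(missing)
```

If everything visible in the browser is missing from the raw HTML, the data is most likely fetched by a separate request after the page loads.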

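3.

Open the developer tools (F12), select the Network tab, click the XHR filter, refresh the page, and inspect the response of each request in turn.

We find that the very first request already returns the information we need, and it is in JSON format.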

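4.

Parse the JSON. Because the structure is fairly complex, first switch to the Preview tab to inspect it (see figure).

The job records sit at the path content -> positionResult -> result, which we reach using Python's json library.
Note: results holds multiple records; a loop later splits them into individual records (see the complete code).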

html = requests.post(url, data=params, headers=headers, cookies=get_cookie(), timeout=5)

# Parse the response body as JSON
json_data = json.loads(html.text)
# Walk the nested keys to reach the list of job records
results = json_data['content']['positionResult']['result']
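
Indexing the nested keys directly raises a KeyError if the response does not have the expected shape. As a rough sketch, the same path can be walked defensively. The helper `extract_results` and both sample payloads are hypothetical; in particular, the shape of the "unexpected" response is an assumption and not what Lagou actually returns.

```python
import json


def extract_results(raw_text: str) -> list:
    """Parse the response body and return the job list, or [] if the path is missing."""
    node = json.loads(raw_text)
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Hypothetical payload with the structure described above.
sample = json.dumps({
    "content": {
        "positionResult": {
            "totalCount": 2,
            "result": [{"positionName": "python"}, {"positionName": "backend"}],
        }
    }
})

# Hypothetical payload without the expected keys (made-up shape).
unexpected = json.dumps({"status": False, "msg": "placeholder message"})

print(len(extract_results(sample)))
print(extract_results(unexpected))
```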

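5.

At this point we already have the information, but we want more than one page. Paging manually shows that the URL does not change, so paging is also loaded asynchronously. As before, we reverse-engineer it by looking for clues in the XHR requests.
It turns out to be a POST request, with the page number controlled by the pn parameter.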
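
As a small illustration, the form data for consecutive pages differs only in pn. The helper name `build_params` is hypothetical; the three keys (first, pn, kd) are the ones used in the complete code in this post.

```python
def build_params(page: int, keyword: str = "python") -> dict:
    """Build the POST form data for one result page."""
    return {
        "first": "true",
        "pn": str(page),   # page number
        "kd": keyword,     # search keyword
    }


# Form data for the first three pages.
payloads = [build_params(pn) for pn in range(1, 4)]
for payload in payloads:
    print(payload)
```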

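6.

The returned JSON data also contains the total number of records (see figure).

Lagou shows 15 positions per page and serves at most 30 pages by default. So we divide the returned total by 15: if the result is less than 30, we use it as the page count; otherwise the page count is 30. The code is as follows.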

# Determine the number of pages to scrape
def get_page(url, params):
    html = requests.post(url, data=params, headers=headers, cookies=get_cookie(), timeout=5)

    # Parse the response body as JSON
    json_data = json.loads(html.text)
    # Walk the nested keys to reach the total record count
    total_count = json_data['content']['positionResult']['totalCount']

    # 15 records per page, rounded up so a partial last page is counted; at most 30 pages
    page_number = min((total_count + 14) // 15, 30)
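
Here is a worked example of the page-count rule, written as a standalone function so the arithmetic is easy to check. The function name `page_count` and its parameters are made up for illustration. It rounds up, since a plain floor of total/15 would drop a partial last page (20 records would give 1 page instead of 2).

```python
def page_count(total_count: int, page_size: int = 15, max_pages: int = 30) -> int:
    """Number of pages to request: ceil(total_count / page_size), capped at max_pages."""
    pages = -(-total_count // page_size)  # ceiling division using integers only
    return min(pages, max_pages)


for total in (0, 15, 20, 450, 1000):
    print(total, page_count(total))
```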

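7.

During scraping an error appeared. At first I had the same idea as another poster: that the headers were incomplete. I tried all sorts of variations, and even adding the entire request header from the browser did not help.

That was frustrating, so I searched on Baidu and found that the request must carry the matching Cookie value. Lagou's cookies contain a timestamp, which makes them effectively single-use, so the code below fetches fresh cookies for each request, which solved the problem.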

# Fetch fresh cookies
def get_cookie():
    # URL of the original listing page
    url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    s = requests.Session()
    s.get(url, headers=headers, timeout=3)  # request the listing page to obtain cookies
    cookie = s.cookies  # the cookies set by this request
    return cookie
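
A possible variation, offered only as a sketch: with a requests.Session, cookies set by a GET are stored on the session and sent with later requests made on that same session, so the listing page and the API call can share one session object. The helper `post_with_fresh_cookies` is hypothetical, and `FakeSession` is a stand-in so the example runs without network access; whether this is enough for Lagou's checks is untested here.

```python
def post_with_fresh_cookies(session, page_url, api_url, data, headers):
    """Visit the listing page first, then POST to the API on the same session."""
    session.get(page_url, headers=headers, timeout=3)
    return session.post(api_url, data=data, headers=headers, timeout=5)


class FakeSession:
    """Stand-in for requests.Session so the sketch runs offline."""

    def __init__(self):
        self.cookies = {}
        self.calls = []

    def get(self, url, **kwargs):
        self.calls.append(("GET", url))
        self.cookies["token"] = "fresh"  # pretend the server set a cookie
        return None

    def post(self, url, **kwargs):
        self.calls.append(("POST", url))
        return dict(self.cookies)  # echo the cookies that would be sent


session = FakeSession()
sent = post_with_fresh_cookies(
    session,
    "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "https://www.lagou.com/jobs/positionAjax.json",
    data={"first": "true", "pn": "1", "kd": "python"},
    headers={},
)
print(session.calls)
print(sent)
```

In real use, `requests.Session()` would be passed in place of `FakeSession()`, with a new session created for each page if the cookies really are single-use.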

For references, see the end of the page.¹²³

Complete code

# Import the required modules
import requests
import json
import time
import pymongo

# Connect to the database
client = pymongo.MongoClient('localhost', 27017)

# Create the database and the collection
mydb = client['mydb']
lagou = mydb['lagou']

# Request headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
}


# Fetch fresh cookies
def get_cookie():
    # URL of the original listing page
    url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    s = requests.Session()
    s.get(url, headers=headers, timeout=3)  # request the listing page to obtain cookies
    cookie = s.cookies  # the cookies set by this request
    return cookie


# Determine the number of pages to scrape
def get_page(url, params):
    html = requests.post(url, data=params, headers=headers, cookies=get_cookie(), timeout=5)

    # Parse the response body as JSON
    json_data = json.loads(html.text)
    # Walk the nested keys to reach the total record count
    total_count = json_data['content']['positionResult']['totalCount']

    # 15 records per page, rounded up so a partial last page is counted; at most 30 pages
    page_number = min((total_count + 14) // 15, 30)

    # Call get_info with the url and the page count
    get_info(url, page_number)


# Scrape the job listings page by page
def get_info(url, page):
    for pn in range(1, page+1):
        # POST request parameters
        params = {
            "first": "true",
            "pn": str(pn),
            "kd": "python"
        }

        # Fetch the data and catch connection errors
        try:
            html = requests.post(url, data=params, headers=headers, cookies=get_cookie(), timeout=5)
            print(url, html.status_code)
            # Parse the response body as JSON
            json_data = json.loads(html.text)
            # Walk the nested keys to reach the list of job records
            results = json_data['content']['positionResult']['result']

            for result in results:
                infos = {

                    # positionName: "python"
                    #
                    # companyFullName: "深圳雲安寶科技有限公司"
                    # companySize: "15-50人"
                    # industryField: "信息安全,數據服務"
                    # financeStage: "A輪"
                    #
                    # firstType: "開發|測試|運維類"
                    # secondType: "後端開發"
                    # thirdType: "Python"
                    #
                    # positionLables: ["雲計算", "大數據"]
                    #
                    # createTime: "2020-02-02 12:51:01"
                    #
                    # city: "深圳"
                    # district: "南山區"
                    # businessZones: ["科技園"]
                    #
                    # salary: "15k-30k"
                    # workYear: "3-5年"
                    # jobNature: "全職"
                    # education: "本科"
                    #
                    # positionAdvantage: "地鐵口近 週末雙休"
                    
                    "positionName": result["positionName"],

                    "companyFullName": result["companyFullName"],
                    "companySize": result["companySize"],
                    "industryField": result["industryField"],
                    "financeStage": result["financeStage"],

                    "firstType": result["firstType"],
                    "secondType": result["secondType"],
                    "thirdType": result["thirdType"],

                    "positionLables": result["positionLables"],

                    "createTime": result["createTime"],

                    "city": result["city"],
                    "district": result["district"],
                    "businessZones": result["businessZones"],

                    "salary": result["salary"],
                    "workYear": result["workYear"],
                    "jobNature": result["jobNature"],
                    "education": result["education"],

                    "positionAdvantage": result["positionAdvantage"]
                }
                print(infos)
                # Insert into the database
                lagou.insert_one(infos)
                # Sleep for 2 seconds
                time.sleep(2)
        except requests.exceptions.ConnectionError:
            print("requests.exceptions.ConnectionError")


# Program entry point
if __name__ == '__main__':
    url = "https://www.lagou.com/jobs/positionAjax.json"
    # POST request parameters
    params = {
        "first": "true",
        "pn": 1,
        "kd": "python"
    }
    get_page(url, params)
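
The long dictionary literal in get_info can also be written as a field list plus a small helper. This is only a sketch of an alternative: the helper `pick_fields` is hypothetical, and it uses dict.get so that a record missing one of the fields yields None instead of raising a KeyError, which is a behaviour change compared with the code above. The field names, including the spelling positionLables, are the ones used in the complete code.

```python
FIELDS = (
    "positionName",
    "companyFullName", "companySize", "industryField", "financeStage",
    "firstType", "secondType", "thirdType",
    "positionLables",
    "createTime",
    "city", "district", "businessZones",
    "salary", "workYear", "jobNature", "education",
    "positionAdvantage",
)


def pick_fields(result: dict, fields=FIELDS) -> dict:
    """Copy only the wanted fields from one job record; missing fields become None."""
    return {name: result.get(name) for name in fields}


# Hypothetical partial record, with one extra key that should be dropped.
sample = {"positionName": "python", "salary": "15k-30k", "city": "深圳", "extraField": 1}
doc = pick_fields(sample)
print(doc)
```

The resulting dictionary could be passed to lagou.insert_one in place of infos.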

