Abstract

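This is the second post in my web-scraping project series. It walks through using selenium to scrape job postings for data analysis positions from Lagou.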

Selenium

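Selenium is a browser automation tool with Python bindings, used mostly for automated testing. It works by driving a real browser programmatically, so there is no need to fill in browser headers by hand or to work out how the page's JavaScript renders its content: anything visible in the Elements panel of Chrome's developer tools can be scraped. A driver such as ChromeDriver must be downloaded to control the browser. This post uses Chrome, but other browsers work as well.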

HOW

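This post uses the Python packages selenium, BeautifulSoup, and pandas to scrape the job listings returned by a Lagou search for "data analysis" (數據分析) and save them to a csv file. The steps are:

Getting the response.

Search for "data analysis" on the Lagou home page with the location set to nationwide (全國), copy the resulting url, create a bronser object, and open the url with it.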

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

bronser = webdriver.Chrome()
bronser.get('https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%85%A8%E5%9B%BD#order')
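The percent-encoded url above can also be built from the search keyword instead of being copied from the address bar. The sketch below is a minimal illustration: the helper name build_search_url is hypothetical, and the url pattern is inferred only from the single url shown in this post, so Lagou may accept or require other parameters. The encoded bytes in that url decode to the simplified-character spellings 数据分析 and 全国, which is why those appear here.

```python
from urllib.parse import quote


def build_search_url(keyword, city):
    """Assemble a Lagou search url (pattern assumed from the url in this post)."""
    # quote() percent-encodes the UTF-8 bytes of each non-ASCII character
    return ('https://www.lagou.com/jobs/list_' + quote(keyword)
            + '?px=default&city=' + quote(city) + '#order')


url = build_search_url('数据分析', '全国')
print(url)
```

The result can be passed straight to bronser.get(url), which makes it easy to reuse the same script for a different keyword or city.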

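Parsing the page.

The outer layer is a loop that handles pagination.

For each page, first parse the html with BeautifulSoup and use soup.find() to locate the job list, then use bronser to locate the next-page button and keep a reference to it.

Next, extract the fields of each posting in the job list into a job_info list, and append each job_info list to the job list. After a page has been scraped, check whether the next-page button is still enabled. If it is, click it with next_button.click() and, to avoid triggering a verification prompt, wait 10s before continuing.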

from selenium.webdriver.common.by import By

job = []
while True:
    # Parse the rendered page; naming the parser avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(bronser.page_source, 'html.parser')
    list_of_position = soup.find('div',class_='s_position_list').find('ul').find_all('li')
    # find_element_by_class_name was removed in Selenium 4; find_element(By.CLASS_NAME, ...) replaces it
    next_button = bronser.find_element(By.CLASS_NAME, 'pager_next')

    for i in list_of_position:
        # Fields stored as data-* attributes on the <li> element itself
        company = i.get('data-company')
        companyid = i.get('data-companyid')
        hrid=i.get('data-hrid')
        positionid=i.get('data-positionid')
        positionname=i.get('data-positionname')
        salary=i.get('data-salary')
        tpladword=i.get('data-tpladword')
        # Fields that have to be read from nested elements
        location = i.find('em').string
        hr_position = i.find('input',class_='hr_position').get('value')
        # The second-to-last text line holds "experience / education"
        position_tag = i.find('div',class_='li_b_l').text.split('\n')[-2]
        experience = position_tag.split('/')[0]
        education = position_tag.split('/')[1]
        # The industry line holds "industry / financing / company scale"
        company_tag = i.find('div',class_='industry').text.strip().split('/')
        industry = company_tag[0]
        financing = company_tag[1]
        company_scale = company_tag[2]
        position_describe = i.find('div',class_='list_item_bot').find('span').text
        company_describe = i.find('div',class_='list_item_bot').find('div',class_="li_b_r").text
        job_info = [positionid,positionname,company,companyid,hrid,hr_position,salary,tpladword,
                    location,experience,education,industry,financing,company_scale,position_describe,company_describe]
        job.append(job_info)

    # On the last page the button carries the extra class pager_next_disabled
    if 'pager_next_disabled' in next_button.get_attribute('class'):
        break
    next_button.click()
    time.sleep(10)
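One fragile spot in the loop above is the positional indexing into the '/'-separated tag strings: if a posting omits one of the parts, an index such as company_tag[2] raises IndexError and the whole scrape stops. The following is a minimal sketch of a more forgiving split. The helper name split_fields and the sample strings are hypothetical; they only mimic the "a / b / c" shape the loop expects and are not real Lagou data.

```python
def split_fields(text, expected, sep='/', default=''):
    """Split text on sep, strip each part, and pad or trim to expected items."""
    parts = [p.strip() for p in text.strip().split(sep)]
    # Pad with the default when the page omits a field
    parts += [default] * (expected - len(parts))
    return parts[:expected]


# Hypothetical sample strings, not taken from the site
experience, education = split_fields('Exp 3-5 years / Bachelor', 2)
industry, financing, company_scale = split_fields('Mobile Internet / Series A', 3)
print(experience, education, industry, financing, repr(company_scale))
```

With this helper a missing field ends up as an empty string in the csv rather than aborting the run, at the cost of silently hiding layout changes on the site.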

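Data storage

Unlike in earlier posts, the parsed data is appended to a list rather than written to a file as it is scraped, which yields a two-dimensional list of all the records. pandas then converts that list into a DataFrame df, df.rename() assigns the column names, and df.to_csv() exports the result as a csv file.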

df = pd.DataFrame(job)
columns = ['positionid','positionname','company','companyid','hrid','hr_position','salary','tpladword','location','experience','education','industry','financing','company_scale','position_describe','company_describe']
df.rename(columns=dict(enumerate(columns)),inplace=True)
df.to_csv('拉鉤招聘.csv')
bronser.close()
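For comparison, the same list-of-lists shape can be written out without pandas. This is a sketch using Python's standard csv module, not the method used in this post; the helper name rows_to_csv and the two-column sample row are hypothetical stand-ins for the 16-field rows collected in job.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Return csv text: one header row followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-column sample; the real rows have 16 fields each
text = rows_to_csv([['1001', 'Data Analyst']], ['positionid', 'positionname'])
print(text)
```

To write to disk, the same writer can be pointed at open(path, 'w', newline='', encoding='utf-8-sig'); the utf-8-sig encoding adds a byte order mark, which helps Excel display Chinese text correctly.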

Result

Summary

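The scrape went fairly smoothly. The one problem was that, with no time.sleep() call at first, a login dialog popped up. Delays of 1s, 3s, and 5s each still triggered the login dialog, on different pages; a 10s delay finally got all the data.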
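Since the delay length turned out to matter, one variation worth trying is to add a random extra to the fixed wait so that page requests are not perfectly periodic. This is only a sketch: the helper name polite_pause and the jitter value are assumptions, and whether jitter changes how often Lagou shows the login dialog was not tested in this post.

```python
import random
import time


def polite_pause(base=10.0, jitter=3.0, sleep=time.sleep):
    """Wait for base seconds plus a random extra of up to jitter seconds.

    Returns the requested delay so the caller can log it.
    """
    delay = base + random.uniform(0.0, jitter)
    sleep(delay)
    return delay
```

In the pagination loop, polite_pause() would take the place of time.sleep(10). The sleep parameter exists so the function can be exercised without actually waiting, for example by passing a list's append method.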
