Scraping Lagou Again and Sidestepping Its Anti-Scraping Measures: Selenium + XPath + re, If You Can See It You Can Scrape It

Original

2020-03-10 05:24

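I previously wrote a post titled "Successfully scraping Lagou with Python: a first look at anti-scraping (a beginner's real scraping journey; it is a bit long)". That was my first attempt at scraping a site with several anti-scraping measures. Before that I had mostly scraped simple, fixed static pages to practice Python's basic web-scraping libraries, so I ran into a lot of setbacks and only barely got through after reading many experts' blogs. By now I understand well both how Lagou loads its data (dynamically, via Ajax) and how its anti-scraping measures work 😁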

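I recently learned Selenium, a browser automation and testing library, and wanted to use it to scrape Lagou again and see how the experience differs. The appeal is that it drives a real browser and simulates human clicks and typing, so there is no need to analyze the page's requests and responses, let alone construct request headers. Since little analysis is needed, let's go straight to the code!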

Import the required libraries:

from selenium import webdriver
import time
import lxml
from lxml import etree
import re

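Do the browser simulation and clicking in the main block. If you put it inside a function instead, the browser may open and load the page successfully but then close almost immediately (most likely because the driver object is discarded when the function returns):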

if __name__ == '__main__':
    url = 'https://www.lagou.com/'
    #login(url)
    # Initialize the browser
    driver = webdriver.Chrome()
    # Navigate to the target page
    driver.get(url)

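After the page loads, a city-selection dialog pops up and gets in the way of reading the page source. So I first find the close button in the markup, locate it, and simulate a click to dismiss the dialog: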

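# Locate the close button of the popup dialog
# <button type="button" id="cboxClose">close</button>
# find_element('id', ...) works on Selenium 3 and 4 ('id' is the value of By.ID);
# the older find_element_by_id helper was removed in Selenium 4.3.
button_close = driver.find_element('id', 'cboxClose')
# Close the popup
button_close.click()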

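That closes the popup. Next, get the search box, type the keyword you want to search for, and click the search button.

Use the key attributes of the elements, as shown in the browser's Elements panel (here, their id values), to pin down the button and the input box: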

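# Wait 1 second for the page source to load
time.sleep(1)
# keywords = input('Enter the job you want to search for: ')
# Named search_input so it does not shadow the built-in input()
search_input = driver.find_element('id', 'search_input')
# Search keyword: "Python web crawler"
search_input.send_keys('python網絡爬蟲')
button_search = driver.find_element('id', 'search_button')
button_search.click()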
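
A fixed time.sleep(1) can be too short on a slow connection and wasteful on a fast one. As a rough sketch of an alternative, the helper below polls a condition until it succeeds or a timeout expires. The name wait_until and its parameters are my own for illustration and are not part of Selenium; Selenium itself ships WebDriverWait in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    `wait_until` is a hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within {} seconds".format(timeout))
        time.sleep(interval)
```

With a live driver this could be used as wait_until(lambda: driver.find_elements('id', 'search_input'))[0], since find_elements returns an empty (falsy) list until the element exists.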

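That completes the keyword search. Once the automated browser opens the results, another unrelated popup appears, so again I inspect it and close it. Your experience may differ, and the popup may not appear at all, so I will skip the screenshot and just give the code I use to close it, followed by grabbing the current page source: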

#<div class="body-btn">給也不要</div>
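# 'class name' is the value of By.CLASS_NAME; find_element_by_class_name was removed in Selenium 4.3
button_btn = driver.find_element('class name', 'body-btn')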
button_btn.click()
time.sleep(1)
page_source = driver.page_source
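
Because this popup does not always appear, an unconditional lookup raises an exception on the runs where it is missing. A minimal sketch of a more tolerant approach, assuming the same driver object and the body-btn class shown above: click_if_present is a hypothetical helper name of my own, and it relies on find_elements (plural), which returns an empty list instead of raising when nothing matches.

```python
def click_if_present(driver, by, value):
    """Click the first element matching the locator, if there is one.

    Returns True when an element was found and clicked, False otherwise.
    `click_if_present` is a hypothetical helper, not part of Selenium.
    """
    # find_elements returns an empty list when nothing matches,
    # unlike find_element, which raises NoSuchElementException.
    matches = driver.find_elements(by, value)
    if not matches:
        return False
    matches[0].click()
    return True
```

Usage would look like click_if_present(driver, 'class name', 'body-btn'), after which the script can continue whether or not the popup was shown.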

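Finally, parse out whatever information you want. I used both re and XPath here to get more practice with each.

You can see that all the information for each job posting sits inside div tags: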

def search_information(page_source):
    tree = etree.HTML(page_source)
    #<h3 style="max-width: 180px;">網絡爬蟲工程師</h3>
    position_name = tree.xpath('//h3[@style="max-width: 180px;"]/text()')
    #<span class="add">[<em>北京·小營</em>]</span>
    position_location = tree.xpath('//span[@class="add"]/em/text()')
    #<span class="format-time">17:15發佈</span>
    position_report_time = tree.xpath('//span[@class="format-time"]/text()')
    #<span class="money">8k-15k</span>
    positon_salary = tree.xpath('//span[@class="money"]/text()')
    #position_edution = tree.xpath('//div[@class="li_b_l"]/text()')
    position_edution = re.findall('<div.*?class="li_b_l">(.*?)</div>',str(page_source),re.S)
    position_result_edution = sub_edution(position_edution)
    position_company_name = tree.xpath('//div[@class="company_name"]/a/text()')
    position_company_href = tree.xpath('//div[@class="company_name"]/a/@href')
    position_company_industry = tree.xpath('//div[@class="industry"]/text()')
    position_company_industry_result = sub_industry(position_company_industry)
    #<div class="li_b_r">“免費早午餐+免費班車+五險兩金+年終獎”</div>
    position_good = tree.xpath('//div[@class="li_b_r"]/text()')

    # Print one block per job posting; the lists are indexed in parallel
    for i in range(len(position_company_name)):
        print("Job title: {}".format(position_name[i]))
        print("Company location: {}".format(position_location[i]))
        print("Posted: {}".format(position_report_time[i]))
        print("Salary: {}".format(positon_salary[i]))
        print("Requirements: {}".format(position_result_edution[i]))
        print("Company name: {}".format(position_company_name[i]))
        print("Company size: {}".format(position_company_industry_result[i]))
        print("Benefits: {}".format(position_good[i]))
        print("Company link: {}".format(position_company_href[i]))
        print('-----------------------------------------------------------------------')
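
The loop above indexes several lists in parallel, so it breaks or misaligns fields if any posting is missing one of them. As a hedged sketch, the helper below (build_records is a name I made up, not something from the original script) combines such lists into one dict per posting and fails loudly when the lengths disagree. The sample values are illustrative only.

```python
def build_records(**columns):
    """Combine parallel lists into a list of dicts, one per job posting.

    Raises ValueError if the lists differ in length, since that means
    at least one field is missing for some posting.
    `build_records` is a hypothetical helper for illustration.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("field lists differ in length: {}".format(lengths))
    names = list(columns)
    # zip(*...) walks the lists in lockstep, yielding one row per posting
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Illustrative values, not real scraped data
records = build_records(
    name=['Crawler engineer', 'Python developer'],
    salary=['8k-15k', '10k-20k'],
)
for record in records:
    print(record)
```

Inside search_information, the same call could take the lists already extracted there, for example build_records(name=position_name, salary=positon_salary).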

Clean up the newlines and leftover tags in the strings returned by the regular expression:

def sub_edution(items):
    # Strip newlines, the nested <span>...</span>, and the leftover
    # HTML comment from each regex match.
    cleaned = []
    for item in items:
        one = re.sub('\n', '', item)
        two = re.sub(' <span.*?>.*?</span>', '', one)
        three = re.sub('<!--<i></i>-->', '', two)
        cleaned.append(three)
    # Keep every other entry (indexes 0, 2, 4, ...)
    return cleaned[::2]

def sub_industry(items):
    # Remove newlines from each industry string
    return [re.sub('\n', '', item) for item in items]
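
To make the three substitutions concrete, here is a small worked example on a single fragment. The sample string is made up, loosely modeled on the markup quoted in the comments above, and is not copied from the real page. clean_requirement is a hypothetical name, and the final strip() is my own addition that the original helper does not perform.

```python
import re


def clean_requirement(fragment):
    """Apply the same three substitutions to one regex match."""
    no_newlines = re.sub('\n', '', fragment)
    # Drop the nested salary <span> (note the leading space in the pattern)
    no_span = re.sub(' <span.*?>.*?</span>', '', no_newlines)
    # Drop the leftover HTML comment
    no_comment = re.sub('<!--<i></i>-->', '', no_span)
    # Extra step, not in the original helper: trim surrounding whitespace
    return no_comment.strip()


# Hypothetical fragment for illustration only
sample = ('\n <span class="money">8k-15k</span>\n'
          '<!--<i></i>-->experience 1-3 years / bachelor\n')
print(clean_requirement(sample))  # prints: experience 1-3 years / bachelor
```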

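The final output printed to the console:
Here I only scraped one page out of 30 pages of results. You can try scraping multiple pages yourselves; it is simple: just look at how the URL differs from page to page.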
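
As a rough sketch of that idea: once you have found how the page number shows up in the URL, you can generate one URL per page and repeat the scrape for each. The template below uses a placeholder example.com address, because the real Lagou URL pattern is not shown in this post; page_urls is likewise a hypothetical helper name.

```python
def page_urls(template, first_page, last_page):
    """Return one URL per page by filling `{page}` in `template`.

    `template` must contain a `{page}` placeholder; the pattern used
    below is a stand-in, not Lagou's actual URL scheme.
    """
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


# Placeholder template for illustration only
for url in page_urls('https://example.com/jobs?page={page}', 1, 3):
    print(url)
```

In the real script, each URL would be passed to driver.get(url), followed by search_information(driver.page_source).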

Thanks for reading 😊
