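First, open the Tencent recruitment site at https://hr.tencent.com/
Start by working out how the page loads its data: is it JSON, dynamically loaded content, or static HTML?
Look through the requests captured in the browser's Network panel for anything useful.
The JSON requests that were captured carry no data,
and the page requests there do not return any job information either.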
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
url = "https://hr.tencent.com/position.php?keywords=python"
response = requests.get(url, headers=headers, verify=False).text
print(response)
The output is as follows:
"C:\Program Files\Python36\python.exe" C:/Users/40122/Desktop/demo_py3/day03/oop.py
C:\Program Files\Python36\lib\site-packages\urllib3\connectionpool.py:857: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
<link media="screen" href="//cdn.m.tencent.com/hr_static/css/all.css?max_age=86412" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/jquery-1.7.2.min.js"></script>
<script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/jquery-ui-1.7.2.custom.min.js"></script>
<script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/thickbox.js"></script>
<link media="screen" href="//cdn.m.tencent.com/hr_static/css/thickbox.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/functions.js"></script>
<script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/utils.js"></script>
<script language="javascript" src="//vm.gtimg.cn/tencentvideo/txp/js/txplayer.js" charset="utf-8"></script>
<div id="header">
<div class="maxwidth">
<a href="index.php" class="left" id="logo"><img src="//cdn.m.tencent.com/hr_static/img/logo.png"/></a>
<div class="right" id="headertr">
<div class="right pl9" id="topshares">
<div class="shares">
<span class="left">分享到:</span>
<!--<a href="javascript:;" onclick="shareto('qqt','top');" id="qqt" title="分享到騰訊微博">分享到騰訊微博</a>-->
<a href="javascript:;" onclick="shareto('qzone','top');" id="qzone" title="分享到QQ空間">分享到QQ空間</a>
<!--<a href="javascript:;" onclick="shareto('pengyou','top');" id="pengyou" title="分享到騰訊朋友">分享到騰訊朋友</a>-->
<a href="javascript:;" onclick="shareto('sinat','top');"id="sinat" title="分享到新浪微博">分享到新浪微博</a>
<!--<a href="javascript:;" onclick="shareto('renren','top');"id="renren" title="分享到人人網">分享到人人網</a>-->
<!--<a href="javascript:;" onclick="shareto('kaixin001','top');"id="kaixin" title="分享到開心網">分享到開心網</a>-->
<div class="clr"></div>
</div>
<!--<a href="javascript:;">分享</a>-->
</div>
<!--<div class="right pl9">-->
<!--<a href="http://t.qq.com/QQjobs" id="tqq" target="_blank">收聽騰訊招聘</a>-->
<!--</div>-->
<div class="right pr9">
<a href="login.php" id="header_login_anchor">登錄</a><span class="plr9">|</span><a href="reg.php">註冊</a>
<span class="plr9">|</span><a href="question.php">反饋建議</a>
<span class="plr9">|</span><a href="http://careers.tencent.com/global" target="_blank">Tencent Global Talent</a>
<script>
var User_Account = "";
</script>
</div>
<div class="clr"></div>
</div>
<div class="clr"></div>
</div>
<div id="menus">
<div class="maxwidth">
<ul id="menu" class="left">
<li id="nav1" ><a href="index.php"> </a></li>
<li id="nav2" class="active" ><a href="social.php"> </a></li>
<li id="nav3"><a href="about.php"> </a></li>
<li id="nav4"><a href="workInTencent.php"> </a></li>
</ul>
<a class="right texti9" target="_blank" id="navxy" href="http://join.qq.com">校園招聘</a>
<div class="clr"></div>
</div>
</div>
<div id="homeDep"><table id="homeads"><tr><td align="center"><a href="http://tencent.avature.net/career" target="blank">全球招聘</a></td><td align="center"><a href="http://game.qq.com/hr/" target="blank">互動娛樂事業羣招聘</a></td><td align="center"><a href="http://hr.tencent.com/position.php?lid=&tid=&keywords=WXG" target="blank">微信事業羣招聘</a></td><td align="center"><a href="http://hr.qq.com/" target="blank">技術工程事業羣招聘</a></td><td align="center"><a href="http://snghr.tencent.com" target="blank">社交網絡事業羣招聘</a></td><td align="center"><a href="http://mighr.qq.com" target="blank">移動互聯網事業羣招聘</a></td><td align="center"><a href="http://hr.tencent.com/position.php?keywords=OMG" target="blank">網絡媒體事業羣招聘</a></td></tr></table></div> <div id="footer">
<div>
<a href="http://www.tencent.com/" target="_blank">關於騰訊</a><span>|</span><a href="http://www.qq.com/contract.shtml" target="_blank">服務條款</a><span>|</span><a href="http://hr.tencent.com/" target="_blank">騰訊招聘</a><span>|</span><a href="http://careers.tencent.com/global" target="_blank">Tencent Global Talent</a><span>|</span><a href="http://gongyi.qq.com/" target="_blank">騰訊公益</a><span>|</span><a href="http://service.qq.com/" target="_blank">客服中心</a>
</div>
<p>Copyright © 1998 - 2018 Tencent. All Rights Reserved.</p>
</div>
</html>
Process finished with exit code 0
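The InsecureRequestWarning at the top of the output above is a side effect of passing verify=False. Keeping certificate verification on is the safer choice where it works. If the unverified request is accepted on purpose, one possible way to quiet the message is Python's standard warnings filter, sketched below. The function name is made up for illustration, and the filter matches on the start of the warning text shown in the output.

```python
import warnings


def silence_insecure_request_warning():
    """Ignore warnings whose message starts with 'Unverified HTTPS request'.

    Hypothetical helper: call it once before requests.get(..., verify=False).
    """
    # `message` is a regular expression matched against the start of the warning text.
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


if __name__ == "__main__":
    silence_insecure_request_warning()
    # This one is filtered out and never displayed.
    warnings.warn("Unverified HTTPS request is being made.", UserWarning)
```

Other warnings are unaffected, so real problems still show up in the console.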
Inspecting the page elements shows that the job data all sits inside a tbody element.
Filter it out with bs4 or a regular expression.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
url = "https://hr.tencent.com/position.php?keywords=python"
response = requests.get(url, headers=headers, verify=False).text
soup = BeautifulSoup(response, 'lxml')
print(soup.find_all('table', class_='tablelist'))
The data shows up right away, which is a little exciting. Now filter it further with bs4.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
url = "https://hr.tencent.com/position.php?keywords=python"
response = requests.get(url, headers=headers, verify=False).text
soup = BeautifulSoup(response, 'lxml')
jobList = soup.find_all('tr', class_=['even', 'odd'])
for job in jobList:
    # CSS path seen in the browser: tr:nth-child(2) > td.l.square > a
    # job title
    jobName = job.select("td:nth-of-type(1) > a")[0].text
    # detail page url
    joburl = "https://hr.tencent.com/" + job.select("td:nth-of-type(1) > a")[0]['href']
    # job category
    jobType = job.select("td:nth-of-type(2)")[0].text
    # number of openings
    jobnum = job.select("td:nth-of-type(3)")[0].text
    # location
    jobAddr = job.select("td:nth-of-type(4)")[0].text
    print(jobName, joburl, jobType, jobnum, jobAddr)
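The post also mentions a regular expression as an alternative to bs4. Here is a rough sketch of that route using only the standard library. SAMPLE_HTML is invented for illustration: it only imitates the row layout implied by the selectors above (tr with class even or odd, first cell td.l.square holding the link), and the real page may differ, so treat the pattern as a starting point rather than a drop-in replacement.

```python
import re

# Made-up markup that mimics the assumed row structure of the job table.
SAMPLE_HTML = """
<table class="tablelist">
  <tr class="even">
    <td class="l square"><a target="_blank" href="position_detail.php?id=1">Python Engineer</a></td>
    <td>Technical</td><td>2</td><td>Shenzhen</td>
  </tr>
  <tr class="odd">
    <td class="l square"><a target="_blank" href="position_detail.php?id=2">Backend Developer</a></td>
    <td>Technical</td><td>1</td><td>Beijing</td>
  </tr>
</table>
"""

# One match per row: link href, link text, then the next three cells.
ROW_RE = re.compile(
    r'<tr class="(?:even|odd)">\s*'
    r'<td[^>]*>\s*<a[^>]*href="([^"]+)"[^>]*>([^<]+)</a>\s*</td>\s*'
    r'<td[^>]*>([^<]*)</td>\s*<td[^>]*>([^<]*)</td>\s*<td[^>]*>([^<]*)</td>'
)


def extract_rows(html):
    """Return one dict per job row found in the given HTML string."""
    return [
        {"name": name.strip(), "href": href, "type": jtype.strip(),
         "num": num.strip(), "addr": addr.strip()}
        for href, name, jtype, num, addr in ROW_RE.findall(html)
    ]


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

A regex like this breaks as soon as the markup changes attribute order or nesting, which is one reason the bs4 version is usually easier to maintain.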
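The first page of data has now been scraped.
Next, follow each link to get the details of the job.
This can be done by nesting a second requests call inside the loop.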
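Before the nested request can be sent, the relative href from the list page has to be turned into a full URL. The code in this post does that by string concatenation. A slightly more forgiving option, sketched here, is urljoin from the standard library. The helper name detail_url is hypothetical.

```python
from urllib.parse import urljoin

# The list page the hrefs were scraped from.
LIST_URL = "https://hr.tencent.com/position.php?keywords=python"


def detail_url(href, base=LIST_URL):
    """Resolve a relative href from the list page into an absolute URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(detail_url("position_detail.php?id=41998&keywords=python&tid=87&lid=2218"))
```

urljoin also copes with hrefs that start with a slash or are already absolute, which plain concatenation does not.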
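Now see what other information can be collected.
Clicking through the page numbers shows that the data is loaded statically, because the URL changes with every page.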
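Since each page has its own URL, the page address can be built from the search keyword and a start offset. A minimal sketch, assuming the keywords and start query parameters used in this post's URLs (the helper name build_page_url is made up):

```python
from urllib.parse import urlencode

BASE_URL = "https://hr.tencent.com/position.php"


def build_page_url(keyword, start=0):
    """Build a list-page URL for a keyword and a row offset (0, 10, 20, ...)."""
    # urlencode escapes characters that are not safe in a query string.
    return BASE_URL + "?" + urlencode({"keywords": keyword, "start": start})


if __name__ == "__main__":
    print(build_page_url("python", 10))
```

Using urlencode instead of % formatting means keywords with spaces or symbols are escaped correctly.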
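The total number of pages is also needed, so that a loop can visit every page. Nothing else relevant turns up in the Network panel, but each page shows 10 postings, held in rows with the classes even and odd.
So try to find the total number of postings instead; the page count can be derived from it.
The number of postings is in a span tag.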
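The arithmetic is small enough to check by hand: with 10 postings per page, the page count is the total divided by 10, rounded up, and each page starts at a multiple of 10. A short sketch (page_offsets is an illustrative name, not something from the site):

```python
import math


def page_offsets(total, page_size=10):
    """Return the start offset of every page needed to cover `total` postings."""
    pages = math.ceil(total / page_size)  # round up so a partial last page is included
    return [i * page_size for i in range(pages)]


if __name__ == "__main__":
    print(page_offsets(25))  # [0, 10, 20]
```

For example, 25 postings need three pages, starting at offsets 0, 10 and 20.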
Now write the code:
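import math

import requests
from bs4 import BeautifulSoup

# Request headers shared by all functions below (same User-Agent as in the earlier snippets)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

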
def getJobOrder(url):
    '''
    Fetch a job's detail page and return its responsibilities and requirements.
    :return: (jobRes, jobOrder)
    '''
    response = requests.get(url, headers=headers, verify=False).text
    soup = BeautifulSoup(response, 'lxml')
    # The first ul.squareli holds the responsibilities, the second the requirements
    # jobRes = soup.select('ul[class="squareli"]')
    jobRes = soup.select("ul.squareli")[0].text
    jobOrder = soup.select(".squareli")[1].text
    # print(jobRes)
    # print("=====", jobOrder)
    return jobRes, jobOrder


def getJobInfo(url):
    '''
    Fetch one list page and print the details of every job on it.
    :return:
    '''
    # url = "https://hr.tencent.com/position.php?lid=2218&tid=87&keywords=python&start=10#a"
    response = requests.get(url, headers=headers, verify=False).text
    # print(response)
    soup = BeautifulSoup(response, 'lxml')
    # job = soup.find_all('table', class_="tablelist")
    # Passing a list to class_ matches rows that carry any of the listed classes
    jobList = soup.find_all('tr', class_=["even", 'odd'])
    for job in jobList:
        # CSS path seen in the browser: tr:nth-child(2) > td.l.square > a
        # job title
        jobName = job.select("td:nth-of-type(1) > a")[0].text
        # detail page url
        joburl = "https://hr.tencent.com/" + job.select("td:nth-of-type(1) > a")[0]['href']
        # nested request for the detail page
        jonResAndOrder = getJobOrder(joburl)
        # responsibilities
        jobRes = jonResAndOrder[0]
        # requirements
        jobOrder = jonResAndOrder[1]
        # job category
        jobType = job.select("td:nth-of-type(2)")[0].text
        # number of openings
        jobnum = job.select("td:nth-of-type(3)")[0].text
        # location
        jobAddr = job.select("td:nth-of-type(4)")[0].text
        print(jobName, joburl, jobType, jobnum, jobAddr)
        print(jobRes, jobOrder)


def getJobPageNum(url):
    '''
    Get the total number of job postings (the page count is derived from it by the caller).
    :return: total number of postings as an int
    '''
    response = requests.get(url, headers=headers, verify=False).text
    soup = BeautifulSoup(response, 'lxml')
    num = soup.select('span[class="lightblue total"]')[0].text
    print(num)
    return int(num)


if __name__ == '__main__':
    # getJobOrder("https://hr.tencent.com/position_detail.php?id=41998&keywords=python&tid=87&lid=2218")
    # getJobInfo()
    # seed url
    url = "https://hr.tencent.com/position.php?keywords=python"
    # despite its name, pageNum holds the total number of postings
    pageNum = getJobPageNum(url)
    # math.ceil rounds up, so a partial last page is still visited
    num = math.ceil(pageNum / 10)
    for i in range(num):
        url = "https://hr.tencent.com/position.php?keywords=python&start=%d#a" % (i * 10)
        getJobInfo(url)
The result of running it is as follows:
To keep the results, simply write them to a file.
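One possible way to do that is the csv module from the standard library. The sketch below assumes each job has been collected into a dict instead of being printed. The field names and the helper save_jobs are illustrative choices, not part of the original script.

```python
import csv

# Column order for the output file (illustrative field names).
FIELDS = ["name", "url", "type", "num", "addr"]


def save_jobs(rows, path):
    """Write a list of job dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder row standing in for scraped data.
    save_jobs(
        [{"name": "Python Engineer",
          "url": "https://hr.tencent.com/position_detail.php?id=1",
          "type": "Technical", "num": "2", "addr": "Shenzhen"}],
        "jobs.csv",
    )
```

In the script above, getJobInfo would append one such dict per job to a list, and save_jobs would be called once after the page loop finishes.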
Copyright notice: This is an original article by the blogger and may not be reposted without the blogger's permission. https://my.csdn.net/pangzhaowen