Get listing addresses from Lianjia, then fetch longitude/latitude via the AMap (Gaode) API. (The same approach would work with 58.com or Ganji.com data, but their anti-scraping is tougher and I didn't have time to deal with it, so I'm testing the waters with Lianjia first.)
First, scrape Lianjia; each `info` dict holds one listing:
import json
import requests
from bs4 import BeautifulSoup
import re
from fake_useragent import UserAgent

pro = ['220.175.144.55:9999']  # spare proxy, not used below
ua = UserAgent()
for i in range(1, 2):
    # Build the list-page url for each page number
    url = 'http://hz.lianjia.com/ershoufang/pg{}/'
    k = url.format(i)
    # Add request headers, otherwise the request gets rejected
    headers = {'Referer': 'https://hz.lianjia.com/ershoufang/',
               'user-agent': ua.random}
    res = requests.get(k, headers=headers)
    # Parse the page with a regular expression to collect all detail-page urls
    # (bs4 also offers a more convenient way to pull out individual elements)
    # The key regex pieces: . matches any single character, * repeats the
    # previous token any number of times, and .? allows one optional character
    # before "html"
    text = res.text
    re_set = re.compile('https://hz.lianjia.com/ershoufang/[0-9]*.?html')
    re_get = re.findall(re_set, text)
    # Deduplicate while keeping first-seen order
    lst2 = {}.fromkeys(re_get).keys()
    for name in lst2:
        res = requests.get(name, headers=headers)
        info = {}
        text2 = res.text
        soup = BeautifulSoup(text2, 'html.parser')
        info['地址'] = soup.select('.main')[0].text
        info['總價'] = soup.select('.total')[0].text
        info['每平方售價'] = soup.select('.unitPriceValue')[0].text
        info['小區名稱'] = soup.select('.info')[0].text
        info['所在區域'] = soup.select('.info a')[0].text + ':' + soup.select('.info a')[1].text
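The regex extraction and `fromkeys` dedup above can be sketched on a toy snippet (the HTML and listing ids here are made up for illustration):

```python
import re

# Made-up HTML containing two detail links, one of them duplicated
html = ('<a href="https://hz.lianjia.com/ershoufang/101101.html">A</a>'
        '<a href="https://hz.lianjia.com/ershoufang/101101.html">A again</a>'
        '<a href="https://hz.lianjia.com/ershoufang/202202.html">B</a>')

# [0-9]* matches the listing id, .? allows the "." before "html"
pattern = re.compile('https://hz.lianjia.com/ershoufang/[0-9]*.?html')
matches = re.findall(pattern, html)  # 3 hits, one is a duplicate

# dict keys are unique and (since Python 3.7) keep insertion order,
# so this drops duplicates without reshuffling the urls
unique = list({}.fromkeys(matches).keys())
print(unique)
```

Note that `{}.fromkeys(...)` returns a dict whose keys are the urls, so wrapping it in `list(...)` gives a deduplicated list in the original order.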
Next, call the AMap API. You need to apply for a key to use AMap's services:
Go to the AMap open-platform site and register an account.
Then create an application.
Once it's created you get a key (the application name can be anything).
The geocoding response then looks roughly like the screenshot; the highlighted `location` field is the longitude/latitude we want.
# Look up the longitude/latitude for the address via the AMap geocoding API
mc = soup.select('.info')[0].text
location1 = '杭州' + mc
# print(location1)
base = 'https://restapi.amap.com/v3/geocode/geo?key=3e176b0540a337b449930fc4c12cab11&address=' + location1
response = requests.get(base)
result = json.loads(response.text)
info['經緯度'] = result['geocodes'][0]['location']
print(info)
with open('G:/新建文件夾/pc/image/a.csv', 'a', encoding='utf-8') as data:
    print(str(info), file=data)
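One caveat: `result['geocodes'][0]['location']` raises a KeyError or IndexError whenever AMap fails to resolve an address. A defensive sketch, using hard-coded sample responses instead of a live request (a live call would pass `key` and `address` through requests' `params` argument, which also URL-encodes the Chinese address):

```python
import json

def extract_location(result):
    # AMap returns status "1" on success along with a geocodes list;
    # anything else means the address could not be resolved
    if result.get('status') != '1' or not result.get('geocodes'):
        return None
    return result['geocodes'][0].get('location')

# Sample success/failure payloads (shapes follow the v3 geocode response)
ok = json.loads('{"status":"1","geocodes":[{"location":"120.153576,30.287459"}]}')
bad = json.loads('{"status":"0","geocodes":[]}')

print(extract_location(ok))   # → '120.153576,30.287459'
print(extract_location(bad))  # → None
```

Returning `None` for unresolved addresses lets the caller skip or retry a listing instead of crashing the whole loop.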
Below is the complete code. It's simple enough that I didn't bother wrapping it in functions:
import json
import requests
from bs4 import BeautifulSoup
import re
from fake_useragent import UserAgent

pro = ['220.175.144.55:9999']  # spare proxy, not used below
ua = UserAgent()
for i in range(1, 2):
    # Build the list-page url for each page number
    url = 'http://hz.lianjia.com/ershoufang/pg{}/'
    k = url.format(i)
    # Add request headers, otherwise the request gets rejected
    headers = {'Referer': 'https://hz.lianjia.com/ershoufang/',
               'user-agent': ua.random}
    res = requests.get(k, headers=headers)
    # Parse the page with a regular expression to collect all detail-page urls
    # (bs4 also offers a more convenient way to pull out individual elements)
    # The key regex pieces: . matches any single character, * repeats the
    # previous token any number of times, and .? allows one optional character
    # before "html"
    text = res.text
    re_set = re.compile('https://hz.lianjia.com/ershoufang/[0-9]*.?html')
    re_get = re.findall(re_set, text)
    # Deduplicate while keeping first-seen order
    lst2 = {}.fromkeys(re_get).keys()
    for name in lst2:
        res = requests.get(name, headers=headers)
        info = {}
        text2 = res.text
        soup = BeautifulSoup(text2, 'html.parser')
        info['地址'] = soup.select('.main')[0].text
        info['總價'] = soup.select('.total')[0].text
        info['每平方售價'] = soup.select('.unitPriceValue')[0].text
        info['小區名稱'] = soup.select('.info')[0].text
        info['所在區域'] = soup.select('.info a')[0].text + ':' + soup.select('.info a')[1].text
        # Look up the longitude/latitude for the address via the AMap geocoding API
        mc = soup.select('.info')[0].text
        location1 = '杭州' + mc
        # print(location1)
        base = 'https://restapi.amap.com/v3/geocode/geo?key=3e176b0540a337b449930fc4c12cab11&address=' + location1
        response = requests.get(base)
        result = json.loads(response.text)
        info['經緯度'] = result['geocodes'][0]['location']
        print(info)
        with open('G:/新建文件夾/pc/image/a.csv', 'a', encoding='utf-8') as data:
            print(str(info), file=data)
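A side note on the output: `print(str(info), file=data)` writes Python dict reprs, not real CSV, so spreadsheet tools won't parse it into columns. A minimal sketch with the standard-library `csv.DictWriter` (column names taken from the fields above; the row values and output path here are made up):

```python
import csv

# One made-up listing dict, shaped like the info dicts collected above
rows = [
    {'地址': '某小區 2室1廳', '總價': '300萬', '每平方售價': '35000元/平米',
     '小區名稱': '某小區', '所在區域': '西湖:文教', '經緯度': '120.15,30.28'},
]

with open('listings.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()    # column names as the first row
    writer.writerows(rows)  # one listing per row
```

The `utf-8-sig` encoding adds a BOM so Excel displays the Chinese text correctly; `newline=''` is what the csv module expects when writing files.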
A quick look at the data: