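Environment: Windows 7 + Python 3.6 + PyCharm 2017
Goal: scrape shop information for the Shenzhen area from the mobile version of Meituan Food (美團美食), including shop name, category, address, phone number, average spend per person, opening hours, rating, number of reviews, and latitude/longitude. In the end about 21,000 records were scraped, and the program ran for about 1 hour. Tools: requests, selenium, chrome.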
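1. Meituan desktop site
Open Meituan Shenzhen at https://sz.meituan.com/, click 美食 (Food), and press F12 to open the browser developer tools. Select Network and then XHR at the upper right, then click any sub-area, for example 香蜜湖. A request named getPoiList?cityName=XXXXXXXX shows up. Clicking it reveals a _token parameter in the request URL. This token is presumably computed by some algorithm, so to imitate the browser's request you first need to know how the token is generated. It is most likely produced by JavaScript. With JS-encrypted parameters there are generally two options: work out the encryption and reimplement it in your own code, or call the site's JS code directly. The parameter also seems to have been added only in the last few months. I searched online without finding a solution, and reading the JS files myself got me nowhere, so I had to give up on the desktop site. If anyone knows how to handle this token, please let me know, thanks! If you really need the token, selenium + chrome should also work, since each token presumably stays valid for a while.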
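2. Meituan mobile site
Since the desktop site was a dead end, another route was needed. Many sites now have a desktop version, a mobile version, and an app, and the mobile version's anti-scraping measures are usually simpler. Open the Meituan mobile site at https://i.meituan.com/, press F12 to open the developer tools, and click the two boxes marked 1 in the figure below to emulate a mobile browser.
Then click 美食 (Food) to reach the screen in the figure below, and look at the two requests on the right. The first request returns the basic page skeleton, such as the various category data at the top, which is used later. The second request, list, is a dynamic request that fetches the shop data. It is a POST request whose parameters are shown in the red box in the figure below, and clicking through a few more shops makes their meaning clear. Only four parameters vary: areaId (area), cateId (food category), offset (paging), and uuid (an id issued by the site).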
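You can imitate the browser's POST request directly and change offset to turn pages. Each page holds 15 records, so offset increases by 15 per page. In my tests, paging directly under the current Food page reaches at most 67 pages (1,005 records). Beyond that, either a captcha appears or no data comes back (I did not check which). So the shops have to be scraped category by category.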
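The paging arithmetic above can be sketched as follows. This is a minimal illustration, not the author's code: the helper names page_offsets and build_payload are hypothetical, and build_payload only shows the fields that change between requests, not the full request body.

```python
PAGE_SIZE = 15   # shops per page, as observed in the list response
MAX_PAGES = 67   # deepest page that still returned data in the author's test


def page_offsets(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the offset values needed to page through total_count shops."""
    pages = -(-total_count // page_size)   # ceiling division
    pages = min(pages, max_pages)          # the site stops answering after 67 pages
    return [page * page_size for page in range(pages)]


def build_payload(area_id, offset, cate_id=1):
    """Hypothetical helper: only the fields that vary between requests."""
    return {"areaId": area_id, "cateId": cate_id, "offset": offset, "limit": PAGE_SIZE}


print(page_offsets(40))            # [0, 15, 30]
print(len(page_offsets(46655)))    # 67
```

The last reachable offset is 66 * 15 = 990, which with 15 records per page gives the 1,005-record ceiling mentioned above.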
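The information we need is on each shop's detail page. A detail page URL is usually assembled from a few key parameters, and those parameters can be collected from the list page above. Opening a shop and inspecting its URL shows two main parameters: the shop id (6268902 in this example) and ct_poi. Both can be found in the data returned by the POST request above.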
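As a small sketch of that URL construction: the prefix below is the one the main script uses later in this article, build_detail_url is a hypothetical helper name, and the ct_poi value is a made-up placeholder rather than a real one.

```python
DETAIL_BASE = "https://meishi.meituan.com/i/poi/"  # prefix used by the main script


def build_detail_url(poi_id, ct_poi):
    """Join the shop id and ct_poi into a detail page URL by plain concatenation."""
    return DETAIL_BASE + str(poi_id) + "?ct_poi=" + str(ct_poi)


# EXAMPLE_CT_POI is a placeholder; real values come from the list response.
print(build_detail_url(6268902, "EXAMPLE_CT_POI"))
```

Plain concatenation is used here because that is what the original script does. If ct_poi values ever contain characters that need escaping, urllib.parse.urlencode would be the safer choice.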
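Also, opening a detail page triggers many requests in the browser, so we need to confirm which one returns the shop information we want (shop name, category, address, phone number, average spend per person, opening hours, rating, number of reviews, latitude/longitude). It turns out to be the first request, i.e. the detail page URL above.
Open the HTML returned by the first request and press Ctrl+F to search for the shop's phone number to locate the data. It sits in a <script crossorigin='anonymous'> tag. There are several such tags, so they need to be told apart: when parsing with XPath, take each tag's text, check whether its first 16 characters are window._appState, and treat the rest as JSON.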
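The prefix check described above can be tried on a made-up page. This is only a sketch: it uses a regular expression from the standard library instead of the XPath query used in the real parser further down, and SAMPLE_HTML is invented for illustration, not a real Meituan response.

```python
import json
import re

# Invented sample: two script tags, only one of which carries window._appState.
SAMPLE_HTML = """<html><body>
<script crossorigin="anonymous">window.otherData = {"x": 1};</script>
<script crossorigin="anonymous">window._appState = {"poiInfo": {"name": "Demo Shop", "phone": "0755-00000000"}};</script>
</body></html>"""

SCRIPT_RE = re.compile(r'<script crossorigin="anonymous">(.*?)</script>', re.S)
PREFIX = "window._appState"  # exactly 16 characters


def extract_app_state(html):
    """Return the parsed window._appState object, or None if no tag matches."""
    for body in SCRIPT_RE.findall(html):
        body = body.strip()
        if body[:16] == PREFIX:
            # Keep everything after the first "=", then drop the trailing ";".
            json_text = body.split("=", 1)[1].strip().rstrip(";")
            return json.loads(json_text)
    return None


state = extract_app_state(SAMPLE_HTML)
print(state["poiInfo"]["name"])  # Demo Shop
```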
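3. Basic approach
That gives the basic scraping approach: first collect each shop's id and ct_poi from the list pages, build the detail page URL, then visit the detail page and extract the information. Because paging stops at 67 pages, the scrape has to be split by category. Here I split by area, which should keep every area below 67 pages (1,005 shops). The site shows 46,655 shops for the whole city, but the per-area counts add up to roughly 24,000, and the "all categories" view also shows about 24,000, so I believe the real total is around 24,000. The remaining problem is to collect the areaId of every area.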
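4. Scraping the area ids
Open the Food page: https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1
View the HTML source. The data is again inside a <script crossorigin='anonymous'> tag, where the id of each area is visible. The browser does not display the data in full, so download the HTML and open it in an editor. This is again a matter of processing JSON. The only special case is 南澳新區: it has no sub-areas, so I simply added it to the 坪山區 list.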
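A short sketch of how the area data can be flattened and checked against the paging ceiling. AREA_OBJ is a small excerpt of the structure used in the script of the next section, and the function names sub_areas and over_limit are hypothetical.

```python
# Excerpt of the areaObj structure; the first entry of each list is the
# district-wide "全部" (all) total, the rest are sub-areas.
AREA_OBJ = {
    "28": [{"id": 28, "name": "全部", "regionName": "福田區", "count": 4022},
           {"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105},
           {"id": 741, "name": "華強北", "regionName": "華強北", "count": 572}],
    "31": [{"id": 31, "name": "全部", "regionName": "鹽田區", "count": 407},
           {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
}
PAGE_LIMIT = 67 * 15  # 1005 shops reachable by paging


def sub_areas(area_obj):
    """Collect every sub-area dict, skipping each district's leading total."""
    result = []
    for entries in area_obj.values():
        result.extend(entries[1:])
    return result


def over_limit(areas, limit=PAGE_LIMIT):
    """Names of areas whose shop count cannot be paged through completely."""
    return [a["name"] for a in areas
            if isinstance(a["count"], int) and a["count"] > limit]


areas = sub_areas(AREA_OBJ)
print([a["id"] for a in areas])   # [1056, 741, 38055]
print(over_limit(areas))          # []
```

Running over_limit on the district-level entries instead (for example the 福田區 total of 4022) shows why the scrape is split down to sub-areas.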
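5. Scraping shop ids and ct_poi
With every area id in hand, you can build the POST request directly to fetch the shop list. The request needs a cookie, and a single cookie is enough for the whole run. The response is JSON containing 15 shops. Extract each shop's id and ct_poi and save them to a local CSV file. After scraping, deduplicate the data, treating rows with the same shop id as duplicates. The code also saves the shop category cateName, which does not seem to be available on the detail page. The code is below and should run once the cookie is replaced. After deduplication, 21,872 records remained.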
# coding=utf-8
import csv
import time
import requests
import json


# Scrape shop id, ctPoi and cateName for one area; the argument is the area id
def crow_id(areaid):
    id_list = []
    url = 'https://meishi.meituan.com/i/api/channel/deal/list'
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'application/json',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie': 'XXXXXXXXXXXXXX'
            }
    p = {'https': 'https://27.157.76.75:4275'}
    data = {"uuid":"09dbb48e-4aed-4683-9ce5-c14b16ae7539","version":"8.3.3","platform":3,"app":"","partner":126,"riskLevel":1,"optimusCode":10,"originUrl":"http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1","offset":0,"limit":15,"cateId":1,"lineId":0,"stationId":0,"areaId":areaid,"sort":"default","deal_attr_23":"","deal_attr_24":"","deal_attr_25":"","poi_attr_20043":"","poi_attr_20033":""}
    r = requests.post(url, headers=head, data=data, proxies=p)
    result = json.loads(r.text)
    totalcount = result['data']['poiList']['totalCount']  # total shops in this area, used to work out how many pages to fetch
    datas = result['data']['poiList']['poiInfos']
    print(len(datas), totalcount)
    for d in datas:
        d_list = ['', '', '', '']
        d_list[0] = d['name']
        d_list[1] = d['cateName']
        d_list[2] = d['poiid']
        d_list[3] = d['ctPoi']
        id_list.append(d_list)
    print('Page:1')
    # Save the data to a local csv file
    with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
        write = csv.writer(f)
        for i in id_list:
            write.writerow(i)
    # Fetch page 2 through the last page
    offset = 0
    if totalcount > 15:
        totalcount -= 15
        while offset < totalcount:
            id_list = []
            offset += 15
            m = offset // 15 + 1  # integer page number
            print('Page:%d' % m)
            # Build the POST parameters; changing offset turns the page
            data2 = {"uuid": "09dbb48e-4aed-4683-9ce5-c14b16ae7539", "version": "8.3.3", "platform": 3, "app": "",
                     "partner": 126, "riskLevel": 1, "optimusCode": 10,
                     "originUrl": "http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1",
                     "offset": offset, "limit": 15, "cateId": 1, "lineId": 0, "stationId": 0, "areaId": areaid, "sort": "default",
                     "deal_attr_23": "", "deal_attr_24": "", "deal_attr_25": "", "poi_attr_20043": "", "poi_attr_20033": ""}
            try:
                r = requests.post(url, headers=head, data=data2, proxies=p)
                print(r.text)
                result = json.loads(r.text)
                datas = result['data']['poiList']['poiInfos']
                print(len(datas))
                for d in datas:
                    d_list = ['', '', '', '']
                    d_list[0] = d['name']
                    d_list[1] = d['cateName']
                    d_list[2] = d['poiid']
                    d_list[3] = d['ctPoi']
                    id_list.append(d_list)
                # Save to the local csv file
                with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
                    write = csv.writer(f)
                    for i in id_list:
                        write.writerow(i)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    # Area data copied straight out of the html; 南澳新區 needs special handling because it has no sub-areas
    a = {"areaObj": {"28": [{"id": 28, "name": "全部", "regionName": "福田區", "count": 4022},
{"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105},
{"id": 744, "name": "梅林", "regionName": "梅林", "count": 421},
{"id": 1055, "name": "上沙/下沙", "regionName": "上沙/下沙", "count": 291},
{"id": 2008, "name": "華強南", "regionName": "華強南", "count": 263},
{"id": 742, "name": "八卦嶺/園嶺", "regionName": "八卦嶺/園嶺", "count": 217},
{"id": 741, "name": "華強北", "regionName": "華強北", "count": 572},
{"id": 743, "name": "皇崗/水圍", "regionName": "皇崗/水圍", "count": 136},
{"id": 756, "name": "新城市廣場", "regionName": "新城市廣場", "count": 140},
{"id": 6595, "name": "車公廟", "regionName": "車公廟", "count": 305},
{"id": 6596, "name": "景田", "regionName": "景田", "count": 144},
{"id": 6597, "name": "新洲/石廈", "regionName": "新洲/石廈", "count": 374},
{"id": 6974, "name": "竹子林", "regionName": "竹子林", "count": 107},
{"id": 6975, "name": "市民中心", "regionName": "市民中心", "count": 39},
{"id": 7993, "name": "會展中心", "regionName": "會展中心", "count": 461},
{"id": 7994, "name": "崗廈", "regionName": "崗廈", "count": 110},
{"id": 7996, "name": "福田保稅區", "regionName": "福田保稅區", "count": 29}],
"29": [{"id": 29, "name": "全部", "regionName": "羅湖區", "count": 2191},
{"id": 6976, "name": "國貿", "regionName": "國貿", "count": 232},
{"id": 758, "name": "蓮塘", "regionName": "蓮塘", "count": 125},
{"id": 2009, "name": "筍崗", "regionName": "筍崗", "count": 159},
{"id": 748, "name": "翠竹路沿線", "regionName": "翠竹路沿線", "count": 42},
{"id": 745, "name": "東門", "regionName": "東門", "count": 484},
{"id": 746, "name": "寶安南路沿線", "regionName": "寶安南路沿線", "count": 67},
{"id": 757, "name": "火車站", "regionName": "火車站", "count": 96},
{"id": 6598, "name": "萬象城", "regionName": "萬象城", "count": 127},
{"id": 6599, "name": "喜薈城/水庫", "regionName": "喜薈城/水庫", "count": 99},
{"id": 7659, "name": "地王大廈", "regionName": "地王大廈", "count": 85},
{"id": 8469, "name": "黃貝嶺", "regionName": "黃貝嶺", "count": 136},
{"id": 8470, "name": "春風萬佳/文錦渡", "regionName": "春風萬佳/文錦渡", "count": 19},
{"id": 8471, "name": "布心/太白路", "regionName": "布心/太白路", "count": 154},
{"id": 8790, "name": "田貝/水貝", "regionName": "田貝/水貝", "count": 85},
{"id": 8794, "name": "銀湖/泥崗", "regionName": "銀湖/泥崗", "count": 37},
{"id": 8795, "name": "新秀/羅芳", "regionName": "新秀/羅芳", "count": 33},
{"id": 13080, "name": "梧桐山", "regionName": "梧桐山", "count": 34},
{"id": 14095, "name": "KK mall", "regionName": "KK mall", "count": 74}],
"30": [{"id": 30, "name": "全部", "regionName": "南山區", "count": 3905},
{"id": 751, "name": "南頭", "regionName": "南頭", "count": 325},
{"id": 750, "name": "華僑城", "regionName": "華僑城", "count": 126},
{"id": 749, "name": "蛇口", "regionName": "蛇口", "count": 9},
{"id": 1057, "name": "南油", "regionName": "南油", "count": 218},
{"id": 1058, "name": "科技園", "regionName": "科技園", "count": 460},
{"id": 1059, "name": "西麗", "regionName": "西麗", "count": 586},
{"id": 4811, "name": "南山中心區", "regionName": "南山中心區", "count": 635},
{"id": 6591, "name": "海岸城/保利", "regionName": "海岸城/保利", "count": 158},
{"id": 6592, "name": "前海", "regionName": "前海", "count": 32},
{"id": 6593, "name": "白石洲", "regionName": "白石洲", "count": 190},
{"id": 6594, "name": "歡樂海岸", "regionName": "歡樂海岸", "count": 22},
{"id": 7597, "name": "太古城", "regionName": "太古城", "count": 57},
{"id": 7599, "name": "花園城", "regionName": "花園城", "count": 42},
{"id": 13109, "name": "海上世界", "regionName": "海上世界", "count": 225},
{"id": 23117, "name": "世界之窗", "regionName": "世界之窗", "count": 97},
{"id": 25152, "name": "南山京基百納", "regionName": "南山京基百納", "count": 22},
{"id": 36635, "name": "深圳灣", "regionName": "深圳灣", "count": 17}],
"31": [{"id": 31, "name": "全部", "regionName": "鹽田區", "count": 407},
{"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36},
{"id": 755, "name": "沙頭角", "regionName": "沙頭角", "count": 118},
{"id": 8789, "name": "東部華僑城", "regionName": "東部華僑城", "count": 11},
{"id": 8796, "name": "鹽田海鮮食街", "regionName": "鹽田海鮮食街", "count": 22},
{"id": 15349, "name": "壹海城", "regionName": "壹海城", "count": 51},
{"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
"32": [{"id": 32, "name": "全部", "regionName": "寶安區", "count": 6071},
{"id": 6587, "name": "西鄉", "regionName": "西鄉", "count": 15},
{"id": 6586, "name": "新安", "regionName": "新安", "count": 413},
{"id": 6585, "name": "石巖", "regionName": "石巖", "count": 466},
{"id": 752, "name": "寶安中心區", "regionName": "寶安中心區", "count": 458},
{"id": 4653, "name": "港隆城", "regionName": "港隆城", "count": 137},
{"id": 6588, "name": "沙井", "regionName": "沙井", "count": 824},
{"id": 6589, "name": "福永", "regionName": "福永", "count": 631},
{"id": 7684, "name": "松崗", "regionName": "松崗", "count": 435},
{"id": 7685, "name": "公明", "regionName": "公明", "count": 433},
{"id": 7719, "name": "海雅繽紛城", "regionName": "海雅繽紛城", "count": 125},
{"id": 7735, "name": "固戍", "regionName": "固戍", "count": 237},
{"id": 8006, "name": "桃源居", "regionName": "桃源居", "count": 25},
{"id": 14404, "name": "時代城", "regionName": "時代城", "count": 2},
{"id": 17088, "name": "羅田/燕川", "regionName": "羅田/燕川", "count": 45},
{"id": 17089, "name": "西田", "regionName": "西田", "count": 29},
{"id": 17091, "name": "圳美", "regionName": "圳美", "count": 32},
{"id": 17092, "name": "田寮/長圳", "regionName": "田寮/長圳", "count": 3},
{"id": 23524, "name": "沙井京基百納", "regionName": "沙井京基百納", "count": 98},
{"id": 27275, "name": "寶立方", "regionName": "寶立方", "count": 125},
{"id": 36634, "name": "寶安機場", "regionName": "寶安機場", "count": 244},
{"id": 37084, "name": "光明新區", "regionName": "光明新區", "count": 1}],
"33": [{"id": 33, "name": "全部", "regionName": "龍崗區", "count": 5193},
{"id": 753, "name": "羅崗/求水山", "regionName": "羅崗/求水山", "count": 145},
{"id": 6600, "name": "五和/民營市場", "regionName": "五和/民營市場", "count": 250},
{"id": 6601, "name": "平湖", "regionName": "平湖", "count": 356},
{"id": 7656, "name": "橫崗", "regionName": "橫崗", "count": 568},
{"id": 7658, "name": "南澳", "regionName": "南澳", "count": 32},
{"id": 7663, "name": "南聯", "regionName": "南聯", "count": 311},
{"id": 7664, "name": "坪地", "regionName": "坪地", "count": 131},
{"id": 8472, "name": "大運", "regionName": "大運", "count": 186},
{"id": 9013, "name": "李朗聚星商城", "regionName": "李朗聚星商城", "count": 63},
{"id": 13335, "name": "較場尾/大鵬所城", "regionName": "較場尾/大鵬所城", "count": 152},
{"id": 13358, "name": "水頭", "regionName": "水頭", "count": 20},
{"id": 13359, "name": "東涌", "regionName": "東涌", "count": 2},
{"id": 13361, "name": "萬科廣場/世貿", "regionName": "萬科廣場/世貿", "count": 107},
{"id": 13412, "name": "華南城/奧特萊斯", "regionName": "華南城/奧特萊斯", "count": 191},
{"id": 18069, "name": "大芬/南嶺", "regionName": "大芬/南嶺", "count": 359},
{"id": 18228, "name": "雙龍", "regionName": "雙龍", "count": 316},
{"id": 19456, "name": "慢城/三聯", "regionName": "慢城/三聯", "count": 111},
{"id": 19457, "name": "布吉街/東站/天虹", "regionName": "布吉街/東站/天虹", "count": 404},
{"id": 26297, "name": "天虹/阪田/楊美", "regionName": "天虹/阪田/楊美", "count": 344},
{"id": 26298, "name": "崗頭/萬科/雪象", "regionName": "崗頭/萬科/雪象", "count": 199},
{"id": 35919, "name": "華爲阪田基地", "regionName": "華爲阪田基地", "count": 9},
{"id": 36519, "name": "楊梅坑/桔釣沙", "regionName": "楊梅坑/桔釣沙", "count": 39},
{"id": 36520, "name": "葵涌", "regionName": "葵涌", "count": 37},
{"id": 36530, "name": "官湖", "regionName": "官湖", "count": 9},
{"id": 36531, "name": "西涌", "regionName": "西涌", "count": 49},
{"id": 36636, "name": "坪山高鐵站", "regionName": "坪山高鐵站", "count": 41},
{"id": 37501, "name": "龍崗中心城", "regionName": "龍崗中心城", "count": 365}],
"9553": [{"id": 9553, "name": "全部", "regionName": "龍華區", "count": 3080},
{"id": 1061, "name": "龍華", "regionName": "龍華", "count": 958},
{"id": 6584, "name": "民治", "regionName": "民治", "count": 164},
{"id": 7721, "name": "觀瀾", "regionName": "觀瀾", "count": 433},
{"id": 7722, "name": "大浪", "regionName": "大浪", "count": 398},
{"id": 9326, "name": "梅林關", "regionName": "梅林關", "count": 125},
{"id": 9327, "name": "錦繡江南", "regionName": "錦繡江南", "count": 33},
{"id": 36633, "name": "深圳北站", "regionName": "深圳北站", "count": 190},
{"id": 37723, "name": "龍華新區", "regionName": "龍華新區", "count": 14}],
"23420": [{"id": 23420, "name": "全部", "regionName": "坪山區", "count": 393},
{"id": 6602, "name": "坪山", "regionName": "坪山", "count": 232},
{"id": 23429, "name": "坑梓/竹坑", "regionName": "坑梓/竹坑", "count": 128},
{"id": 9535, "name": "南澳大鵬新區", "regionName": "南澳大鵬新區", "count": 91}]
}}
    datas = a['areaObj']
    b = datas.values()
    area_list = []
    for data in b:
        for d in data[1:]:  # skip the first entry, which is the district-wide total
            area_list.append(d)  # store each sub-area in the list; each element is a dict
    l = 0
    old = time.time()
    for i in area_list:
        l += 1
        print('Starting area %d:' % l, i['regionName'], 'shop count:', i['count'])
        try:
            crow_id(i['id'])
            now = time.time() - old
            print(i['name'], 'done!', 'elapsed: %d' % now)
        except Exception as e:
            print(e)
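The deduplication step mentioned in this section (rows with the same shop id count as duplicates) is not part of the script above. A minimal sketch, assuming the column order written by crow_id (name, cateName, poiid, ctPoi) and using invented in-memory rows instead of the real file:

```python
import csv
import io


def dedupe_rows(rows, key_index=2):
    """Keep the first row seen for each shop id (column 2 = poiid)."""
    seen = set()
    unique = []
    for row in rows:
        if len(row) <= key_index:
            continue  # skip blank or malformed lines
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample data standing in for meituan_id.csv.
sample = io.StringIO(
    "Shop A,Cantonese,1001,ct_a\n"
    "Shop B,Hotpot,1002,ct_b\n"
    "Shop A,Cantonese,1001,ct_a2\n"
)
rows = dedupe_rows(csv.reader(sample))
print(len(rows))  # 2
```

With real data the same function could read meituan_id.csv and write the result to a second file using the gb18030 encoding from the script above.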
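6. Scraping the shop detail pages
The detail page URLs can now be built, so the next step is to request them. It is a plain GET request, but it must carry a complete cookie. With a faulty cookie a captcha appears very quickly. One cookie usually lasts about 1,000 requests before a captcha appears, though sometimes only a few hundred. The requests Session object did not seem to obtain the complete cookie, so this article uses selenium + chrome: visit Meituan through a proxy IP, read the cookie, and return the cookie together with the IP for use in requests calls. In practice, after a captcha appears, changing only the IP without changing the cookie is also enough to keep scraping.
There are two pieces of code: the main program, and a get_cookie file that handles cookie and IP acquisition and also contains the detail page parser. The cookie/IP function first fetches an IP (I use a paid proxy), then visits the Meituan Shenzhen home page and sleeps for a few seconds. This is critical: the page must load completely, otherwise cookies will be missing. It then visits the Food page. Proxy quality varies, so test each IP before use. Here the test is the time needed to load the Food page: more than 3 seconds is a fail and a new IP is fetched, while under 3 seconds is OK. Then the cookies are read and checked for completeness. The __utma, __utmc and __utmz entries are sometimes missing, and without them a captcha appears quickly. A complete set normally has 18 cookies. The parsing function is simple too: it returns a flag mark and the shop information info, where the flag indicates whether the fetch succeeded.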
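The completeness check and the conversion into a Cookie header can be sketched like this. The input mimics the list of dicts that selenium's driver.get_cookies() returns (each with name and value keys). cookies_to_header is a hypothetical helper, and the sample values are invented.

```python
REQUIRED = ("__utma", "__utmc", "__utmz")  # entries the article says are sometimes missing


def cookies_to_header(cookies, required=REQUIRED):
    """Return (header_string, []) if complete, else (None, missing_names)."""
    jar = {c["name"]: c["value"] for c in cookies}
    missing = [name for name in required if name not in jar]
    if missing:
        return None, missing
    header = "; ".join("{}={}".format(k, v) for k, v in jar.items())
    return header, []


# Invented cookie values for illustration only.
sample_cookies = [
    {"name": "iuuid", "value": "abc"},
    {"name": "__utma", "value": "1"},
    {"name": "__utmc", "value": "2"},
    {"name": "__utmz", "value": "3"},
]
print(cookies_to_header(sample_cookies))
print(cookies_to_header(sample_cookies[:3]))  # __utmz is missing
```

This sketch checks for the named entries rather than counting cookies and joins every cookie it is given. The get_cookie function below instead tests for more than 17 cookies and builds the string from a fixed list of names.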
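The main function uses multiple threads and is fairly simple: obtain an IP and cookie, then start scraping. The thing to watch is exception handling during the scrape. There are two main kinds of exception. The first is a timeout: sleep for about a second and try once more. If that also fails, count the record as failed, and after three consecutive failed records obtain a new IP and cookie. The second is the error "由於目標計算機積極拒絕,無法連接" (the target machine actively refused the connection), which means requests were too frequent and the server has flagged us, so a new IP and cookie are needed.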
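The retry rules just described can be isolated into a small function and exercised with stand-in callables. This is a sketch under assumptions: fetch_with_retry, fake_fetch and fake_refresh are hypothetical names, the mark values follow the convention of the parser (0 = ok, 1 = connection refused, 2 = timeout or other error), and the sleep and the three-consecutive-failures counter are left out for brevity.

```python
def fetch_with_retry(fetch, refresh_identity, url, identity):
    """Apply the retry rules once for a single url.

    fetch(url, identity) must return (mark, info);
    refresh_identity() must return a fresh ip/cookie pair.
    """
    mark, info = fetch(url, identity)
    if mark == 2:                      # timeout-like failure: try once more
        mark, info = fetch(url, identity)
    if mark == 1:                      # refused: get a new ip/cookie, then retry
        identity = refresh_identity()
        mark, info = fetch(url, identity)
    return mark, info, identity


# Scripted stand-ins: first a timeout, then a refusal, then success.
responses = iter([(2, []), (1, []), (0, ["Demo Shop"])])


def fake_fetch(url, identity):
    return next(responses)


def fake_refresh():
    return "identity-2"


result = fetch_with_retry(fake_fetch, fake_refresh, "https://example.invalid/poi/1", "identity-1")
print(result)  # (0, ['Demo Shop'], 'identity-2')
```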
The get_cookie module code is as follows:
from selenium import webdriver
import requests
import time
import json
from lxml import etree


# Return one proxy ip and its matching cookie; the cookie is returned as a string. The ip is tested first
def get_cookie():
    mark = 0
    while mark == 0:
        # URL of the paid proxy service that hands out ips
        p_url = 'XXXXXXXXXXXXX'
        r = requests.get(p_url)
        html = json.loads(r.text)
        a = html['data'][0]['ip']
        b = html['data'][0]['port']
        val = '--proxy-server=http://' + str(a) + ':' + str(b)
        val2 = 'https://' + str(a) + ':' + str(b)
        p = {'https': val2}
        print('Got IP:', p)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(val)
        # Raw string so the backslashes in the Windows path are not treated as escapes
        driver = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe', chrome_options=chrome_options)
        driver.set_page_load_timeout(8)  # set timeouts
        driver.set_script_timeout(8)
        url = 'https://i.meituan.com/shenzhen/'  # Meituan Shenzhen home page
        url2 = 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1'  # Food page
        try:
            driver.get(url)
            time.sleep(2.5)  # let the page load fully, otherwise cookies are missing
            c1 = driver.get_cookies()
            now = time.time()
            driver.get(url2)
            tt = time.time() - now
            print(tt)
            time.sleep(0.5)
            # ip speed test: a load time above 3 seconds is a fail
            if tt < 3:
                c = driver.get_cookies()
                driver.quit()
                print('*******************')
                print(len(c1), len(c))
                # Check that the cookie set is complete; the normal count is 18
                if len(c) > 17:
                    mark = 1
                    # print(c)
                    x = {}
                    for line in c:
                        x[line['name']] = line['value']
                    # Join the cookies into one string for the header; it is long, so it is built in two parts
                    co1 = '__mta='+x['__mta']+'; client-id='+x['client-id']+'; IJSESSIONID='+x['IJSESSIONID']+'; iuuid='+x['iuuid']+'; ci=30; cityname=%E6%B7%B1%E5%9C%B3; latlng=; webp=1; _lxsdk_cuid='+x['_lxsdk_cuid']+'; _lxsdk='+x['_lxsdk']
                    co2 = '; __utma='+x['__utma']+'; __utmc='+x['__utmc']+'; __utmz='+x['__utmz']+'; __utmb='+x['__utmb']+'; i_extend='+x['i_extend']+'; uuid='+x['uuid']+'; _hc.v='+x['_hc.v']+'; _lxsdk_s='+x['_lxsdk_s']
                    co = co1 + co2
                    print(co)
                    return p, co
                else:
                    print('Cookies missing, count:', len(c))
            else:
                print('Too slow')
                driver.quit()
                time.sleep(3)
        except Exception:
            # Any failure (timeout, missing cookie name, dead proxy): close the browser and try another ip
            driver.quit()


# Parse a shop detail page; returns the shop information info and a flag mark
# Arguments: u holds the url and shop category, pc holds the ip and cookie, m is the number fetched so far,
# n is the thread number, ll is the number of shops left, ttt is how long this thread has been running
def parse(u, pc, m, n, ll, ttt):
    mesg = 'Thread:' + str(n) + ' No:' + str(m) + ' Time:' + str(ttt) + ' left:' + str(ll)  # progress of the current thread
    url = u[0]
    cate = u[1]
    p = pc[0]
    cookie = pc[1]
    mark = 0  # flag: 0 means fetched normally, 1 and 2 are the two kinds of error
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie': cookie
            }
    info = []  # shop information
    try:
        r = requests.get(url, headers=head, timeout=3, proxies=p)
        r.encoding = 'utf-8'
        html = etree.HTML(r.text)
        datas = html.xpath('body/script[@crossorigin="anonymous"]')
        for data in datas:
            try:
                strs = data.text[:16]
                if strs == 'window._appState':
                    result = data.text[19:-1]  # drop the 'window._appState = ' prefix and the trailing ';'
                    result = json.loads(result)
                    name = result['poiInfo']['name']
                    addr = result['poiInfo']['addr']
                    phone = result['poiInfo']['phone']
                    aveprice = result['poiInfo']['avgPrice']
                    opentime = result['poiInfo']['openInfo']
                    opentime = opentime.replace('\n', ' ')
                    avescore = result['poiInfo']['avgScore']
                    marknum = result['poiInfo']['MarkNumbers']
                    lng = result['poiInfo']['lng']
                    lat = result['poiInfo']['lat']
                    info = [name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat]
                    print(url)
                    print(mesg, name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat)
            except Exception:
                pass  # script tag without text or with unexpected content: skip it
    except Exception as e:
        print('Error Thread:', n)  # print the number of the failing thread
        print(e)
        # On Windows a refused connection is reported as WinError 10061. The original compared a slice of the
        # localized error text, which only matches on a system using that exact language.
        if '10061' in str(e):
            print('Connection actively refused by the target machine', n)
            mark = 1  # type 1 error: the ip must be changed
        else:
            mark = 2  # type 2 error: fetch once more
    return mark, info  # return the flag and the shop information
The main module code is as follows:
import csv
import time
import threading
from get_cookie import get_cookie
from get_cookie import parse

lock = threading.Lock()  # one lock shared by all threads, protecting writes to meituan.csv


def crow(n, l):  # n is the thread number, l is the shared list of urls
    sym = 0  # counts consecutive failed fetches
    pc = get_cookie()  # obtain an ip and cookie
    m = 0  # number of shops handled by this thread
    now = time.time()
    while True:
        try:
            u = l.pop(0)
        except IndexError:  # the list is empty, nothing left to scrape
            print('&&&& thread %d finished' % n)
            break
        ll = len(l)
        m += 1
        ttt = time.time() - now
        mark, info = parse(u, pc, m, n, ll, ttt)
        if mark == 2:  # timeout or other error: wait, then fetch once more
            time.sleep(1.5)
            mark, info = parse(u, pc, m, n, ll, ttt)
            if mark != 0:
                sym += 1
        if mark == 1:  # connection refused: get a new ip and cookie, then retry
            pc = get_cookie()
            mark, info = parse(u, pc, m, n, ll, ttt)
            if mark != 0:
                sym += 1
        if mark == 0:  # fetched successfully
            sym = 0
            with lock:
                with open('meituan.csv', 'a', newline='', encoding='gb18030') as f:
                    write = csv.writer(f)
                    write.writerow(info)
        if sym > 2:  # three consecutive failures: switch ip and cookie
            sym = 0
            pc = get_cookie()


if __name__ == '__main__':
    url_list = []
    # mt_id.csv is presumably the deduplicated copy of meituan_id.csv produced in section 5
    with open('mt_id.csv', 'r', encoding='gb18030') as f:
        read = csv.reader(f)
        for line in read:
            d_list = ['', '']
            url = 'https://meishi.meituan.com/i/poi/' + str(line[2]) + '?ct_poi=' + str(line[3])
            d_list[0] = url
            d_list[1] = line[1]
            url_list.append(d_list)
    th_list = []
    for i in range(1, 6):
        t = threading.Thread(target=crow, args=(i, url_list,))
        print('***** starting thread %d...' % i)
        t.start()
        th_list.append(t)
        time.sleep(30)  # stagger the thread starts
    for t in th_list:
        t.join()
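The main module hands URLs to its threads from a shared list. As an alternative sketch (not what the article ran), the same worker pattern can be written with the standard library's queue.Queue and a single shared lock. The names worker and run_workers are hypothetical, and handle stands in for the real fetch-and-parse step.

```python
import queue
import threading


def worker(tasks, results, lock, handle):
    """Take items until the queue is empty, storing each result under the lock."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        value = handle(item)
        with lock:  # one writer at a time, as with the csv file
            results.append(value)


def run_workers(items, handle, thread_count=5):
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)
    results = []
    lock = threading.Lock()  # shared by every thread
    threads = [threading.Thread(target=worker, args=(tasks, results, lock, handle))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Squaring numbers stands in for scraping a page.
print(sorted(run_workers(range(10), lambda n: n * n)))
```

The queue takes care of handing each item to exactly one thread, so no separate emptiness check is needed before taking an item.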
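7. Results
With 5 threads the scrape should finish in about an hour. In total 21,828 records were scraped, so fewer than 50 were lost.
My skills are limited, so please point out any mistakes. And if anyone has a solution for scraping the desktop version, please let me know. Thanks.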