Python Crawler Part 4: Meituan Crawler (Scraping Shop Information)

Environment: Windows 7 + Python 3.6 + PyCharm 2017

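Goal: scrape shop information for the Shenzhen area from the mobile version of Meituan's food section, including shop name, category, address, phone number, average spend per person, opening hours, rating, number of reviews, and longitude/latitude. In the end about 21,000 records were scraped, and the program ran for roughly one hour. Tools: requests, selenium, Chrome.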

1. Meituan Desktop Site

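Open Meituan Shenzhen at https://sz.meituan.com/, click 美食 (Food), and press F12 to open the browser developer tools. Select Network and then XHR at the top right, then click any sub-area, for example 香蜜湖 (Xiangmihu). A request named getPoiList?cityName=XXXXXXXX shows up. Clicking it reveals that the request URL carries a parameter called _token. This token is presumably computed by some algorithm, so to imitate the browser's request you first need to know how the token is generated. It is most likely produced by JavaScript. With JS-side encryption there are generally two options: work out how the encryption works and reimplement it yourself, or call the site's own JS code directly. The parameter also seems to have been added only in the last few months. I searched online without finding a solution and could not make sense of the JS files myself, so I gave up on the desktop site. If anyone knows how to handle this token, please let me know. If you really need the token, selenium + Chrome should also be able to obtain it, and each token is presumably valid for some period of time.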

2. Meituan Mobile Site

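Since the desktop site is out of reach, another route is needed. Many websites now have a desktop version, a mobile version, and an app, and the mobile version usually has simpler anti-scraping measures. Open the Meituan mobile site at https://i.meituan.com/, press F12 to open the developer tools, and click the two boxes marked 1 in the figure below to emulate a mobile browser.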

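Then click Food to reach the page shown in the figure below, and look at the two requests on the right. The first request returns the basic page skeleton, such as the category information at the top, which is used later. The second request, list, is a dynamic request that fetches the shop data. Clicking it shows that it is a POST request whose parameters are those in the red box in the figure below. Clicking through a few listings makes the meaning of the parameters clear. Only four parameters change: areaId (area), cateId (food category), offset (paging), and uuid (an id issued by the site).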

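Simulate the browser by sending the POST request directly and change offset to page through the results. Each page holds 15 records, so offset increases by 15 per page. In testing, paging directly under the current food page reached at most 67 pages (1,005 records); after that either a CAPTCHA appeared or no data was returned. The shops therefore have to be scraped by category.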
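
The paging arithmetic described above can be sketched as follows. This is only an illustration: the helper name page_offsets is made up here, and the page size of 15 and the 67-page cap are simply the figures observed in testing, not documented limits.

```python
PAGE_SIZE = 15   # records returned per request, as observed above
MAX_PAGES = 67   # paging stopped returning data after this many pages in testing


def page_offsets(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the offset values needed to walk a listing of total_count records."""
    pages = -(-total_count // page_size)   # ceiling division
    pages = min(pages, max_pages)          # never page past the observed cap
    return [page * page_size for page in range(pages)]


print(page_offsets(40))            # [0, 15, 30]
print(len(page_offsets(46655)))    # 67
```

With the cap applied, the last usable offset is 990, which is why a listing with more than 1,005 shops cannot be read in full.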

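The information we need is on each shop's detail page. A detail-page URL is usually assembled from a few key parameters, and those parameters can be scraped from the list page above. Open a shop and look at the URL: there are two main parameters, the shop id (6268902) and ct_poi, both of which can be found in the data returned by the POST request above.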

https://meishi.meituan.com/i/poi/6268902?ct_poi=314286840956592200722254147016600281179_a6268902_c0_e11543712825375195158 
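
A minimal sketch of how such a URL can be assembled. The function name build_detail_url is hypothetical; the keys poiid and ctPoi are the field names used in the list-scraping code later in this article, and the sample record reuses the values from the URL above.

```python
def build_detail_url(poi_id, ct_poi):
    """Join the shop id and the ct_poi value into a detail-page URL."""
    return 'https://meishi.meituan.com/i/poi/{}?ct_poi={}'.format(poi_id, ct_poi)


# A record shaped like one entry of the list response (only the two fields needed here)
record = {
    'poiid': 6268902,
    'ctPoi': '314286840956592200722254147016600281179_a6268902_c0_e11543712825375195158',
}
print(build_detail_url(record['poiid'], record['ctPoi']))
```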

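Opening a detail page also triggers many requests, so we need to confirm which one returns the information we want (shop name, category, address, phone number, average spend, opening hours, rating, number of reviews, longitude/latitude). It turns out to be the very first request, the URL shown above.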

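Open the HTML returned by the first request and press Ctrl+F to search for the shop's phone number to locate the data. It sits inside a <script crossorigin='anonymous'> tag. There are several such tags, so they need to be told apart: when parsing with XPath, take each tag's text, slice the first 16 characters, and check whether they equal window._appState. What remains is ordinary JSON handling.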
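
The selection step can be sketched without any HTML parsing, working on the script bodies as plain strings. The helper name extract_app_state and the sample strings are invented for illustration, and the sketch assumes the script has the form `window._appState = {...};`.

```python
import json

PREFIX = 'window._appState'   # the 16-character marker mentioned above


def extract_app_state(script_texts):
    """Return the parsed JSON from the first script body that starts with PREFIX, else None."""
    for text in script_texts:
        if not text or not text.startswith(PREFIX):
            continue
        # Keep what follows the first '=' and drop the trailing ';'
        payload = text.split('=', 1)[1].strip().rstrip(';')
        return json.loads(payload)
    return None


# Made-up script bodies standing in for the <script> tags of a detail page
scripts = [
    'window.otherThing = 1;',
    'window._appState = {"poiInfo": {"name": "Demo Shop", "phone": "0755-00000000"}};',
]
state = extract_app_state(scripts)
print(state['poiInfo']['name'])   # Demo Shop
```

Splitting on the first '=' rather than slicing at a fixed position tolerates small changes in spacing around the assignment.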

3. Basic Approach

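That gives the basic approach. First scrape each shop's id and ct_poi from the list pages, build the detail-page URL, then request the detail page and extract the information. Because paging stops at 67 pages, the scrape has to be split by category. Here I split by area, which should keep the number of shops under each area below 67 pages (1,005 records). Although the site shows 46,655 shops for the whole city, the counts of the individual areas add up to about 24,000, and the "all categories" view also shows a total of about 24,000, so I believe the real total is about 24,000. The remaining problem is to get the areaId of every area.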
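
To see why the split is made at the sub-area level rather than by district, the advertised counts can be compared with the paging cap. The sketch below uses a handful of counts copied from the area listing that appears later in this article; the helper name over_cap is made up.

```python
PAGE_CAP = 67 * 15   # 1005 records reachable by paging


def over_cap(counts, cap=PAGE_CAP):
    """Return the names whose advertised shop count exceeds the paging cap."""
    return [name for name, count in counts.items() if count > cap]


# Whole-district totals versus a few of the larger sub-areas
district_counts = {'福田區': 4022, '羅湖區': 2191, '鹽田區': 407}
sub_area_counts = {'華強北': 572, '沙井': 824, '龍華': 958}

print(over_cap(district_counts))   # ['福田區', '羅湖區']
print(over_cap(sub_area_counts))   # []
```

Whole districts often exceed the cap, while the sub-areas sampled here all stay under it.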

4. Scraping the Area IDs

Open the food page: https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1

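View the HTML source: the id of every area is again inside a <script crossorigin='anonymous'> tag. The browser does not display the data in full, so you can save the HTML locally and open it in an editor. This is also a matter of processing JSON. The only special case is the 南澳新區 (Nan'ao New District) data: it has no sub-areas, so I added it directly under 坪山區 (Pingshan District).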
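
A small sketch of the JSON handling, assuming the structure shown in the area data later in this article (an areaObj mapping whose lists start with a whole-district summary entry). The helper names are invented, and the sample is a cut-down copy of the 鹽田區 entries, including the one whose count is an empty string.

```python
def flatten_areas(area_obj):
    """Collect every sub-area entry, skipping each district's leading summary entry."""
    areas = []
    for entries in area_obj.values():
        areas.extend(entries[1:])   # entries[0] is the whole-district total
    return areas


def total_shops(areas):
    """Sum the advertised counts; an empty-string count is treated as zero."""
    return sum(area['count'] or 0 for area in areas)


sample = {'areaObj': {'31': [
    {'id': 31, 'name': '全部', 'regionName': '鹽田區', 'count': 407},
    {'id': 754, 'name': '大小梅沙', 'regionName': '大小梅沙', 'count': 36},
    {'id': 755, 'name': '沙頭角', 'regionName': '沙頭角', 'count': 118},
    {'id': 38055, 'name': '溪涌', 'regionName': '溪涌', 'count': ''},
]}}

areas = flatten_areas(sample['areaObj'])
print([area['id'] for area in areas], total_shops(areas))   # [754, 755, 38055] 154
```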

5. Scraping Shop IDs and ct_poi Parameters

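With the id of each area, the POST request for the shop list can be built directly. The request needs a cookie, and a single cookie is enough to finish the whole scrape. The response is JSON containing 15 shops; extract each shop's id and ct_poi and save them to a local CSV file. After scraping, de-duplicate the records, treating rows with the same shop id as duplicates. The code also saves each shop's category, cateName, because the detail page does not seem to include it. The code follows; it should run once the cookie is replaced. After de-duplication, 21,872 records were collected.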

#coding=utf-8
import csv
import time
import requests
import json


#Scrape shop id, ctPoi and cateName for one area; the argument is the area id
def crow_id(areaid):
    id_list=[]
    url='https://meishi.meituan.com/i/api/channel/deal/list'
    head={'Host': 'meishi.meituan.com',
          'Accept': 'application/json',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
          'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
          'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
           'Cookie':'XXXXXXXXXXXXXX'
                    }
    p = {'https': 'https://27.157.76.75:4275'}
    data={"uuid":"09dbb48e-4aed-4683-9ce5-c14b16ae7539","version":"8.3.3","platform":3,"app":"","partner":126,"riskLevel":1,"optimusCode":10,"originUrl":"http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1","offset":0,"limit":15,"cateId":1,"lineId":0,"stationId":0,"areaId":areaid,"sort":"default","deal_attr_23":"","deal_attr_24":"","deal_attr_25":"","poi_attr_20043":"","poi_attr_20033":""}
    r=requests.post(url,headers=head,data=data,proxies=p)
    result=json.loads(r.text)
    totalcount=result['data']['poiList']['totalCount']  #total number of shops in this area, used to work out how many pages to fetch
    datas=result['data']['poiList']['poiInfos']
    print(len(datas),totalcount)
    for d in datas:
        d_list=['','','','']
        d_list[0]=d['name']
        d_list[1] = d['cateName']
        d_list[2] = d['poiid']
        d_list[3] = d['ctPoi']
        id_list.append(d_list)
    print('Page:1')
    #save the first page to a local csv file
    with open('meituan_id.csv','a', newline='',encoding='gb18030')as f:
        write=csv.writer(f)
        for i in id_list:
            write.writerow(i)

    #fetch page 2 through the last page
    offset=0
    if totalcount>15:
        totalcount-=15
        while offset<totalcount:
            id_list = []
            offset+=15
            m=offset//15+1  #integer division keeps the page number an int
            print('Page:%d'%m)
            #build the POST parameters; changing offset turns the page
            data2 = {"uuid": "09dbb48e-4aed-4683-9ce5-c14b16ae7539", "version": "8.3.3", "platform": 3, "app": "",
                    "partner": 126, "riskLevel": 1, "optimusCode": 10,
                    "originUrl": "http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1",
                    "offset": offset, "limit": 15, "cateId": 1, "lineId": 0, "stationId": 0, "areaId": areaid, "sort": "default",
                    "deal_attr_23": "", "deal_attr_24": "", "deal_attr_25": "", "poi_attr_20043": "", "poi_attr_20033": ""}
            try:
                r = requests.post(url, headers=head, data=data2,proxies=p)
                print(r.text)
                result = json.loads(r.text)
                datas = result['data']['poiList']['poiInfos']
                print(len(datas))
                for d in datas:
                    d_list = ['', '', '', '']
                    d_list[0] = d['name']
                    d_list[1] = d['cateName']
                    d_list[2] = d['poiid']
                    d_list[3] = d['ctPoi']
                    id_list.append(d_list)
                #append this page to the local csv file
                with open('meituan_id.csv', 'a', newline='',encoding='gb18030')as f:
                    write = csv.writer(f)
                    for i in id_list:
                        write.writerow(i)
            except Exception as e:
                print(e)


if __name__=='__main__':
    #area data copied straight from the page's html; the 南澳新區 entry needs special handling because it has no sub-areas
    a = {"areaObj": {"28": [{"id": 28, "name": "全部", "regionName": "福田區", "count": 4022},
                            {"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105},
                            {"id": 744, "name": "梅林", "regionName": "梅林", "count": 421},
                            {"id": 1055, "name": "上沙/下沙", "regionName": "上沙/下沙", "count": 291},
                            {"id": 2008, "name": "華強南", "regionName": "華強南", "count": 263},
                            {"id": 742, "name": "八卦嶺/園嶺", "regionName": "八卦嶺/園嶺", "count": 217},
                            {"id": 741, "name": "華強北", "regionName": "華強北", "count": 572},
                            {"id": 743, "name": "皇崗/水圍", "regionName": "皇崗/水圍", "count": 136},
                            {"id": 756, "name": "新城市廣場", "regionName": "新城市廣場", "count": 140},
                            {"id": 6595, "name": "車公廟", "regionName": "車公廟", "count": 305},
                            {"id": 6596, "name": "景田", "regionName": "景田", "count": 144},
                            {"id": 6597, "name": "新洲/石廈", "regionName": "新洲/石廈", "count": 374},
                            {"id": 6974, "name": "竹子林", "regionName": "竹子林", "count": 107},
                            {"id": 6975, "name": "市民中心", "regionName": "市民中心", "count": 39},
                            {"id": 7993, "name": "會展中心", "regionName": "會展中心", "count": 461},
                            {"id": 7994, "name": "崗廈", "regionName": "崗廈", "count": 110},
                            {"id": 7996, "name": "福田保稅區", "regionName": "福田保稅區", "count": 29}],
                     "29": [{"id": 29, "name": "全部", "regionName": "羅湖區", "count": 2191},
                            {"id": 6976, "name": "國貿", "regionName": "國貿", "count": 232},
                            {"id": 758, "name": "蓮塘", "regionName": "蓮塘", "count": 125},
                            {"id": 2009, "name": "筍崗", "regionName": "筍崗", "count": 159},
                            {"id": 748, "name": "翠竹路沿線", "regionName": "翠竹路沿線", "count": 42},
                            {"id": 745, "name": "東門", "regionName": "東門", "count": 484},
                            {"id": 746, "name": "寶安南路沿線", "regionName": "寶安南路沿線", "count": 67},
                            {"id": 757, "name": "火車站", "regionName": "火車站", "count": 96},
                            {"id": 6598, "name": "萬象城", "regionName": "萬象城", "count": 127},
                            {"id": 6599, "name": "喜薈城/水庫", "regionName": "喜薈城/水庫", "count": 99},
                            {"id": 7659, "name": "地王大廈", "regionName": "地王大廈", "count": 85},
                            {"id": 8469, "name": "黃貝嶺", "regionName": "黃貝嶺", "count": 136},
                            {"id": 8470, "name": "春風萬佳/文錦渡", "regionName": "春風萬佳/文錦渡", "count": 19},
                            {"id": 8471, "name": "布心/太白路", "regionName": "布心/太白路", "count": 154},
                            {"id": 8790, "name": "田貝/水貝", "regionName": "田貝/水貝", "count": 85},
                            {"id": 8794, "name": "銀湖/泥崗", "regionName": "銀湖/泥崗", "count": 37},
                            {"id": 8795, "name": "新秀/羅芳", "regionName": "新秀/羅芳", "count": 33},
                            {"id": 13080, "name": "梧桐山", "regionName": "梧桐山", "count": 34},
                            {"id": 14095, "name": "KK mall", "regionName": "KK mall", "count": 74}],
                     "30": [{"id": 30, "name": "全部", "regionName": "南山區", "count": 3905},
                            {"id": 751, "name": "南頭", "regionName": "南頭", "count": 325},
                            {"id": 750, "name": "華僑城", "regionName": "華僑城", "count": 126},
                            {"id": 749, "name": "蛇口", "regionName": "蛇口", "count": 9},
                            {"id": 1057, "name": "南油", "regionName": "南油", "count": 218},
                            {"id": 1058, "name": "科技園", "regionName": "科技園", "count": 460},
                            {"id": 1059, "name": "西麗", "regionName": "西麗", "count": 586},
                            {"id": 4811, "name": "南山中心區", "regionName": "南山中心區", "count": 635},
                            {"id": 6591, "name": "海岸城/保利", "regionName": "海岸城/保利", "count": 158},
                            {"id": 6592, "name": "前海", "regionName": "前海", "count": 32},
                            {"id": 6593, "name": "白石洲", "regionName": "白石洲", "count": 190},
                            {"id": 6594, "name": "歡樂海岸", "regionName": "歡樂海岸", "count": 22},
                            {"id": 7597, "name": "太古城", "regionName": "太古城", "count": 57},
                            {"id": 7599, "name": "花園城", "regionName": "花園城", "count": 42},
                            {"id": 13109, "name": "海上世界", "regionName": "海上世界", "count": 225},
                            {"id": 23117, "name": "世界之窗", "regionName": "世界之窗", "count": 97},
                            {"id": 25152, "name": "南山京基百納", "regionName": "南山京基百納", "count": 22},
                            {"id": 36635, "name": "深圳灣", "regionName": "深圳灣", "count": 17}],
                     "31": [{"id": 31, "name": "全部", "regionName": "鹽田區", "count": 407},
                            {"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36},
                            {"id": 755, "name": "沙頭角", "regionName": "沙頭角", "count": 118},
                            {"id": 8789, "name": "東部華僑城", "regionName": "東部華僑城", "count": 11},
                            {"id": 8796, "name": "鹽田海鮮食街", "regionName": "鹽田海鮮食街", "count": 22},
                            {"id": 15349, "name": "壹海城", "regionName": "壹海城", "count": 51},
                            {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
                     "32": [{"id": 32, "name": "全部", "regionName": "寶安區", "count": 6071},
                            {"id": 6587, "name": "西鄉", "regionName": "西鄉", "count": 15},
                            {"id": 6586, "name": "新安", "regionName": "新安", "count": 413},
                            {"id": 6585, "name": "石巖", "regionName": "石巖", "count": 466},
                            {"id": 752, "name": "寶安中心區", "regionName": "寶安中心區", "count": 458},
                            {"id": 4653, "name": "港隆城", "regionName": "港隆城", "count": 137},
                            {"id": 6588, "name": "沙井", "regionName": "沙井", "count": 824},
                            {"id": 6589, "name": "福永", "regionName": "福永", "count": 631},
                            {"id": 7684, "name": "鬆崗", "regionName": "鬆崗", "count": 435},
                            {"id": 7685, "name": "公明", "regionName": "公明", "count": 433},
                            {"id": 7719, "name": "海雅繽紛城", "regionName": "海雅繽紛城", "count": 125},
                            {"id": 7735, "name": "固戍", "regionName": "固戍", "count": 237},
                            {"id": 8006, "name": "桃源居", "regionName": "桃源居", "count": 25},
                            {"id": 14404, "name": "時代城", "regionName": "時代城", "count": 2},
                            {"id": 17088, "name": "羅田/燕川", "regionName": "羅田/燕川", "count": 45},
                            {"id": 17089, "name": "西田", "regionName": "西田", "count": 29},
                            {"id": 17091, "name": "圳美", "regionName": "圳美", "count": 32},
                            {"id": 17092, "name": "田寮/長圳", "regionName": "田寮/長圳", "count": 3},
                            {"id": 23524, "name": "沙井京基百納", "regionName": "沙井京基百納", "count": 98},
                            {"id": 27275, "name": "寶立方", "regionName": "寶立方", "count": 125},
                            {"id": 36634, "name": "寶安機場", "regionName": "寶安機場", "count": 244},
                            {"id": 37084, "name": "光明新區", "regionName": "光明新區", "count": 1}],
                     "33": [{"id": 33, "name": "全部", "regionName": "龍崗區", "count": 5193},
                            {"id": 753, "name": "羅崗/求水山", "regionName": "羅崗/求水山", "count": 145},
                            {"id": 6600, "name": "五和/民營市場", "regionName": "五和/民營市場", "count": 250},
                            {"id": 6601, "name": "平湖", "regionName": "平湖", "count": 356},
                            {"id": 7656, "name": "橫崗", "regionName": "橫崗", "count": 568},
                            {"id": 7658, "name": "南澳", "regionName": "南澳", "count": 32},
                            {"id": 7663, "name": "南聯", "regionName": "南聯", "count": 311},
                            {"id": 7664, "name": "坪地", "regionName": "坪地", "count": 131},
                            {"id": 8472, "name": "大運", "regionName": "大運", "count": 186},
                            {"id": 9013, "name": "李朗聚星商城", "regionName": "李朗聚星商城", "count": 63},
                            {"id": 13335, "name": "較場尾/大鵬所城", "regionName": "較場尾/大鵬所城", "count": 152},
                            {"id": 13358, "name": "水頭", "regionName": "水頭", "count": 20},
                            {"id": 13359, "name": "東涌", "regionName": "東涌", "count": 2},
                            {"id": 13361, "name": "萬科廣場/世貿", "regionName": "萬科廣場/世貿", "count": 107},
                            {"id": 13412, "name": "華南城/奧特萊斯", "regionName": "華南城/奧特萊斯", "count": 191},
                            {"id": 18069, "name": "大芬/南嶺", "regionName": "大芬/南嶺", "count": 359},
                            {"id": 18228, "name": "雙龍", "regionName": "雙龍", "count": 316},
                            {"id": 19456, "name": "慢城/三聯", "regionName": "慢城/三聯", "count": 111},
                            {"id": 19457, "name": "布吉街/東站/天虹", "regionName": "布吉街/東站/天虹", "count": 404},
                            {"id": 26297, "name": "天虹/阪田/楊美", "regionName": "天虹/阪田/楊美", "count": 344},
                            {"id": 26298, "name": "崗頭/萬科/雪象", "regionName": "崗頭/萬科/雪象", "count": 199},
                            {"id": 35919, "name": "華爲阪田基地", "regionName": "華爲阪田基地", "count": 9},
                            {"id": 36519, "name": "楊梅坑/桔釣沙", "regionName": "楊梅坑/桔釣沙", "count": 39},
                            {"id": 36520, "name": "葵涌", "regionName": "葵涌", "count": 37},
                            {"id": 36530, "name": "官湖", "regionName": "官湖", "count": 9},
                            {"id": 36531, "name": "西涌", "regionName": "西涌", "count": 49},
                            {"id": 36636, "name": "坪山高鐵站", "regionName": "坪山高鐵站", "count": 41},
                            {"id": 37501, "name": "龍崗中心城", "regionName": "龍崗中心城", "count": 365}],
                     "9553": [{"id": 9553, "name": "全部", "regionName": "龍華區", "count": 3080},
                              {"id": 1061, "name": "龍華", "regionName": "龍華", "count": 958},
                              {"id": 6584, "name": "民治", "regionName": "民治", "count": 164},
                              {"id": 7721, "name": "觀瀾", "regionName": "觀瀾", "count": 433},
                              {"id": 7722, "name": "大浪", "regionName": "大浪", "count": 398},
                              {"id": 9326, "name": "梅林關", "regionName": "梅林關", "count": 125},
                              {"id": 9327, "name": "錦繡江南", "regionName": "錦繡江南", "count": 33},
                              {"id": 36633, "name": "深圳北站", "regionName": "深圳北站", "count": 190},
                              {"id": 37723, "name": "龍華新區", "regionName": "龍華新區", "count": 14}],
                     "23420": [{"id": 23420, "name": "全部", "regionName": "坪山區", "count": 393},
                               {"id": 6602, "name": "坪山", "regionName": "坪山", "count": 232},
                               {"id": 23429, "name": "坑梓/竹坑", "regionName": "坑梓/竹坑", "count": 128},
                               {"id": 9535, "name": "南澳大鵬新區", "regionName": "南澳大鵬新區", "count": 91}]

                     }}

    datas = a['areaObj']
    b = datas.values()
    area_list=[]
    for data in b:
        for d in data[1:]:
            area_list.append(d)  #collect every sub-area in a list; each element is a dict
    l=0
    old=time.time()
    for i in area_list:
        l+=1
        print('Scraping area %d:'%l,i['regionName'], 'shop count:',i['count'])
        try:
            crow_id(i['id'])
            now=time.time()-old
            print(i['name'],'done!','elapsed: %d s'%now)
        except Exception as e:
            print(e)

   

6. Scraping the Shop Detail Pages

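The detail-page URL can now be built, so all that remains is to request it. It is a simple GET request, but it must carry a complete cookie; with a faulty cookie a CAPTCHA appears very quickly. One cookie typically lasts about 1,000 requests before a CAPTCHA shows up, though sometimes it happens after only a few hundred. The Session object in requests did not seem to obtain the complete cookie, so this article uses selenium + Chrome: visit Meituan through a proxy IP, collect the cookie, and return the cookie together with the IP for use in requests calls. In testing, once a CAPTCHA appeared, changing only the IP while keeping the same cookie was enough to continue scraping.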

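The code has two parts: the main program, and a get_cookie file that handles obtaining the cookie and IP and also contains the detail-page parsing function. The cookie/IP function first fetches an IP (I bought proxies), then visits the Meituan Shenzhen home page and sleeps for a few seconds. This is critical: the page has to load completely, otherwise some cookies will be missing. It then visits the food page. Proxy quality varies, so it is best to test an IP before using it. Here the time needed to load the food page is used as the test: more than 3 s is rejected and a new IP is fetched; under 3 s is accepted. Then the cookies are read and checked for completeness. The __utma, __utmc, and __utmz entries are sometimes missing, and without them a CAPTCHA appears quickly; a complete set normally has 18 cookies. The page-parsing function is also simple: it returns a flag, mark, and the shop information, info. The flag indicates whether the scrape succeeded.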
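
The completeness check and the header string can be sketched generically. The input is assumed to be a list of dicts with name and value keys, which is the shape selenium's get_cookies() returns; the function names are made up. Unlike the article's code, which builds the string from a fixed list of cookie names and hard-codes a few values, this sketch joins whatever cookies it is given.

```python
REQUIRED = ('__utma', '__utmc', '__utmz')   # the entries reported above as sometimes missing


def cookies_complete(cookies, required=REQUIRED, min_count=18):
    """True when enough cookies are present and none of the required names is missing."""
    names = {c['name'] for c in cookies}
    return len(cookies) >= min_count and all(name in names for name in required)


def cookie_header(cookies):
    """Join cookies into a single string suitable for a Cookie request header."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


# Made-up values, far fewer than a real session would hold
sample = [{'name': 'uuid', 'value': 'abc'}, {'name': '__utma', 'value': '1.2'}]
print(cookie_header(sample))      # uuid=abc; __utma=1.2
print(cookies_complete(sample))   # False
```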

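The main function uses multiple threads and is fairly simple: get an IP and cookie first, then start scraping. The thing to watch is exception handling during the scrape. There are two main kinds of exception. One is a timeout: sleep for a second or so and try once more; if that also fails, the record is counted as failed, and after three consecutive failed records a new IP and cookie are fetched. The other is the error "No connection could be made because the target machine actively refused it", which means the requests were too frequent and the server has flagged them; in that case a new IP and cookie are needed as well.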
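
The retry rules above can be written as a small function that is independent of the network code. This is a simplified sketch rather than the article's implementation: fetch and refresh are hypothetical callables standing in for the page request and for obtaining a new IP and cookie, and the flag values mirror the mark values used in the article's code.

```python
import time

OK, REFUSED, TIMEOUT = 0, 1, 2   # same meaning as the mark values in this article


def fetch_with_policy(fetch, refresh, state, pause=1.0):
    """Run one fetch under the retry rules; state['fails'] counts consecutive failures."""
    mark, info = fetch()
    if mark == TIMEOUT:
        time.sleep(pause)
        mark, info = fetch()      # one retry after a short pause
    elif mark == REFUSED:
        refresh()                 # new proxy IP and cookie straight away
        mark, info = fetch()
    if mark == OK:
        state['fails'] = 0
        return info
    state['fails'] += 1
    if state['fails'] >= 3:       # three failed records in a row
        refresh()
        state['fails'] = 0
    return None


# Stubbed demo: the first attempt times out, the retry succeeds
calls = []
results = iter([(TIMEOUT, []), (OK, ['Demo Shop'])])
state = {'fails': 0}
info = fetch_with_policy(lambda: next(results), lambda: calls.append('refresh'), state, pause=0)
print(info, state, calls)   # ['Demo Shop'] {'fails': 0} []
```

Passing the two operations in as callables keeps the policy testable without a proxy or a live site.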

The get_cookie module code is as follows:

 

from selenium import webdriver
import requests
import time
import json
from lxml import etree
#Return a proxy IP and the matching cookie; the cookie is returned as a string. The IP is speed-tested first
def get_cookie():
    mark=0
    while mark==0:
        #URL of the purchased proxy service that hands out IPs
        p_url = 'XXXXXXXXXXXXX'
        r = requests.get(p_url)
        html = json.loads(r.text)
        a = html['data'][0]['ip']
        b = html['data'][0]['port']
        val = '--proxy-server=http://' + str(a) + ':' + str(b)
        val2 = 'https://' + str(a) + ':' + str(b)
        p = {'https': val2}
        print('Got IP:',p)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(val)
        driver = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe',chrome_options=chrome_options)  #raw string so the backslashes are taken literally
        driver.set_page_load_timeout(8) #page-load timeout in seconds
        driver.set_script_timeout(8)
        url='https://i.meituan.com/shenzhen/'   #Meituan Shenzhen home page
        url2='https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1'#food page
        try:
            driver.get(url)
            time.sleep(2.5)
            c1=driver.get_cookies()
            now = time.time()
            driver.get(url2)
            tt=time.time()-now
            print(tt)
            time.sleep(0.5)
            #proxy speed test: reject the IP if the page takes more than 3 s to open
            if tt < 3:
                c=driver.get_cookies()
                driver.quit()
                print('*******************')
                print(len(c1),len(c))
                #check that the cookie set is complete; the normal count is 18
                if len(c)>17:
                    # print(c)
                    x={}
                    for line in c:
                        x[line['name']]=line['value']
                    #join the cookies into one string for the request header; the string is long, so it is built in two parts
                    co1='__mta='+x['__mta']+'; client-id='+x['client-id']+'; IJSESSIONID='+x['IJSESSIONID']+'; iuuid='+x['iuuid']+'; ci=30; cityname=%E6%B7%B1%E5%9C%B3; latlng=; webp=1; _lxsdk_cuid='+x['_lxsdk_cuid']+'; _lxsdk='+x['_lxsdk']
                    co2='; __utma='+x['__utma']+'; __utmc='+x['__utmc']+'; __utmz='+x['__utmz']+'; __utmb='+x['__utmb']+'; i_extend='+x['i_extend']+'; uuid='+x['uuid']+'; _hc.v='+x['_hc.v']+'; _lxsdk_s='+x['_lxsdk_s']
                    co=co1+co2
                    mark=1  #set only after every expected cookie was found, so a KeyError above leads to another attempt instead of returning None
                    print(co)
                    return(p,co)
                else:
                    print('Cookies missing, count:',len(c))
            else:
                print('Too slow')
                driver.quit()
                time.sleep(3)
        except:
            driver.quit()
            pass


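#Parse a shop detail page; returns the shop information info and a flag mark
#u holds the url and shop category, pc holds the proxy and cookie, m is the number scraped so far, n is the thread number, ll is the number of shops left, ttt is the total time this thread has been running
def parse(u,pc,m,n,ll,ttt):
    mesg='Thread:'+str(n)+' No:'+str(m)+' Time:'+str(ttt)+' left:'+str(ll)#progress information for the current thread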
    url = u[0]
    cate = u[1]
    p=pc[0]
    cookie=pc[1]
    mark = 0 #flag: 0 means success; 1 and 2 are the two error types
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie':cookie
            }
    info = [] #holds the shop information
    try:
        r = requests.get(url, headers=head, timeout=3, proxies=p)
        r.encoding = 'utf-8'
        html = etree.HTML(r.text)
        datas = html.xpath('body/script[@crossorigin="anonymous"]')
        for data in datas:
            try:
                strs = data.text[:16]
                if strs == 'window._appState':
                    result = data.text[19:-1]
                    result = json.loads(result)
                    name = result['poiInfo']['name']
                    addr = result['poiInfo']['addr']
                    phone = result['poiInfo']['phone']
                    aveprice = result['poiInfo']['avgPrice']
                    opentime = result['poiInfo']['openInfo']
                    opentime = opentime.replace('\n', ' ')
                    avescore = result['poiInfo']['avgScore']
                    marknum = result['poiInfo']['MarkNumbers']
                    lng = result['poiInfo']['lng']
                    lat = result['poiInfo']['lat']
                    info = [name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat]
                    print(url)
                    print(mesg,name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat)
            except:
                pass
    except Exception  as e:
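        print('Error  Thread:',n) #print the number of the thread that failed
        print(e)
        #WinError 10061 is the Windows "target machine actively refused" error; matching the code does not depend on the OS display language
        if 'WinError 10061' in str(e) or 'Connection refused' in str(e):
            print('Connection actively refused by the target machine',n)
            mark=1   #type 1 error: a new IP is needed
        else:
            mark=2   #type 2 error: try the request once more
    return(mark,info) #return the flag and the shop information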


The main program code is as follows:

import csv
import time
import threading
from get_cookie import get_cookie 
from get_cookie import parse

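lock=threading.Lock() #shared by all threads; a lock created inside crow() would be private to each thread and protect nothing

def crow(n,l): #n is the thread number; l is the shared list of [url, category] entries
    sym=0 #counts consecutive failed records
    pc=get_cookie()  #get a proxy IP and cookie
    m=0 #number of records attempted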
    now=time.time()
    while True:
        if len(l)>0:
            u=l.pop(0)
            ll=len(l)
            m+=1
            ttt=time.time()-now
            result=parse(u,pc,m,n,ll,ttt)
            mark=result[0]
            info=result[1]
            if mark==2:
                time.sleep(1.5)
                result = parse(u, pc,m,n,ll,ttt)
                mark = result[0]
                info = result[1]
                if mark !=0:
                    sym+=1
            if mark==1:
                pc=get_cookie()
                result = parse(u, pc,m,n,ll,ttt)
                mark = result[0]
                info = result[1]
                if mark !=0:
                    sym+=1
            if mark==0: #scrape succeeded
                sym=0
                with lock: #released even if the write raises
                    with open('meituan.csv', 'a', newline='', encoding='gb18030')as f:
                        write = csv.writer(f)
                        write.writerow(info)
            if sym>2: #three consecutive failures: get a new IP and cookie
                sym=0
                pc=get_cookie()
        else:
            print('&&&&Thread %d finished'%n)
            break


if __name__=='__main__':
    url_list=[]
    #mt_id.csv is presumably the de-duplicated copy of the meituan_id.csv written by the previous script
    with open('mt_id.csv','r',encoding='gb18030')as f:
        read=csv.reader(f)
        for line in read:
            d_list=['','']
            url='https://meishi.meituan.com/i/poi/'+str(line[2])+'?ct_poi='+str(line[3])
            d_list[0]=url
            d_list[1]=line[1]
            url_list.append(d_list)
    th_list=[]
    for i in range(1,6):
        t=threading.Thread(target=crow,args=(i,url_list,))
        print('*****Starting thread %d...'%i)
        t.start()
        th_list.append(t)
        time.sleep(30)
    for t in th_list:
        t.join()
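
The main program hands out work by checking the length of a shared list and then calling pop(0), which leaves a small window in which two threads can race for the last item. One alternative, shown here as a sketch and not as the article's method, is the standard library's queue.Queue, whose get_nowait() is atomic. The function name run_workers and the squaring task are invented for the demo.

```python
import queue
import threading


def run_workers(items, handle, workers=5):
    """Process items with several threads; results come back in completion order."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)
    results = []
    results_lock = threading.Lock()   # one lock shared by every worker

    def worker():
        while True:
            try:
                item = tasks.get_nowait()   # atomic: no check-then-pop race
            except queue.Empty:
                return
            out = handle(item)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(run_workers(range(10), lambda x: x * x)))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

In the scraper, handle would wrap the parse-and-retry logic, and the lock would guard the CSV write instead of a results list.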

7. Results

With 5 threads the scrape should finish in about an hour. In the end 21,828 records were collected, fewer than 50 short of the 21,872 shop ids.

My skills are limited, so corrections are welcome. And if anyone has a way to scrape the desktop version, please let me know. Thanks.
