A Scrapy-Based Crawler Solution

## 1. Background

I ran into a crawling requirement at work, and since I had never done this kind of job before, I researched quite a lot online. But the information on the internet is scattered and of very mixed reliability, which was a real inconvenience, so after finishing the task I wanted to write a reasonably complete, beginner-friendly article, in the hope that it helps readers.

Since I have been using Python fairly fluently lately, I decided to do the task in Python. After some research I found that the Scrapy framework has a large user base and thorough documentation, so I chose it. (In truth, Scrapy is only a very thin layer of wrapping; for ordinary crawling tasks, the requests library plus the BeautifulSoup class from bs4 can solve the problem completely.)

First, a brief description of what a crawler is. A crawler starts from one or more URLs and fetches the content of each URL's page by some means (for example, the functions in the requests library); the content is usually HTML. From that content it extracts the information worth recording as well as further URLs to crawl (for example, with the BeautifulSoup class mentioned above). It then applies the same procedure to the newly found URLs, until every URL has been crawled, and the crawler program ends.
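To make that fetch-extract-follow loop concrete, here is a minimal sketch of it using requests and bs4, independent of Scrapy. The start URL is a placeholder, and a real crawler would add domain restrictions and rate limiting:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

queue = deque(['https://example.com/'])  # placeholder start URL
seen = set(queue)

while queue:
    url = queue.popleft()
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    # "Record the information": here we just print the page title.
    print(url, soup.title.string if soup.title else '')
    # "Extract further URLs to crawl": enqueue links we have not seen yet.
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])
        if link not in seen:
            seen.add(link)
            queue.append(link)
```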
Scrapy's official site [1], the official English documentation [2], and a third-party Chinese translation (rather sparse and outdated) [3] are listed below for interested readers to consult. They are not the focus of this article, so Scrapy itself is not introduced in any more depth here.

*[1]: https://scrapy.org/*
*[2]: https://docs.scrapy.org/en/latest/*
*[3]: https://scrapy-chs.readthedocs.io/zh_CN/0.24/index.html*

## 2. How to Use Scrapy

**Install the Scrapy library**

```text
pip install scrapy
```

**Create a new crawler project**

```text
scrapy startproject your_project_name
```

This command creates a folder named your_project_name in the current directory, laid out as follows:

```text
your_project_name
| scrapy.cfg
|----your_project_name
| | __init__.py
| | items.py
| | middlewares.py
| | pipelines.py
| | settings.py
| |----spiders
| | | __init__.py
```

Here scrapy.cfg is the configuration file for the whole project, and the spiders directory holds the crawling logic; since the project has just been created and no spider code has been written yet, that directory is empty.

**Generate a spider**

In the newly created project directory, run:

```text
scrapy genspider example www.qq.com
```

where example is the spider's name and www.qq.com is the first URL the spider will crawl.

After this command, Scrapy generates a file named example.py under the spiders directory, containing a very basic spider template. All that remains is to fill this file with the actual crawling logic and then run the spider. The generated example.py contains:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['qq.com']
    start_urls = ['http://qq.com/']

    def parse(self, response):
        pass
```

ExampleSpider is the spider class just generated. name is the spider's name; allowed_domains restricts which domains the spider will crawl; start_urls holds the initial URLs, populated from the URL entered when the spider was created; and parse is the default parsing callback.

**Run the spider**

In the project directory, run:

```text
scrapy crawl example
```

where example is the name of the spider to run. The framework then crawls pages starting from the initial URLs defined in the example spider, using its parsing callback.

**Debug the spider**

While writing the code, because different sites organize their HTML differently, you need an interactive way to poke at pages and adjust the code accordingly. In many cases Chrome's F12 inspection mode is enough to examine a page's HTML, but sometimes the HTML your code receives is not the same as what the browser shows, which makes interactive access indispensable. (You can also debug by running the full crawl each time, but that is rather inefficient.)

To access a page interactively, run this in the project directory:

```text
scrapy shell www.qq.com
```

The experience is much like typing python on the command line and entering Python's interactive interpreter.
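For a rough illustration of what such a session looks like (the second URL and the outputs here are placeholders, not real output from www.qq.com), the shell drops you into an interpreter with a response object and helpers such as fetch() and view() already defined:

```text
$ scrapy shell www.qq.com
...
>>> response.status
200
>>> response.xpath('//title/text()').get()   # try selectors interactively
'...'
>>> fetch('https://www.qq.com/other/page')    # load a different URL in place
>>> view(response)                            # open the current response in a browser
```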
**Flesh out the parsing callback**

Fleshing out the parsing callback is the core step of writing the crawler. Its initial stub is:

```python
def parse(self, response):
    pass
```

It takes a single argument, response, which is the result of fetching a URL and contains that page's HTML source. (It plays a similar role to requests.Response but is Scrapy's own response class, with a richer set of methods.) The job of parse is to extract the valuable information out of the jumble of HTML inside response.

Scrapy offers two ways of querying the HTML source: css and xpath. The css selector method is specific to Scrapy's own API and its exact usage is only documented in the Scrapy docs, so I don't recommend it; xpath is a general-purpose query language (supported by other tools as well, lxml for example) whose syntax has far more material available online. A full treatment of XPath would take too long, so it is not covered here; if you need it, search engines have plenty of reference material.
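As a sketch of what a filled-in callback might look like, the following assumes a hypothetical page layout in which each article sits in a `<div class="article">` and pagination uses an `<a class="next">` link; the selectors would have to be adapted to the real page:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://qq.com/']

    def parse(self, response):
        # Hypothetical layout: one <div class="article"> per article entry.
        for article in response.xpath('//div[@class="article"]'):
            yield {
                'title': article.xpath('.//h2/a/text()').get(),
                'url': article.xpath('.//h2/a/@href').get(),
            }
        # Hypothetical "next page" link; crawl it with the same callback.
        next_url = response.xpath('//a[@class="next"]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
```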
If during parsing you come across further URLs that themselves need parsing, you can simply call:

```python
yield scrapy.Request(url_str, callback=self.parse)
```

where url_str is the string of the URL to parse and self.parse is the parsing callback. The default callback is used here, but a custom one works just as well (a custom callback must take and return the same kinds of values as the default one).

Worth noting: besides the two parameters above, scrapy.Request can also carry values across requests through its meta argument, and those values can be read back via response.meta in the callback.
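A small sketch of the meta mechanism, with a hypothetical parse_detail callback that remembers which list page each detail page came from:

```python
import scrapy


class MetaExampleSpider(scrapy.Spider):
    name = 'meta_example'  # hypothetical spider, for illustration only
    start_urls = ['http://qq.com/']

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            # Attach the originating list page to the scheduled request.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail,
                                 meta={'list_page': response.url})

    def parse_detail(self, response):
        # The dict passed as meta comes back attached to the response.
        yield {'detail_url': response.url,
               'from_list_page': response.meta['list_page']}
```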
**A small tip**

By default, Scrapy obeys the target site's robots.txt (the file that declares what may and may not be crawled), yet what we want to crawl is often exactly what that file disallows. Changing ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False in settings.py avoids this situation.

## 3. Common Issues

### Dynamic pages are not parsed correctly

The simple steps above can only parse static pages. Pages that need dynamic loading (pages with JavaScript code, for example) are not parsed correctly, because the HTML source inside response is the page before dynamic loading, while what we want is usually the page after it.

This can be handled by driving the Chrome browser from Python. On top of that, Chrome's headless mode can be used: with it, no visible Chrome window is actually brought up, yet Python receives the same result a browser would, and what a browser returns is the page after dynamic loading.

To use this method, though, Chrome and the matching version of ChromeDriver must be installed on the machine. Once they are installed, add the following code to middlewares.py (chrome_driver_path_str is a placeholder for wherever your ChromeDriver binary lives):

```python
from selenium import webdriver
from scrapy.http import HtmlResponse

# Placeholder: path to the ChromeDriver binary matching your Chrome version.
chrome_driver_path_str = '/path/to/chromedriver'


class JavaScriptMiddleware:
    def process_request(self, request, spider):
        # Start a headless Chrome instance and load the requested URL in it.
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-gpu')
        driver = webdriver.Chrome(options=option, executable_path=chrome_driver_path_str)
        driver.get(request.url)
        # Scroll far down so lazily loaded content gets rendered too.
        js = 'var q=document.documentElement.scrollTop=10000'
        driver.execute_script(js)
        body = driver.page_source
        # Returning an HtmlResponse short-circuits the normal download:
        # the spider's callback receives this rendered page instead.
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
```

In addition, add the following to settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
}
```

After these two changes, every request the crawler script issues is wrapped by the headless Chrome browser before being sent to the target URL. (Note that the code above starts a fresh browser for every request; reusing a single driver across requests is an easy optimization.)

### Anti-crawling countermeasures: fixing the header

Many sites have their own anti-crawler mechanisms, and one of the most basic is to check whether the header of the incoming HTTP request looks normal. A frequently checked field is User-Agent, which identifies the browser model. The defense checks whether that field names an ordinary browser, and a naive crawler program does not touch it. If you do not explicitly set the field to some browser identifier, you can easily trip the defense and stop receiving data.

To change the User-Agent field Scrapy sends, add the following to settings.py:

```python
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
```

With this in place, Scrapy substitutes the value into the User-Agent header of every request it issues.
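The article stops at one fixed User-Agent for the whole crawl. As a hedged extension (not from the original text), a small downloader middleware can pick a random one per request; the pool below is a hypothetical list to extend as needed, and the class would be registered in DOWNLOADER_MIDDLEWARES like the other middlewares here:

```python
import random


class RandomUserAgentMiddleware:
    # Hypothetical pool of User-Agent strings; extend as needed.
    user_agents = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the header just before the request goes out.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```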
### Anti-crawling countermeasures: IP pools

Very often a crawl returns data normally at first, but some way in, the data stops coming. A very likely cause is that the site has banned your IP. Each site's banning policy is different, but overall it comes down to one IP visiting the site too frequently: the site suspects the traffic is a malicious attack, or worries its servers cannot bear the load, and simply bans the IP.

The countermeasure is equally blunt: crawl through proxy IPs. If the site bans one IP, use a different one; with enough IPs, you can always get all the data you want. Conveniently, there are vendors on the internet selling exactly this kind of IP service, roughly split into free and paid offerings. IPs from free providers are of very low quality, with a fair chance of being outright unusable, so free services are not recommended here. Among paid vendors, many are quite reliable; this article uses one called 快代理, and the code below is specific to that vendor. Every vendor exposes its pool differently, so for the exact usage, defer to each vendor's own documentation.

After purchasing an IP plan on 快代理, add the following code to middlewares.py (your_username, your_password, and your_orderid are the credentials the vendor issues):

```python
from w3lib.http import basic_auth_header
import requests


class ProxyDownloaderMiddleware:
    username = 'your_username'
    password = 'your_password'
    # Single-IP endpoint, used to fetch one replacement at a time.
    api_url = 'https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num=1&pt=1&dedup=1&sep=1'
    proxy_ip_list = []
    list_max_len = 20

    def update_ip(self):
        # Fill the pool up to list_max_len on first use (sep=3 requests
        # space-separated IPs, hence the split below).
        if len(self.proxy_ip_list) != self.list_max_len:
            ip_str = requests.get('https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num={}&pt=1&dedup=1&sep=3'.format(self.list_max_len)).text
            self.proxy_ip_list = ip_str.split(' ')
        while True:
            try:
                # Take the oldest IP and check that it is still alive.
                proxy_ip = self.proxy_ip_list.pop(0)
                proxies = {
                    'http': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip),
                    'https': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip)
                }
                requests.get('http://www.baidu.com', proxies=proxies, timeout=3.05)
                # Still working: move it to the back of the pool and use it.
                self.proxy_ip_list.append(proxy_ip)
                return
            except Exception:
                # Expired: drop it and fetch a fresh IP from the API instead.
                self.proxy_ip_list.append(requests.get(self.api_url).text)

    def process_request(self, request, spider):
        self.update_ip()
        request.meta['proxy'] = 'http://{}'.format(self.proxy_ip_list[-1])
        # Username/password authentication for the proxy.
        request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)
        return None
```

Here username, password, and the order id are the parameters 快代理 requires for using its IPs. The code maintains an IP pool of size 20; each time an IP is needed, the first one is taken and checked for liveness, and if it has expired it is discarded and replaced with a fresh one. Every request Scrapy issues then passes through this proxy wrapper, but for it to work, settings.py also needs the following:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}
```

The dynamic-page section above modified this same DOWNLOADER_MIDDLEWARES dictionary. Its keys are the classes added in middlewares.py, and its values set the order in which requests are wrapped. To use dynamic-page crawling and the IP pool at the same time, the setting in settings.py should look like this:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}
```

Since 100 < 543, a request is first wrapped by the proxy layer, then by the dynamic-loading layer, and only then sent to the target URL.
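To sanity-check the whole middleware chain end to end, one option (an addition of this write-up, not part of the original) is a tiny throwaway spider pointed at httpbin.org/ip, a public service that echoes back the IP it sees; each response should then show a proxy IP rather than your own:

```python
import scrapy


class ProxyCheckSpider(scrapy.Spider):
    name = 'proxy_check'  # hypothetical spider, only for verification
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # httpbin echoes the caller's IP as JSON, e.g. {"origin": "1.2.3.4"}.
        self.logger.info('IP seen by the server: %s', response.text)
```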
---

Header image: Unsplash
Author: 趙宇航
Original post: https://mp.weixin.qq.com/s/-jCxnhzo-G9fzZNT-Azp7g
Original title: 基於Scrapy的爬蟲解決方案
Source: 雲加社區, WeChat official account [ID:QcloudCommunity]
Reprint: Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.