A Scrapy-Based Crawler Solution

## 1. Background

I ran into a crawling requirement at work, and since I had never done this kind of job before, I researched quite a lot online. But the information on the internet is scattered and of very mixed reliability, which was a real inconvenience, so after finishing the task I wanted to write a reasonably complete, beginner-friendly article, in the hope that it helps readers.

Since I have been using Python fairly fluently lately, I decided to do the task in Python. After some research I found that the Scrapy framework has a large user base and thorough documentation, so I chose it. (In truth, Scrapy is only a very thin layer of wrapping; for ordinary crawling tasks, the requests library plus the BeautifulSoup class from bs4 can solve the problem completely.)

First, a brief description of what a crawler is. A crawler starts from one or more URLs and fetches the content of each URL's page by some means (for example, the functions in the requests library); the content is usually HTML. From that content it extracts the information worth recording as well as further URLs to crawl (for example, with the BeautifulSoup class mentioned above). It then applies the same procedure to the newly found URLs, until every URL has been crawled, and the crawler program ends.
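To make that fetch-extract-follow loop concrete, here is a minimal sketch of it using requests and bs4, independent of Scrapy. The start URL is a placeholder, and a real crawler would add domain restrictions and rate limiting:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

queue = deque(['https://example.com/'])  # placeholder start URL
seen = set(queue)

while queue:
    url = queue.popleft()
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    # "Record the information": here we just print the page title.
    print(url, soup.title.string if soup.title else '')
    # "Extract further URLs to crawl": enqueue links we have not seen yet.
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])
        if link not in seen:
            seen.add(link)
            queue.append(link)
```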
Scrapy's official site [1], the official English documentation [2], and a third-party Chinese translation (rather sparse and outdated) [3] are listed below for interested readers to consult. They are not the focus of this article, so Scrapy itself is not introduced in any more depth here.

*[1]: https://scrapy.org/*
*[2]: https://docs.scrapy.org/en/latest/*
*[3]: https://scrapy-chs.readthedocs.io/zh_CN/0.24/index.html*

## 2. How to Use Scrapy

**Install the Scrapy library**

```text
pip install scrapy
```

**Create a new crawler project**

```text
scrapy startproject your_project_name
```

This command creates a folder named your_project_name in the current directory, laid out as follows:

```text
your_project_name
| scrapy.cfg
|----your_project_name
| | __init__.py
| | items.py
| | middlewares.py
| | pipelines.py
| | settings.py
| |----spiders
| | | __init__.py
```

Here scrapy.cfg is the configuration file for the whole project, and the spiders directory holds the crawling logic; since the project has just been created and no spider code has been written yet, that directory is empty.

**Generate a spider**

In the newly created project directory, run:

```text
scrapy genspider example www.qq.com
```

where example is the spider's name and www.qq.com is the first URL the spider will crawl.

After this command, Scrapy generates a file named example.py under the spiders directory, containing a very basic spider template. All that remains is to fill this file with the actual crawling logic and then run the spider. The generated example.py contains:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['qq.com']
    start_urls = ['http://qq.com/']

    def parse(self, response):
        pass
```

ExampleSpider is the spider class just generated. name is the spider's name; allowed_domains restricts which domains the spider will crawl; start_urls holds the initial URLs, populated from the URL entered when the spider was created; and parse is the default parsing callback.

**Run the spider**

In the project directory, run:

```text
scrapy crawl example
```

where example is the name of the spider to run. The framework then crawls pages starting from the initial URLs defined in the example spider, using its parsing callback.

**Debug the spider**

While writing the code, because different sites organize their HTML differently, you need an interactive way to poke at pages and adjust the code accordingly. In many cases Chrome's F12 inspection mode is enough to examine a page's HTML, but sometimes the HTML your code receives is not the same as what the browser shows, which makes interactive access indispensable. (You can also debug by running the full crawl each time, but that is rather inefficient.)

To access a page interactively, run this in the project directory:

```text
scrapy shell www.qq.com
```

The experience is much like typing python on the command line and entering Python's interactive interpreter.
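For a rough illustration of what such a session looks like (the second URL and the outputs here are placeholders, not real output from www.qq.com), the shell drops you into an interpreter with a response object and helpers such as fetch() and view() already defined:

```text
$ scrapy shell www.qq.com
...
>>> response.status
200
>>> response.xpath('//title/text()').get()   # try selectors interactively
'...'
>>> fetch('https://www.qq.com/other/page')    # load a different URL in place
>>> view(response)                            # open the current response in a browser
```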
**Flesh out the parsing callback**

Fleshing out the parsing callback is the core step of writing the crawler. Its initial stub is:

```python
def parse(self, response):
    pass
```

It takes a single argument, response, which is the result of fetching a URL and contains that page's HTML source. (It plays a similar role to requests.Response but is Scrapy's own response class, with a richer set of methods.) The job of parse is to extract the valuable information out of the jumble of HTML inside response.

Scrapy offers two ways of querying the HTML source: css and xpath. The css selector method is specific to Scrapy's own API and its exact usage is only documented in the Scrapy docs, so I don't recommend it; xpath is a general-purpose query language (supported by other tools as well, lxml for example) whose syntax has far more material available online. A full treatment of XPath would take too long, so it is not covered here; if you need it, search engines have plenty of reference material.
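As a sketch of what a filled-in callback might look like, the following assumes a hypothetical page layout in which each article sits in a `<div class="article">` and pagination uses an `<a class="next">` link; the selectors would have to be adapted to the real page:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://qq.com/']

    def parse(self, response):
        # Hypothetical layout: one <div class="article"> per article entry.
        for article in response.xpath('//div[@class="article"]'):
            yield {
                'title': article.xpath('.//h2/a/text()').get(),
                'url': article.xpath('.//h2/a/@href').get(),
            }
        # Hypothetical "next page" link; crawl it with the same callback.
        next_url = response.xpath('//a[@class="next"]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
```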
If during parsing you come across further URLs that themselves need parsing, you can simply call:

```python
yield scrapy.Request(url_str, callback=self.parse)
```

where url_str is the string of the URL to parse and self.parse is the parsing callback. The default callback is used here, but a custom one works just as well (a custom callback must take and return the same kinds of values as the default one).

Worth noting: besides the two parameters above, scrapy.Request can also carry values across requests through its meta argument, and those values can be read back via response.meta in the callback.
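A small sketch of the meta mechanism, with a hypothetical parse_detail callback that remembers which list page each detail page came from:

```python
import scrapy


class MetaExampleSpider(scrapy.Spider):
    name = 'meta_example'  # hypothetical spider, for illustration only
    start_urls = ['http://qq.com/']

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            # Attach the originating list page to the scheduled request.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail,
                                 meta={'list_page': response.url})

    def parse_detail(self, response):
        # The dict passed as meta comes back attached to the response.
        yield {'detail_url': response.url,
               'from_list_page': response.meta['list_page']}
```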
**A small tip**

By default, Scrapy obeys the target site's robots.txt (the file that declares what may and may not be crawled), yet what we want to crawl is often exactly what that file disallows. Changing ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False in settings.py avoids this situation.

## 3. Common Issues

### Dynamic pages are not parsed correctly

The simple steps above can only parse static pages. Pages that need dynamic loading (pages with JavaScript code, for example) are not parsed correctly, because the HTML source inside response is the page before dynamic loading, while what we want is usually the page after it.

This can be handled by driving the Chrome browser from Python. On top of that, Chrome's headless mode can be used: with it, no visible Chrome window is actually brought up, yet Python receives the same result a browser would, and what a browser returns is the page after dynamic loading.

To use this method, though, Chrome and the matching version of ChromeDriver must be installed on the machine. Once they are installed, add the following code to middlewares.py (chrome_driver_path_str is a placeholder for wherever your ChromeDriver binary lives):

```python
from selenium import webdriver
from scrapy.http import HtmlResponse

# Placeholder: path to the ChromeDriver binary matching your Chrome version.
chrome_driver_path_str = '/path/to/chromedriver'


class JavaScriptMiddleware:
    def process_request(self, request, spider):
        # Start a headless Chrome instance and load the requested URL in it.
        option = webdriver.ChromeOptions()
        option.add_argument('--headless')
        option.add_argument('--no-sandbox')
        option.add_argument('--disable-gpu')
        driver = webdriver.Chrome(options=option, executable_path=chrome_driver_path_str)
        driver.get(request.url)
        # Scroll far down so lazily loaded content gets rendered too.
        js = 'var q=document.documentElement.scrollTop=10000'
        driver.execute_script(js)
        body = driver.page_source
        # Returning an HtmlResponse short-circuits the normal download:
        # the spider's callback receives this rendered page instead.
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
```

In addition, add the following to settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
}
```

After these two changes, every request the crawler script issues is wrapped by the headless Chrome browser before being sent to the target URL. (Note that the code above starts a fresh browser for every request; reusing a single driver across requests is an easy optimization.)

### Anti-crawling countermeasures: fixing the header

Many sites have their own anti-crawler mechanisms, and one of the most basic is to check whether the header of the incoming HTTP request looks normal. A frequently checked field is User-Agent, which identifies the browser model. The defense checks whether that field names an ordinary browser, and a naive crawler program does not touch it. If you do not explicitly set the field to some browser identifier, you can easily trip the defense and stop receiving data.

To change the User-Agent field Scrapy sends, add the following to settings.py:

```python
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
```

With this in place, Scrapy substitutes the value into the User-Agent header of every request it issues.
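The article stops at one fixed User-Agent for the whole crawl. As a hedged extension (not from the original text), a small downloader middleware can pick a random one per request; the pool below is a hypothetical list to extend as needed, and the class would be registered in DOWNLOADER_MIDDLEWARES like the other middlewares here:

```python
import random


class RandomUserAgentMiddleware:
    # Hypothetical pool of User-Agent strings; extend as needed.
    user_agents = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the header just before the request goes out.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```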
### Anti-crawling countermeasures: IP pools

Very often a crawl returns data normally at first, but some way in, the data stops coming. A very likely cause is that the site has banned your IP. Each site's banning policy is different, but overall it comes down to one IP visiting the site too frequently: the site suspects the traffic is a malicious attack, or worries its servers cannot bear the load, and simply bans the IP.

The countermeasure is equally blunt: crawl through proxy IPs. If the site bans one IP, use a different one; with enough IPs, you can always get all the data you want. Conveniently, there are vendors on the internet selling exactly this kind of IP service, roughly split into free and paid offerings. IPs from free providers are of very low quality, with a fair chance of being outright unusable, so free services are not recommended here. Among paid vendors, many are quite reliable; this article uses one called 快代理, and the code below is specific to that vendor. Every vendor exposes its pool differently, so for the exact usage, defer to each vendor's own documentation.

After purchasing an IP plan on 快代理, add the following code to middlewares.py (your_username, your_password, and your_orderid are the credentials the vendor issues):

```python
from w3lib.http import basic_auth_header
import requests


class ProxyDownloaderMiddleware:
    username = 'your_username'
    password = 'your_password'
    # Single-IP endpoint, used to fetch one replacement at a time.
    api_url = 'https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num=1&pt=1&dedup=1&sep=1'
    proxy_ip_list = []
    list_max_len = 20

    def update_ip(self):
        # Fill the pool up to list_max_len on first use (sep=3 requests
        # space-separated IPs, hence the split below).
        if len(self.proxy_ip_list) != self.list_max_len:
            ip_str = requests.get('https://dps.kdlapi.com/api/getdps/?orderid=your_orderid&num={}&pt=1&dedup=1&sep=3'.format(self.list_max_len)).text
            self.proxy_ip_list = ip_str.split(' ')
        while True:
            try:
                # Take the oldest IP and check that it is still alive.
                proxy_ip = self.proxy_ip_list.pop(0)
                proxies = {
                    'http': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip),
                    'https': 'http://{}:{}@{}'.format(self.username, self.password, proxy_ip)
                }
                requests.get('http://www.baidu.com', proxies=proxies, timeout=3.05)
                # Still working: move it to the back of the pool and use it.
                self.proxy_ip_list.append(proxy_ip)
                return
            except Exception:
                # Expired: drop it and fetch a fresh IP from the API instead.
                self.proxy_ip_list.append(requests.get(self.api_url).text)

    def process_request(self, request, spider):
        self.update_ip()
        request.meta['proxy'] = 'http://{}'.format(self.proxy_ip_list[-1])
        # Username/password authentication for the proxy.
        request.headers['Proxy-Authorization'] = basic_auth_header(self.username, self.password)
        return None
```

Here username, password, and the order id are the parameters 快代理 requires for using its IPs. The code maintains an IP pool of size 20; each time an IP is needed, the first one is taken and checked for liveness, and if it has expired it is discarded and replaced with a fresh one. Every request Scrapy issues then passes through this proxy wrapper, but for it to work, settings.py also needs the following:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}
```

The dynamic-page section above modified this same DOWNLOADER_MIDDLEWARES dictionary. Its keys are the classes added in middlewares.py, and its values set the order in which requests are wrapped. To use dynamic-page crawling and the IP pool at the same time, the setting in settings.py should look like this:

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.JavaScriptMiddleware': 543,
    'your_project_name.middlewares.ProxyDownloaderMiddleware': 100,
}
```

Since 100 < 543, a request is first wrapped by the proxy layer, then by the dynamic-loading layer, and only then sent to the target URL.
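To sanity-check the whole middleware chain end to end, one option (an addition of this write-up, not part of the original) is a tiny throwaway spider pointed at httpbin.org/ip, a public service that echoes back the IP it sees; each response should then show a proxy IP rather than your own:

```python
import scrapy


class ProxyCheckSpider(scrapy.Spider):
    name = 'proxy_check'  # hypothetical spider, only for verification
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # httpbin echoes the caller's IP as JSON, e.g. {"origin": "1.2.3.4"}.
        self.logger.info('IP seen by the server: %s', response.text)
```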
---

Header image: Unsplash
Author: 趙宇航
Original post: https://mp.weixin.qq.com/s/-jCxnhzo-G9fzZNT-Azp7g
Original title: 基於Scrapy的爬蟲解決方案
Source: 雲加社區, WeChat official account [ID:QcloudCommunity]
Reprint: Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.