1. Installing Scrapy
Install Scrapy with the pip command: pip install Scrapy
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ scrapy -h
Scrapy 1.3.0 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$
This article uses the following commands:
- startproject: create a new project
- genspider: generate a new spider from a template
- crawl: run a spider
- shell: start the interactive scraping console
Documentation: http://doc.scrapy.org/en/latest/topics/commands.html
2. Creating a New Project
Run scrapy startproject <project_name> to create a new project; here we use example_wu as the project name.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架
$ scrapy startproject
Usage
=====
scrapy startproject <project_name> [project_dir]
...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架
$ scrapy startproject example_wu
New Scrapy project 'example_wu', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
/home/wu_being/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
You can start your first spider with:
cd example_wu
scrapy genspider example example.com
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ ls
example_wu scrapy.cfg
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$
The default directory structure of a new project looks like this:
scrapy.cfg
example_wu/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
Here is what the important files do:
- scrapy.cfg: project configuration (no need to modify)
- items.py: defines the model of the fields to be extracted
- pipelines.py: processes the extracted items (no need to modify for now)
- settings.py: defines settings such as the user agent and the crawl delay
- spiders/: the directory holding the actual spider code
2.1 Defining the Model
The default code in example_wu/items.py is:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ExampleWuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
The ExampleWuItem class is a template. We need to replace its contents with the country fields we want the spider to store at runtime. Here we extract only the country name and population, so change the default code to:
import scrapy


class ExampleWuItem(scrapy.Item):
    name = scrapy.Field()
    population = scrapy.Field()
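For context, a scrapy.Item instance behaves like a dictionary whose keys are restricted to the declared fields. The idea can be sketched with the standard library alone (an illustration of the behavior only, not Scrapy's actual implementation):

```python
class SimpleItem(object):
    """Dict-like container that only accepts declared fields,
    mimicking how scrapy.Item rejects undeclared keys."""
    fields = ()

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%r is not a declared field' % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class CountryItem(SimpleItem):
    fields = ('name', 'population')


item = CountryItem()
item['name'] = 'China'
item['population'] = '1330044000'
print(item['name'])  # China
# item['capital'] = 'Beijing' would raise KeyError: not a declared field
```

This field restriction is what makes typos in field names fail loudly instead of silently producing empty output.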
2.2 Creating a Spider
Now we write the actual crawler code, known in Scrapy as a spider. Use the genspider command, passing a spider name, a domain, and an optional template argument:
scrapy genspider country 127.0.0.1:8000/places --template=crawl
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy genspide
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template
...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy genspider --list
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy genspider country 127.0.0.1:8000/places --template=crawl
Created spider 'country' using template 'crawl' in module:
example_wu.spiders.country
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$
Here we use the built-in crawl template, which generates an initial version close to the country spider we want. Running the genspider command generates the file example_wu/spiders/country.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CountrySpider(CrawlSpider):
    name = 'country'
    #allowed_domains = ['127.0.0.1:8000/places']  ### note: this is not a domain name
    start_urls = ['http://127.0.0.1:8000/places/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
The class attributes are:
- name: the spider's name
- allowed_domains: the list of domains the spider is allowed to crawl. If it is not set, any domain can be crawled. (Note that '127.0.0.1:8000/places' above is not a domain name, which is why it is commented out.)
- start_urls: the list of URLs the spider starts crawling from
- rules: a set of rules with regular expressions telling the spider which links to follow. A rule can also take a callback function that parses the downloaded response, and the parse_item() example method shows how to get data out of a response.
Documentation: http://doc.scrapy.org/en/latest/topics/spiders.html
2.3 Tuning the Settings
By default, Scrapy allows up to 16 concurrent downloads for the same domain with no delay between them, which makes the crawler easy for a server to detect and block. So we add a few lines to example_wu/settings.py:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
#CONCURRENT_REQUESTS_PER_IP = 16
The delay here is not exact; an exact, fixed delay can also be detected by the server and lead to blocking, so Scrapy actually adds a random offset to the delay between two requests. Documentation: http://doc.scrapy.org/en/latest/topics/settings.html
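Concretely, with the default RANDOMIZE_DOWNLOAD_DELAY = True, Scrapy waits a uniformly random time between 0.5 and 1.5 times DOWNLOAD_DELAY. A small sketch of that behavior (the 0.5-1.5 factor is Scrapy's documented default; the code below is an illustration, not taken from Scrapy itself):

```python
import random

DOWNLOAD_DELAY = 5  # seconds, as configured in settings.py above


def randomized_delay(base_delay, rng=random):
    # Uniform factor in [0.5, 1.5], as applied when
    # RANDOMIZE_DOWNLOAD_DELAY is enabled (the default)
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# With DOWNLOAD_DELAY = 5, every wait falls between 2.5 and 7.5 seconds
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(3)]
print(samples)
```

The jitter makes the request timing look less mechanical than a fixed 5-second interval.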
2.4 Testing the Spider
Run the spider with the crawl command followed by the spider name.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy crawl country -s LOG_LEVEL=ERROR
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$
The terminal log shows no error messages. The command argument LOG_LEVEL=ERROR is equivalent to adding the line LOG_LEVEL='ERROR' to settings.py; by default all log messages are shown in the terminal. Next, replace the generated rule with two rules:
rules = (
    Rule(LinkExtractor(allow='/index/'), follow=True),
    Rule(LinkExtractor(allow='/view/'), callback='parse_item'),
)
We added two rules above. The first crawls the index pages and follows their links (crawling links recursively; follow defaults to True when no callback is set), while the second crawls the country pages and passes the downloaded response to the callback function for extraction.
...
2017-01-30 00:12:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/> (referer: None)
2017-01-30 00:12:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Afghanistan-1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 00:12:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/index/1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 00:12:58 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://127.0.0.1:8000/places/default/index/1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-30 00:13:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10> (referer: http://127.0.0.1:8000/places/)
......
2017-01-30 00:14:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/user/login?_next=%2Fplaces%2Fdefault%2Findex%2F1> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 00:14:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/user/register?_next=%2Fplaces%2Fdefault%2Findex%2F1> (referer: http://127.0.0.1:8000/places/default/index/1)
......
We can see that duplicate links are filtered automatically, but the results contain superfluous login and registration pages, which we can exclude with a deny regular expression.
rules = (
    Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
    Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
)
Documentation for this class: http://doc.scrapy.org/en/latest/topics/linkextractors.html
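The allow and deny arguments above are regular expressions matched against each extracted URL; a URL is kept only if it matches allow and does not match deny. That filtering step can be sketched with the standard library (a simplified stand-in for LinkExtractor, not its real implementation):

```python
import re


def filter_links(urls, allow, deny=None):
    """Keep URLs that match the allow pattern and not the deny pattern,
    roughly how LinkExtractor's allow/deny parameters behave."""
    allow_re = re.compile(allow)
    deny_re = re.compile(deny) if deny else None
    kept = []
    for url in urls:
        if not allow_re.search(url):
            continue
        if deny_re is not None and deny_re.search(url):
            continue
        kept.append(url)
    return kept


urls = [
    'http://127.0.0.1:8000/places/default/index/1',
    'http://127.0.0.1:8000/places/default/view/Albania-3',
    'http://127.0.0.1:8000/places/default/user/login?_next=/places/default/index/1',
]
# The login URL matches '/index/' too, but the deny pattern removes it
print(filter_links(urls, allow='/index/', deny='/user/'))
# ['http://127.0.0.1:8000/places/default/index/1']
```

This is why deny='/user/' is needed on both rules: the unwanted login and register URLs carry an index path in their query string and would otherwise slip through the allow pattern.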
2.5 Extracting Data with the shell Command
Scrapy provides a shell command that downloads a URL and drops you into a Python interpreter with the response ready to inspect.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy shell http://127.0.0.1:8000/places/default/view/47
...
2017-01-30 11:24:21 [scrapy.core.engine] INFO: Spider opened
2017-01-30 11:24:21 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 11:24:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/47> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fd8e6d5cbd0>
[s] item {}
[s] request <GET http://127.0.0.1:8000/places/default/view/47>
[s] response <200 http://127.0.0.1:8000/places/default/view/47>
[s] settings <scrapy.settings.Settings object at 0x7fd8e6d5c5d0>
[s] spider <DefaultSpider 'default' at 0x7fd8e5b24c50>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
Let's try it out.
>>>
>>> response
<200 http://127.0.0.1:8000/places/default/view/47>
>>> response.url
'http://127.0.0.1:8000/places/default/view/47'
>>> response.status
200
>>> item
{}
>>>
>>>
Scrapy uses lxml to extract data; here we use CSS selectors and call extract() to get the data out.
>>> response.css('#places_country__row > td.w2p_fw::text')
[<Selector xpath=u"descendant-or-self::*[@id = 'places_country__row']/td[@class and contains(concat(' ', normalize-space(@class), ' '), ' w2p_fw ')]/text()" data=u'China'>]
>>> name_css='#places_country__row > td.w2p_fw::text'
>>> response.css(name_css)
[<Selector xpath=u"descendant-or-self::*[@id = 'places_country__row']/td[@class and contains(concat(' ', normalize-space(@class), ' '), ' w2p_fw ')]/text()" data=u'China'>]
>>> response.css(name_css).extract()
[u'China']
>>>
>>> pop_css='#places_population__row > td.w2p_fw::text'
>>> response.css(pop_css).extract()
[u'1330044000']
>>>
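What the selector '#places_country__row > td.w2p_fw::text' expresses — find the element with id places_country__row, then take the text of its td child with class w2p_fw — can be mimicked on a made-up fragment of the page (the HTML below is an assumption based on the selector, not the actual page markup, and a regex is only adequate for a fixed fragment like this, never a substitute for a real HTML parser):

```python
import re

# Hypothetical fragment mimicking the country row of the page
html = (
    '<tr id="places_country__row">'
    '<td class="w2p_fl">Country:</td>'
    '<td class="w2p_fw">China</td>'
    '</tr>'
)

# Rough equivalent of response.css('#places_country__row > td.w2p_fw::text')
match = re.search(r'id="places_country__row".*?class="w2p_fw">([^<]*)<', html)
print(match.group(1))  # China
```

The shell output above shows Scrapy doing the same thing properly: it translates the CSS selector into an XPath expression and evaluates it against the parsed document.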
2.6 Saving the Extracted Data to a File
Here is the complete spider code.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from example_wu.items import ExampleWuItem


class CountrySpider(CrawlSpider):
    name = 'country'
    #allowed_domains = ['127.0.0.1:8000/places']  ### note: this is not a domain name
    start_urls = ['http://127.0.0.1:8000/places/']

    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = ExampleWuItem()
        item['name'] = response.css('tr#places_country__row td.w2p_fw::text').extract()
        item['population'] = response.css('tr#places_population__row td.w2p_fw::text').extract()
        return item
To save the results, we could add code that stores the extracted data to the parse_item() method, or define an item pipeline. However, Scrapy also provides a convenient --output option that automatically saves the extracted data to a CSV, JSON, or XML file.
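For comparison, the pipeline route could look like the sketch below (the class name and output file name are our own inventions; open_spider, process_item, and close_spider are the standard pipeline hooks, and the class would have to be listed under ITEM_PIPELINES in settings.py to take effect):

```python
# -*- coding: utf-8 -*-
import csv


class CsvExportPipeline(object):
    """Write each scraped item to a CSV file as it arrives."""

    def open_spider(self, spider):
        self.file = open('countries_pipeline.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['name', 'population'])

    def process_item(self, item, spider):
        # extract() returned lists, hence the [0] indexing
        self.writer.writerow([item['name'][0], item['population'][0]])
        return item

    def close_spider(self, spider):
        self.file.close()
```

For this tutorial the --output option is simpler; a pipeline becomes worthwhile when items need cleaning, validation, or a database destination.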
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy crawl country -s LOG_LEVEL=DEBUG
2017-01-30 13:09:52 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: example_wu)
...
2017-01-30 13:09:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-30 13:09:52 [scrapy.core.engine] INFO: Spider opened
2017-01-30 13:09:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-30 13:09:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-30 13:09:52 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/> (referer: None)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Afghanistan-1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Afghanistan-1>
{'name': [u'Afghanistan'], 'population': [u'29121286']}
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/index/1> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://127.0.0.1:8000/places/default/index/1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-30 13:09:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Antigua-and-Barbuda-10>
{'name': [u'Antigua and Barbuda'], 'population': [u'86754']}
2017-01-30 13:09:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Antarctica-9> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:09:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Antarctica-9>
{'name': [u'Antarctica'], 'population': [u'0']}
... ...
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy crawl country -s LOG_LEVEL=INFO --output=countries.csv
...
2017-01-30 13:11:33 [scrapy.extensions.feedexport] INFO: Stored csv feed (252 items) in: countries.csv
2017-01-30 13:11:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 160417,
'downloader/request_count': 280,
'downloader/request_method_count/GET': 280,
'downloader/response_bytes': 2844451,
'downloader/response_count': 280,
'downloader/response_status_count/200': 279,
'downloader/response_status_count/400': 1,
'dupefilter/filtered': 61,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 30, 5, 11, 33, 487258),
'item_scraped_count': 252,
'log_count/INFO': 8,
'request_depth_max': 26,
'response_received_count': 280,
'scheduler/dequeued': 279,
'scheduler/dequeued/memory': 279,
'scheduler/enqueued': 279,
'scheduler/enqueued/memory': 279,
'start_time': datetime.datetime(2017, 1, 30, 5, 10, 34, 304933)}
2017-01-30 13:11:33 [scrapy.core.engine] INFO: Spider closed (finished)
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$
...
The crawl also prints some statistics at the end. Looking at the output file countries.csv, the results are as expected.
name,population
Andorra,84000
American Samoa,57881
Algeria,34586184
Albania,2986952
Aland Islands,26711
Afghanistan,29121286
Antigua and Barbuda,86754
Antarctica,0
Anguilla,13254
... ...
2.7 Pausing and Resuming a Crawl
To support this, we only need to define the JOBDIR setting, which names the directory where the spider's current state is saved.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=2.7crawls/country
...
^C2017-01-30 13:31:27 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-01-30 13:33:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Albania-3>
{'name': [u'Albania'], 'population': [u'2986952']}
2017-01-30 13:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Aland-Islands-2> (referer: http://127.0.0.1:8000/places/)
2017-01-30 13:33:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Aland-Islands-2>
{'name': [u'Aland Islands'], 'population': [u'26711']}
...
We send the termination signal by pressing Ctrl+C once, then wait while the spider downloads a few more items and shuts down on its own. Note that you must not press Ctrl+C a second time to force termination, or the spider state will not be saved successfully.
Running the same command again resumes the crawl.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu
$ scrapy crawl country -s LOG_LEVEL=DEBUG -s JOBDIR=2.7crawls/country
...
2017-01-30 13:33:21 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
2017-01-30 13:33:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Barbados-20> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 13:33:23 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Barbados-20>
{'name': [u'Barbados'], 'population': [u'285653']}
2017-01-30 13:33:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/places/default/view/Bangladesh-19> (referer: http://127.0.0.1:8000/places/default/index/1)
2017-01-30 13:33:25 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/places/default/view/Bangladesh-19>
{'name': [u'Bangladesh'], 'population': [u'156118464']}
...
^C2017-01-30 13:33:43 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-01-30 13:33:43 [scrapy.core.engine] INFO: Closing spider (shutdown)
^C2017-01-30 13:33:44 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/8.Scrapy爬蟲框架/example_wu$
When resuming, watch out for expired cookies. Documentation: http://doc.scrapy.org/en/latest/topics/jobs.html
3. Writing Visual Spiders with Portia
Portia is an open-source tool built on Scrapy that lets you create a spider by clicking the parts of a web page you want to extract, which is more convenient than writing CSS selectors by hand. Documentation: https://github.com/scrapinghub/portia#running-portia
3.1 Installation
First use virtualenv to create an isolated Python environment named portia_examle.
pip install virtualenv
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ virtualenv portia_examle --no-site-packages
New python executable in /home/wu_being/GitHub/WebScrapingWithPython/portia_examle/bin/python
Installing setuptools, pip, wheel...done.
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ source portia_examle/bin/activate
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython$ cd portia_examle/
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$
Install Portia and its dependencies inside the virtualenv.
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$
git clone https://github.com/scrapinghub/portia
cd portia
pip install -r requirements.txt
pip install -e ./slybot
cd slyd
twistd -n slyd
If the installation succeeded, you can open the Portia tool in a browser at http://localhost:9001/static/main.html
3.2 Annotation
- On the Portia start page there is a text box for the URL of the page to extract; enter http://127.0.0.1:8000/places/ . By default the project name is set to new_project and the spider name to the domain being crawled, 127.0.0.1:8000/places/; both can be changed by clicking the corresponding labels.
- Click the Annotate this page button, then click the country's population figure.
- Click the +field button to create a new field named population, then click Done to save. The other fields work the same way.
- When annotation is complete, click the blue Continue Browsing button at the top.
3.3 Tuning the Spider
Once annotation is complete, Portia generates a Scrapy project and saves the resulting files in the data/projects directory. To run the spider, execute the portiacrawl command with the project path and spider name.
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$
portiacrawl portia/slyd/data/projects/new_project
If the spider runs too fast with the default settings and hits server errors, slow it down with:
portiacrawl portia/slyd/data/projects/new_project 127.0.0.1:8000/places/ -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS_PER_DOMAIN=1
In the Crawling tab of the right-hand panel, we can add /index/ and /view/ as follow patterns for the spider, add /user/ as an exclude pattern, and tick the Overlay blocked links checkbox.
3.4 Checking the Results
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$
portiacrawl portia/slyd/data/projects/new_project 127.0.0.1:8000/places/ --output=countries.csv -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS_PER_DOMAIN=1
Portia is a very handy tool to use alongside Scrapy. For simple websites, developing a spider with Portia is usually faster. For complex websites (for example, interfaces that depend on JavaScript), you can instead develop the Scrapy spider directly in Python.
4. Automated Extraction with Scrapely
Portia uses the Scrapely library to train a model, from example data, of what to extract from a web page, and then applies that model to other pages with the same structure.
https://github.com/scrapy/scrapely
(portia_examle) wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/portia_examle$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapely import Scraper
>>> s=Scraper()
>>> train_url='http://127.0.0.1:8000/places/default/view/47'
>>> s.train(train_url,{'name':'China','population':'1330044000'})
>>> test_url='http://127.0.0.1:8000/places/default/view/239'
>>> s.scrape(test_url)
[{u'name':[u'United Kingdom'],u'population':[u'62,348,447']}]
Wu_Being's blog notice: reposting from this blog is welcome; please credit the original post and link. Thanks!
[Python crawler series] "[Python Crawler 8] The Scrapy Framework": http://blog.csdn.net/u014134180/article/details/55508259
Code files for the Python crawler series on GitHub: https://github.com/1040003585/WebScrapingWithPython