Basic Introduction
A previous post gave a brief introduction to getting started with Scrapy and scraped data from a news site. This time we will scrape food images from 360 Images (so.com). We will save the image metadata in both MySQL and MongoDB, so first make sure MySQL and MongoDB are installed; the installation tutorials available online should be enough for that part.
Requirement Analysis
First, let's look at the target site: https://image.so.com/z?ch=food. Opening this page shows a wall of food images. If we open Chrome DevTools, switch to the XHR tab, and keep scrolling down, a stream of Ajax requests appears, as shown below:
Open the details of one of these requests:
The response is JSON. Its list field holds the details of 30 images: each image's ID, title, URL, and so on. Looking at the Ajax request parameters, one parameter, sn, keeps changing: sn=30 returns the first 30 images, sn=60 the next 30, and so on. The ch parameter is the category and listtype is the sort order; the remaining parameters can be ignored. So to page through the results we only need to change sn.
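Before writing any Scrapy code, you can probe this interface directly to confirm the field names. A minimal sketch using requests (the /zjl endpoint and the field names below are taken from the request observed in DevTools and may change if the site is updated):

import requests

params = {'ch': 'food', 'listtype': 'new', 'sn': 30}
resp = requests.get('https://image.so.com/zjl', params=params)
data = resp.json()
# each entry in 'list' describes one image
for entry in data.get('list', [])[:3]:
    print(entry.get('id'), entry.get('title'), entry.get('qhimg_url'))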
Creating the Project
First we create a new project. Pick a folder and open a cmd window there:
cd C:\Users\lixue\Desktop\test
Then create the project and generate a spider:
scrapy startproject image360
scrapy genspider images images.so.com
Run these two commands one after the other (the genspider command should be run from inside the newly created project directory); this generates the project folder and the spider file.
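For reference, the generated layout should look roughly like this (the file names come from Scrapy's default project template):

image360/
├── scrapy.cfg
└── image360/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── images.py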
Constructing Requests
Next we decide how many pages to crawl. Here I crawl 30 pages of 30 images each, 900 images in total. We can define a MAX_PAGE variable in settings.py:
MAX_PAGE = 30
Next we define the spider's start_requests() method to generate the 30 requests:
class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)
We first define the two fixed parameters ch and listtype; the sn parameter is generated in the loop. urlencode() turns the dict into the GET query string, which is appended to the base URL to form the complete URL, from which we construct and yield the Request. We also need the following imports:
from scrapy import Spider, Request
from urllib.parse import urlencode
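As a quick sanity check, this is roughly what urlencode() produces for one page:

from urllib.parse import urlencode

data = {'ch': 'food', 'listtype': 'new', 'sn': 30}
print('https://image.so.com/zjl?' + urlencode(data))
# https://image.so.com/zjl?ch=food&listtype=new&sn=30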
Before crawling we also need to change ROBOTSTXT_OBEY in settings.py, otherwise nothing will be fetched:
ROBOTSTXT_OBEY = False
Now we can try a crawl:
scrapy crawl images
All responses come back with status 200, which means the requests are working.
Extracting Information
First we need to create a new Item called ImageItem, as shown below:
from scrapy import Item, Field

class ImageItem(Item):
    # collection is the MongoDB collection name, table is the MySQL table name
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
Four fields are defined here: the image's ID, URL, title, and thumbnail. The two extra attributes collection and table are both set to the string 'images'; they are the MongoDB collection name and the MySQL table name, respectively.
Next we write the extraction of these fields by rewriting the parse() method as follows:
def parse(self, response):
    result = json.loads(response.text)
    for image in result.get('list'):
        item = ImageItem()
        item['id'] = image.get('id')
        item['url'] = image.get('qhimg_url')
        item['title'] = image.get('title')
        item['thumb'] = image.get('qhimg_downurl')
        yield item
We first parse the JSON, then iterate over the list, assign the relevant fields to an ImageItem, and yield the resulting Item.
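If you want to sanity-check the parse() logic without hitting the site, you can feed it a hand-built response. A minimal sketch, assuming the spider module is image360/spiders/images.py as generated above (the sample entry is made up; only the key names follow the real API):

import json
from scrapy.http import TextResponse
from image360.spiders.images import ImagesSpider

sample = {'list': [{'id': '0001',
                    'title': 'braised pork',
                    'qhimg_url': 'https://p0.qhimg.com/example.jpg',
                    'qhimg_downurl': 'https://p0.qhimg.com/example_down.jpg'}]}
response = TextResponse(url='https://image.so.com/zjl?ch=food&listtype=new&sn=30',
                        body=json.dumps(sample), encoding='utf-8')
for item in ImagesSpider().parse(response):
    print(dict(item))   # {'id': '0001', 'url': '...', 'title': '...', 'thumb': '...'}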
Storing the Data
We use both MongoDB and MySQL here, but I'll only walk through MySQL in detail. Before doing any storage, make sure MySQL is installed and running.
MySQLPipeline
First create the database; I'll name it image360, matching the MYSQL_DATABASE setting added later. The SQL is:
CREATE DATABASE image360 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Then create the table:
CREATE TABLE `images` (
`id` varchar(255) DEFAULT NULL,
`url` varchar(255) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`thumb` varchar(255) DEFAULT NULL,
PRIMARY KEY (`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Once this SQL has been executed the table is ready, and we can start writing data into it.
Now let's implement the MysqlPipeline; the code is shown below:
class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection parameters from settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        # build the INSERT statement dynamically from the item's fields
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
Here we insert data by dynamically constructing the SQL statement from the item's fields, so the pipeline does not hard-code any column names.
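To make the dynamic construction concrete, this is roughly what gets built for one item (the values are illustrative):

data = {'id': '0001', 'url': 'https://p0.qhimg.com/example.jpg',
        'title': 'braised pork', 'thumb': 'https://p0.qhimg.com/example_down.jpg'}
keys = ', '.join(data.keys())             # 'id, url, title, thumb'
values = ', '.join(['%s'] * len(data))    # '%s, %s, %s, %s'
sql = 'insert into %s (%s) values (%s)' % ('images', keys, values)
# insert into images (id, url, title, thumb) values (%s, %s, %s, %s)
# cursor.execute(sql, tuple(data.values())) then lets pymysql handle the escaping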
We also need the MySQL connection settings, so add the following variables to settings.py:
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
With the database configuration in place, the MysqlPipeline is complete.
Image Pipeline
Now let's look at the Image Pipeline. Scrapy provides dedicated pipelines for handling downloads (the FilesPipeline for files and the ImagesPipeline for images). Downloads go through the same machinery as ordinary page fetches, are asynchronous and concurrent, and are therefore very efficient.
First we define where downloaded files are stored by adding an IMAGES_STORE variable in settings.py:
IMAGES_STORE = './images'
All downloaded images will be saved in this folder. Here is the pipeline I wrote:
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])
get_media_requests() receives each crawled Item; we take its url field, build a Request from it, and yield it so that it is added to the scheduling queue and eventually downloaded.
file_path() constructs the file name that the downloaded image is saved under.
item_completed() is called once all the downloads for a single Item have finished. Since not every download succeeds, we drop any Item whose image failed to download instead of saving it to the database.
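For reference, the results argument that Scrapy passes to item_completed() is a list of (success, info) tuples, one per requested download; roughly (the values below are made up):

results = [
    (True, {'url': 'https://p0.qhimg.com/example.jpg',
            'path': 'example.jpg',         # path relative to IMAGES_STORE
            'checksum': 'a7d0cb...'}),
    # on failure the tuple is (False, <Failure describing the error>)
]
image_paths = [x['path'] for ok, x in results if ok]   # -> ['example.jpg']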
MongoDB Pipeline
I'll just give the code here without much commentary:
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        # insert_one() replaces the deprecated insert()
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
We also need to add the MongoDB storage settings, namely the connection address and the database name. Add these two variables to settings.py:
MONGO_URI = 'localhost'
MONGO_DB = 'image360'
At this point all three Item Pipelines are defined. The last step is to enable them by modifying ITEM_PIPELINES in settings.py, as shown below:
ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}
Finally, we run the crawler:
scrapy crawl images
The crawler's log output looks like this:
We can also take a look at the downloaded images and the records stored in the databases:
I collected these pictures purely for fun, so don't judge me for what you see.
And here is what ends up in the database:
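To double-check the MongoDB side yourself, a quick query from a Python shell (assuming the MONGO_URI and MONGO_DB settings above) looks like this:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['image360']
print(db['images'].count_documents({}))   # how many items were stored
print(db['images'].find_one())            # inspect one stored document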
Selected Code
Finally, here are the modules I changed the most. There may be paths and other settings that you will need to adapt to your own machine; if you have ideas for improving this crawler, feel free to contact me.
1.images.py
from scrapy import Spider, Request
from urllib.parse import urlencode
import json
from image360.items import ImageItem


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('list'):
            item = ImageItem()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_downurl')
            yield item
2.settings.py
BOT_NAME = 'image360'
MAX_PAGE = 30
SPIDER_MODULES = ['image360.spiders']
NEWSPIDER_MODULE = 'image360.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'image360 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'image360.middlewares.Image360SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'image360.middlewares.Image360DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'image360.pipelines.Image360Pipeline': 300,
#}
ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGO_URI = 'localhost'
MONGO_DB = 'image360'
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
IMAGES_STORE = './images'
3.items.py
from scrapy import Item, Field


class ImageItem(Item):
    # collection is the MongoDB collection name, table is the MySQL table name
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
4.pipelines.py
import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])