Crawling 360 Food Images with Scrapy

Introduction

A previous post gave a brief introduction to getting started with Scrapy and crawled data from a news site. This time we will crawl food images from 360 Images. We will save the image metadata in both MySQL and MongoDB, so both databases need to be installed first; for installation, refer to any of the tutorials available online.

Requirements Analysis

First, let's look at the target site: https://image.so.com/z?ch=food. Opening this page shows a large number of food images. Open Chrome's developer tools, switch to the XHR tab, and keep scrolling down; many Ajax requests appear, as shown below:
[Screenshot: Ajax (XHR) requests in the developer tools]
Open the details of one of these requests:
[Screenshot: details of one Ajax request]
The response format is JSON. The list field holds the detailed information for the images: IDs, titles, links and so on, 30 images per request. Looking at the request parameters, one parameter, sn, keeps changing: sn=30 returns the first 30 images, sn=60 the next 30, and so on. The ch parameter is the category and listtype is the sort order; the other parameters can be ignored. To page through the results we only need to change sn.
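Before writing the spider, it's worth verifying this analysis with a quick standalone request (a minimal sketch; the endpoint and field names are taken from the XHR panel and may change if the site updates its API):

from urllib.parse import urlencode
import requests

# Build one page of the Ajax URL exactly as the browser does
params = {'ch': 'food', 'listtype': 'new', 'sn': 30}
url = 'https://image.so.com/zjl?' + urlencode(params)
data = requests.get(url, timeout=10).json()
# 'list' should contain 30 image entries if the analysis is correct
print(len(data.get('list', [])))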

Creating the Project

First, create a new project in a folder of your choice. Using a cmd window:

cd C:\Users\lixue\Desktop\test

Then create the project and generate a spider:

scrapy startproject image360
scrapy genspider images images.so.com

Run these two commands one after another; once they finish, the project folder and the spider are generated.
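The generated project should look roughly like this (the layout follows Scrapy's default template):

image360/
    scrapy.cfg
    image360/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            images.py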

Constructing Requests

Next we decide how many pages to crawl. Here I crawl 30 pages of 30 images each, 900 images in total. First define a MAX_PAGE variable in settings.py:

MAX_PAGE = 30

Then define the start_requests() method in the spider to generate the 30 requests:

class ImagesSpider(Spider):
    name = 'images'
    # The Ajax endpoint lives on image.so.com, so both domains are allowed
    # to keep the offsite middleware from filtering our requests.
    allowed_domains = ['image.so.com', 'images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

We first define the two fixed parameters ch and listtype, and then generate the sn parameter in the loop. urlencode() converts the dictionary into GET parameters, which are appended to the base URL to form the complete URL, and a Request is constructed for each page. We also need the following imports:

from scrapy import Spider, Request
from urllib.parse import urlencode
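For reference, the URL generated on the first iteration (sn=30) looks like this:

https://image.so.com/zjl?ch=food&listtype=new&sn=30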

Before crawling, we also need to change ROBOTSTXT_OBEY in settings.py, otherwise the pages cannot be fetched:

ROBOTSTXT_OBEY = False

Now we can try a crawl:

scrapy crawl images

All responses come back with status 200, so the requests are working.

Extracting Information

First we need to define a new Item called ImageItem in items.py, as follows:

from scrapy import Item, Field


class ImageItem(Item):
    # Both the MongoDB collection and the MySQL table are named 'images'.
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()

Four fields are defined here: the image's ID, link, title and thumbnail. The two extra attributes collection and table are both strings, naming the MongoDB collection and the MySQL table respectively.
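Because collection and table are plain class attributes rather than Fields, the pipelines can read them straight off an item instance. A quick illustration:

from image360.items import ImageItem

item = ImageItem()
print(item.collection)  # 'images' -> used as the MongoDB collection name
print(item.table)       # 'images' -> used as the MySQL table name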
Next, let's extract these fields by rewriting the parse() method as follows:

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('list'):
            item = ImageItem()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_downurl')
            yield item

We first parse the JSON, then iterate over the list, assign each field to an ImageItem, and yield the resulting Item object.
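Judging from the fields accessed above, each entry in list looks roughly like this (values here are placeholders; the real response carries more fields):

{
    "id": "...",
    "title": "...",
    "qhimg_url": "https://.../full_image.jpg",
    "qhimg_downurl": "https://.../thumb.jpg"
}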

Storing the Data

We store the data in both MongoDB and MySQL, but here I will only walk through the MySQL part in detail. Before writing the storage code, make sure MySQL is installed and working.

MySQLPipeline

First create a database, again named image360. The SQL statement is:

CREATE DATABASE image360 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Then create the data table:

CREATE TABLE `images` (
  `id` varchar(255) DEFAULT NULL,
  `url` varchar(255) NOT NULL,
  `title` varchar(255) DEFAULT NULL,
  `thumb` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

After executing these SQL statements the table is ready, and we can start storing data in it.
Let's implement a MysqlPipeline, as shown below:

class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        # Recent versions of PyMySQL only accept keyword arguments here.
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
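With the four fields of ImageItem populated, the dynamically built statement expands to the following (the column order follows the order in which the fields were assigned in parse()):

insert into images (id, url, title, thumb) values (%s, %s, %s, %s)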

As shown above, the INSERT statement is built dynamically from whatever fields the item carries. We also need the MySQL connection settings; add the following variables to settings.py:

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306

With the database configuration defined, the MysqlPipeline is complete.

Image Pipeline

Now let's look at the Image Pipeline. Scrapy provides dedicated pipelines for downloading: the FilesPipeline for files and the ImagesPipeline for images. Downloads go through the same machinery as page requests and are handled asynchronously, so they are very efficient.
First we need to define where the downloaded files are stored by adding an IMAGES_STORE variable to settings.py:

IMAGES_STORE = './images'

All downloaded images will be saved into this folder. Here is the ImagePipeline I wrote:

class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # Name the saved file after the last segment of its URL.
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # Keep only items whose image actually downloaded.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        # Schedule the item's image URL for download.
        yield Request(item['url'])

get_media_requests() receives each scraped Item, extracts its url field, and yields a Request for it; the Request is added to the scheduling queue and downloaded when its turn comes.
file_path() determines the file name used when the image is saved.
item_completed() is called when all downloads for a single Item have finished. Not every download succeeds, so we collect the paths of the successful downloads and raise DropItem when there are none, which keeps failed items out of the database pipelines. Note that the ImagesPipeline also requires the Pillow library to be installed.
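For reference, results is a list of (success, info) tuples, one per requested file; for a successful download, info is a dict roughly of this shape:

[(True, {'url': 'https://.../xxx.jpg', 'path': 'xxx.jpg', 'checksum': '...'})]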

MongoDB Pipeline

Here I'll just give the code without going into much detail:

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        # insert_one() replaces the deprecated insert() in current pymongo.
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

We also need to add the MongoDB connection URI and database name to settings.py, i.e. these two variables:

MONGO_URI = 'localhost'
MONGO_DB = 'image360'

At this point all three Item Pipelines are defined. Finally, we enable them by modifying ITEM_PIPELINES in settings.py as follows:

ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}

Finally, run the spider to start crawling:

scrapy crawl images

The crawler's log output:
[Screenshot: crawler log output]
We can also take a look at the saved images and the records stored in the databases.
I collected these images purely for fun, so please don't judge me:
[Screenshot: downloaded food images]
And here are the records stored in the database:
[Screenshot: database records]

Selected Code

Finally, here is the code for the modules that were changed the most. Some paths and settings will need to be adapted to your own machine, and if you have suggestions for improving this crawler, feel free to contact me.

1. images.py
from scrapy import Spider, Request
from urllib.parse import urlencode
import json
from image360.items import ImageItem
class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['image.so.com', 'images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('list'):
            item = ImageItem()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_downurl')
            yield item
2. settings.py
BOT_NAME = 'image360'
MAX_PAGE = 30
SPIDER_MODULES = ['image360.spiders']
NEWSPIDER_MODULE = 'image360.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'image360 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'image360.middlewares.Image360SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'image360.middlewares.Image360DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'image360.pipelines.Image360Pipeline': 300,
#}
ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGO_URI = 'localhost'
MONGO_DB = 'image360'

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
IMAGES_STORE = './images'
3. items.py
from scrapy import Item, Field


class ImageItem(Item):
    # Both the MongoDB collection and the MySQL table are named 'images'.
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
4. pipelines.py
import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])