NO.54: A Scrapy-based PornHub Crawler

        It has been a long time since I updated this blog. I have actually built quite a few things in the meantime, but I kept putting off sitting down to organize my notes; today I finally talked myself into it. So of everything I have done, what should I write about? Crawlers were the first thing I touched after picking up Python, and I have written plenty of crawler posts before, but never one about the most famous crawler framework of all, Scrapy, so that is my starting point. I also studied a Scrapy-based crawler by someone on GitHub (I can no longer find the repository, and that crawler has long since stopped working anyway). The target site is a good one; I often go there to find linear algebra and probability theory videos to study. Standing on the shoulders of giants, I implemented downloading of the site's videos and video thumbnails, as follows:

 

Project Preparation

1. A proxy or VPN to reach the site

2. Python 3.7 (setting up the development environment is not covered here)

3. A running MongoDB instance to connect to

Theoretical Background

      Scrapy是基於Twisted的異步處理框架,默認是10線程同步。其數據流由引擎控制,數據流的過程如下:

1. The Engine opens a website, finds the Spider that handles it, and asks that Spider for the first URL to crawl.

2. The Engine gets the first URL from the Spider and schedules it as a Request via the Scheduler.

3. The Engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to the Engine, which forwards it through the Downloader Middlewares to the Downloader.

5. Once the page is downloaded, the Downloader generates a Response for it and sends it back to the Engine through the Downloader Middlewares.

6. The Engine receives the Response from the Downloader and sends it through the Spider Middlewares to the Spider for processing.

7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8. The Engine passes the returned Items to the Item Pipelines and the new Requests to the Scheduler.

9. Steps 2-8 repeat until the Scheduler has no more Requests, at which point the Engine shuts down.
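The loop above can be sketched with a toy scheduler. This is a schematic, stdlib-only illustration of the Engine/Scheduler cycle, not Scrapy's actual implementation; the link graph below is made up:

```python
from collections import deque

# A fake link graph standing in for downloaded pages.
PAGES = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def crawl(start_url):
    scheduler = deque([start_url])      # Scheduler: queue of pending Requests
    seen = {start_url}                  # duplicate filter
    items = []
    while scheduler:                    # Engine loop: steps 3-9
        url = scheduler.popleft()       # step 4: next Request from the Scheduler
        links = PAGES[url]              # step 5: Downloader returns a Response
        items.append({"url": url})      # step 7: Spider yields an Item...
        for link in links:              # step 7: ...and new Requests
            if link not in seen:
                seen.add(link)
                scheduler.append(link)  # step 8: new Requests go back to the Scheduler
    return items                        # step 9: Scheduler empty, Engine stops

print(crawl("https://example.com/"))
```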

Project Implementation

1. Creating the Project

# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# create the spider; note: (1) its name must not clash with the project name, (2) the second argument is the site domain
scrapy genspider pornhub pornhub.com
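After these two commands, Scrapy generates its standard project skeleton (file names below follow this project):

```
pornhubBot/
├── scrapy.cfg            # deploy configuration
└── pornhubBot/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # downloader/spider middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── pornhub.py    # generated by `scrapy genspider`
```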

2. Creating the Item

          An Item is the container that holds scraped data; it is used much like a dict. To create one, subclass scrapy.Item and define fields of type scrapy.Field. Looking at the target site, the content we can extract is:

import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()     # video title
    image_urls = scrapy.Field()      # thumbnail download URL
    image_paths = scrapy.Field()     # local thumbnail path
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()     # view count
    video_rating = scrapy.Field()    # popularity rating
    link_url = scrapy.Field()        # URL of the video's watch page
    file_urls = scrapy.Field()       # list of segment download URLs
    file_paths = scrapy.Field()      # list of local segment paths

3. Creating the Spider

          The Spider class is the heart of the project. It has two jobs: defining how the site is crawled, and parsing the pages that come back.

(1) Initialize Requests with the start URLs and set a callback. When a Request succeeds, the resulting Response is generated and passed to that callback as an argument.

(2) In the callback, parse the returned page. The result takes one of two forms: a dict or Item holding extracted data, which can be saved directly; or the next link to follow (e.g. the next page), from which a new Request with its own callback is built and returned for later scheduling.

(3) If a dict or Item is returned, it can be written to a file via components such as Feed Exports, or processed and stored by a Pipeline if one is configured.

(4) If a Request is returned, then once it succeeds its Response is passed to the callback defined on that Request, where a Selector can again be used to parse the new page and build Items from the extracted data.

       Looping through these steps crawls the entire site.

       First, to build the start URLs we look at how the site categorizes its resources: pornhub groups videos by popularity, total views, rating, and so on:

"""歸納PornHub資源鏈接"""
PH_TYPES = [
    '',
    'recommended',
    'video?o=ht', # hot
    'video?o=mv', # Most Viewed
    'video?o=tr', # Top Rate

    # Examples of certain categories
    # 'video?c=1',  # Category = Asian
    # 'video?c=111',  # Category = Japanese
]
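The spider below simply interpolates each suffix into the site root to build its start URLs. As a quick stdlib-only sanity check of that construction:

```python
PH_TYPES = ['', 'recommended', 'video?o=ht', 'video?o=mv', 'video?o=tr']

# De-duplicate, then build the absolute URLs exactly as start_requests() will.
start_urls = ['https://www.pornhub.com/%s' % t for t in sorted(set(PH_TYPES))]
for url in start_urls:
    print(url)
```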

         Scrapy provides its own data-extraction mechanism, the Selector, which is built on lxml and supports XPath, CSS, and regular expressions, for example:

from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@href="image1.html"]/text()').extract_first()
# CSS
title = selector.css('a[href="image1.html"]::text').extract_first()

 

# -*- coding: utf-8 -*-
import logging
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES
from scrapy.http import Request
import re
import json

class PornhubSpider(CrawlSpider):
    name = 'pornhub'   # spider name, unique within the project
    allowed_domains = ['www.pornhub.com']   # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))   # URL suffixes crawled at startup
    logging.getLogger("requests").setLevel(
        logging.WARNING)  # raise the requests logger to WARNING
    logging.basicConfig(
        level=logging.DEBUG,
        format=
        '%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')
                                           
    # build the start Requests and set the callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type,
                          callback=self.parse_ph_key)

    # follow listing pages, yielding one Request per video
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        # logging.info(selector)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # logging.debug('divs :------>' + div.extract())
            # href contains "viewkey=******"; capture everything up to the closing double quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # logging.debug(viewkey)
            # request the video's watch page: the info we need lives in its page source
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and extract its href: <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0], callback=self.parse_ph_key)
    # parse the watch page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # logging.info(selector)
        # the page embeds a "var flashvars_<id> = {...};" JSON blob in a script tag;
        # grab everything between "=" and the first "," or ";" followed by a newline
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', selector.extract())
        logging.debug('PH info JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        duration = _ph_info_json.get('video_duration')
        phItem['video_duration'] = duration
        title = _ph_info_json.get('video_title')
        phItem['video_title'] = title
        image_urls = _ph_info_json.get('image_url')
        phItem['image_urls'] = image_urls
        link_url = _ph_info_json.get('link_url')
        phItem['link_url'] = link_url
        file_urls = _ph_info_json.get('quality_480p')
        phItem['file_urls'] = file_urls

        yield phItem
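parse_ph_info relies on that `flashvars_<id>` JSON blob embedded in a page script. The extraction can be checked offline; the HTML fragment below is made up (the real page carries many more keys):

```python
import json
import re

# A made-up fragment in the shape the spider expects.
html = (
    '<script>var flashvars_12345 = {"video_duration": "600",'
    ' "video_title": "demo", "link_url": "https://example.com/v"};\n</script>'
)

# Same pattern the spider uses: lazily capture up to the first "," or ";"
# that is immediately followed by a newline.
matches = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', html)
info = json.loads(matches[0])
print(info['video_title'], info['video_duration'])
```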

4. Creating the Item Pipelines

       Once the Spider has parsed a Response, the Item is handed to the Item Pipeline, where the configured pipeline components are called in sequence to perform a chain of processing steps:

  • clean the HTML data
  • validate the scraped data and check the extracted fields
  • detect and drop duplicates
  • store the results in a database

       

import pymongo
from pymongo import IndexModel, ASCENDING
from pornhubBot import items 
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

# store items in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # create unique indexes (a single index would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # the one method every pipeline must implement
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        # check the item type, then store it in MongoDB
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # the '$set' operator replaces the given fields, i.e. updates the record
                self.PhRes.update_one({'video_title': item['video_title']}, {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item

# thumbnail downloader built on Scrapy's media pipelines
# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):

    # customize the thumbnail path (and name); note it is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name  # one directory per thumbnail

    # after the download completes, record the local path (IMAGES_STORE + relative
    # path) in the item's image_paths field
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and download it
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})

# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segment URL list from the item and download the files
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local segment path (and name); note it is relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)  # one directory per segment

    # after the download completes, record the local path list (FILES_STORE +
    # relative paths) in the item's file_paths field
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
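item_completed receives `results` as a list of `(success, info)` 2-tuples, and the list comprehension keeps only the paths of successful downloads. A stdlib-only illustration with fabricated results (in Scrapy the failure entry is a Twisted Failure; a plain exception stands in for it here):

```python
# Fabricated results in the shape the media pipelines pass in.
results = [
    (True,  {'url': 'https://example.com/seg1.mp4', 'path': 'seg1/seg1.mp4'}),
    (False, Exception('download failed')),
    (True,  {'url': 'https://example.com/seg2.mp4', 'path': 'seg2/seg2.mp4'}),
]

# Keep only the local paths of downloads that succeeded.
file_paths = [x['path'] for ok, x in results if ok]
print(file_paths)
```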

         Of course, the call order of the pipelines still has to be defined in Settings; the lower the number, the earlier a pipeline is called.

# Scrapy has built-in Feed exports supporting several serialization formats
# UTF-8 keeps Chinese readable in the exported file
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'CSV'
# download directories for files and images
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs
FILES_URLS_FIELD = 'file_urls'   # item field holding the file URLs
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# filter out images that are too small
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DOWNLOADER_MIDDLEWARES = {
    "pornhubBot.middlewares.UserAgentMiddleware": 401,
    "pornhubBot.middlewares.CookiesMiddleware": 402,
    "pornhubBot.middlewares.ProxyMiddleware": 403,
}
ITEM_PIPELINES = {
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
}
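The numeric values determine call order (lower runs first), so the thumbnail and video pipelines run before the MongoDB pipeline. A quick way to see the effective order:

```python
ITEM_PIPELINES = {
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
    "pornhubBot.pipelines.VideoThumbPipeline": 1,
    "pornhubBot.pipelines.VideoFilesPipeline": 1,
}

# Sort by priority, as Scrapy does when it builds the pipeline chain;
# ties keep their insertion order.
order = [name for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
```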

5. Creating the Middlewares

           pornhub's anti-crawling measures are not especially strict, but before I set up a proxy my IP was banned after only a few crawls, so I use a proxy pool that I maintain myself. The usual crawler counter-measures (User-Agent, cookie, and proxy rotation) all live in the middlewares.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random
import json
import logging
import requests


class UserAgentMiddleware(object):
    """ 換User-Agent """

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # print "**************************" + random.choice(self.agents)
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class CookiesMiddleware(object):
    """ 換Cookie """
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)

class ProxyMiddleware(object):
    # the proxy pool serves a random working proxy at http://localhost:5555/random
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )

    def process_request(self, request, spider):
        # request.meta is a plain dict; 'retry_times' is set by Scrapy's retry
        # middleware, so a proxy is only attached once a request has already failed
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
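CookiesMiddleware rotates the `bs` field by generating a random 32-character lowercase string and splicing it into the cookie template via `%` formatting on the dumped JSON. Checked in isolation (a trimmed-down template; note this trick would break if the template contained a literal `%` anywhere else):

```python
import json
import random

cookie_template = {'platform': 'pc', 'bs': '%s'}

# 32 random lowercase letters ('a' is 97, 'z' is 122 in ASCII).
bs = ''.join(chr(random.randint(97, 122)) for _ in range(32))
cookie = json.loads(json.dumps(cookie_template) % bs)
print(cookie['bs'])
```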

    I keep the common User-Agent strings in Settings:

USER_AGENTS = [
          "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
          "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
          "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
          "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
          "Mozilla/2.02E (Win95; U)",
          "Mozilla/3.01Gold (Win95; I)",
          "Mozilla/4.8 [en] (Windows NT 5.1; U)",
          "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
          "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
          "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          ]

6. Building the Settings

       The remaining configuration is defined here:

# -*- coding: utf-8 -*-

# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'


DOWNLOAD_DELAY = 1  # delay between requests, in seconds
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # defaults to 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# endpoint that returns a random proxy
PROXY_URL = 'http://localhost:5555/random'

# ... feed-export, download-directory, DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES
# settings as shown in section 4 ...
# By default Scrapy stores pending requests in a LIFO queue, i.e. crawls depth-first.
# The following settings switch to FIFO queues for a breadth-first crawl:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
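The difference between the default LIFO queue (depth-first) and the FIFO queues configured above can be seen with a few pending requests (stdlib-only sketch; the page names are made up):

```python
from collections import deque

requests = ['page1', 'page2', 'page3']  # enqueued in this order

lifo = list(requests)
lifo_order = [lifo.pop() for _ in requests]      # default: last in, first out

fifo = deque(requests)
fifo_order = [fifo.popleft() for _ in requests]  # with the FIFO settings above

print(lifo_order, fifo_order)
```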

USER_AGENTS = [
    # ... the full User-Agent list shown in section 5 ...
]

7. Defining a Quick-Start Script

from __future__ import absolute_import
from scrapy import cmdline

cmdline.execute("scrapy crawl pornhub".split())

Running this script starts the crawler.


           And with that, you can study your advanced math offline!
