NO.54: A Scrapy-based PornHub Crawler

        It has been a long time since I updated this blog. I have actually built quite a few things in the meantime, but I kept putting off sitting down to organize my notes; today I finally talked myself into it. So of everything I have done, what should I write about? Crawlers were the first thing I touched after picking up Python, and I have written plenty of crawler posts before, but never one about the most famous crawler framework of all, Scrapy, so that is my starting point. I also studied a Scrapy-based crawler by someone on GitHub (I can no longer find the repository, and that crawler has long since stopped working anyway). The target site is a good one; I often go there to find linear algebra and probability theory videos to study. Standing on the shoulders of giants, I implemented downloading of the site's videos and video thumbnails, as follows:

 

Project Preparation

1. A proxy or VPN to reach the site

2. Python 3.7 (setting up the development environment is not covered here)

3. A running MongoDB instance to connect to

Theoretical Background

      Scrapy是基於Twisted的異步處理框架,默認是10線程同步。其數據流由引擎控制,數據流的過程如下:

1. The Engine opens a website, finds the Spider that handles it, and asks that Spider for the first URL to crawl.

2. The Engine gets the first URL from the Spider and schedules it as a Request via the Scheduler.

3. The Engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to the Engine, which forwards it through the Downloader Middlewares to the Downloader.

5. Once the page is downloaded, the Downloader generates a Response for it and sends it back to the Engine through the Downloader Middlewares.

6. The Engine receives the Response from the Downloader and sends it through the Spider Middlewares to the Spider for processing.

7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8. The Engine passes the returned Items to the Item Pipelines and the new Requests to the Scheduler.

9. Steps 2-8 repeat until the Scheduler has no more Requests, at which point the Engine shuts down.
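The loop above can be sketched with a toy scheduler. This is a schematic, stdlib-only illustration of the Engine/Scheduler cycle, not Scrapy's actual implementation; the link graph below is made up:

```python
from collections import deque

# A fake link graph standing in for downloaded pages.
PAGES = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def crawl(start_url):
    scheduler = deque([start_url])      # Scheduler: queue of pending Requests
    seen = {start_url}                  # duplicate filter
    items = []
    while scheduler:                    # Engine loop: steps 3-9
        url = scheduler.popleft()       # step 4: next Request from the Scheduler
        links = PAGES[url]              # step 5: Downloader returns a Response
        items.append({"url": url})      # step 7: Spider yields an Item...
        for link in links:              # step 7: ...and new Requests
            if link not in seen:
                seen.add(link)
                scheduler.append(link)  # step 8: new Requests go back to the Scheduler
    return items                        # step 9: Scheduler empty, Engine stops

print(crawl("https://example.com/"))
```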

Project Implementation

1. Creating the Project

# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# create the spider; note: (1) its name must not clash with the project name, (2) the second argument is the site domain
scrapy genspider pornhub pornhub.com
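After these two commands, Scrapy generates its standard project skeleton (file names below follow this project):

```
pornhubBot/
├── scrapy.cfg            # deploy configuration
└── pornhubBot/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # downloader/spider middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── pornhub.py    # generated by `scrapy genspider`
```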

2. Creating the Item

          An Item is the container that holds scraped data; it is used much like a dict. To create one, subclass scrapy.Item and define fields of type scrapy.Field. Looking at the target site, the content we can extract is:

import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()     # video title
    image_urls = scrapy.Field()      # thumbnail download URL
    image_paths = scrapy.Field()     # local thumbnail path
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()     # view count
    video_rating = scrapy.Field()    # popularity rating
    link_url = scrapy.Field()        # URL of the video's watch page
    file_urls = scrapy.Field()       # list of segment download URLs
    file_paths = scrapy.Field()      # list of local segment paths

3. Creating the Spider

          The Spider class is the heart of the project. It has two jobs: defining how the site is crawled, and parsing the pages that come back.

(1) Initialize Requests with the start URLs and set a callback. When a Request succeeds, the resulting Response is generated and passed to that callback as an argument.

(2) In the callback, parse the returned page. The result takes one of two forms: a dict or Item holding extracted data, which can be saved directly; or the next link to follow (e.g. the next page), from which a new Request with its own callback is built and returned for later scheduling.

(3) If a dict or Item is returned, it can be written to a file via components such as Feed Exports, or processed and stored by a Pipeline if one is configured.

(4) If a Request is returned, then once it succeeds its Response is passed to the callback defined on that Request, where a Selector can again be used to parse the new page and build Items from the extracted data.

       Looping through these steps crawls the entire site.

       First, to build the start URLs we look at how the site categorizes its resources: pornhub groups videos by popularity, total views, rating, and so on:

"""歸納PornHub資源鏈接"""
PH_TYPES = [
    '',
    'recommended',
    'video?o=ht', # hot
    'video?o=mv', # Most Viewed
    'video?o=tr', # Top Rate

    # Examples of certain categories
    # 'video?c=1',  # Category = Asian
    # 'video?c=111',  # Category = Japanese
]
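The spider below simply interpolates each suffix into the site root to build its start URLs. As a quick stdlib-only sanity check of that construction:

```python
PH_TYPES = ['', 'recommended', 'video?o=ht', 'video?o=mv', 'video?o=tr']

# De-duplicate, then build the absolute URLs exactly as start_requests() will.
start_urls = ['https://www.pornhub.com/%s' % t for t in sorted(set(PH_TYPES))]
for url in start_urls:
    print(url)
```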

         Scrapy provides its own data-extraction mechanism, the Selector, which is built on lxml and supports XPath, CSS, and regular expressions, for example:

from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@href="image1.html"]/text()').extract_first()
# CSS
title = selector.css('a[href="image1.html"]::text').extract_first()

 

# -*- coding: utf-8 -*-
import logging
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES
from scrapy.http import Request
import re
import json

class PornhubSpider(CrawlSpider):
    name = 'pornhub'   # spider name, unique within the project
    allowed_domains = ['www.pornhub.com']   # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))   # URL suffixes crawled at startup
    logging.getLogger("requests").setLevel(
        logging.WARNING)  # raise the requests logger to WARNING
    logging.basicConfig(
        level=logging.DEBUG,
        format=
        '%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')
                                           
    # build the start Requests and set the callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type,
                          callback=self.parse_ph_key)

    # follow listing pages, yielding one Request per video
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        # logging.info(selector)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # logging.debug('divs :------>' + div.extract())
            # href contains "viewkey=******"; capture everything up to the closing double quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # logging.debug(viewkey)
            # request the video's watch page: the info we need lives in its page source
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and extract its href: <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0], callback=self.parse_ph_key)
    # parse the watch page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # logging.info(selector)
        # the page embeds a "var flashvars_<id> = {...};" JSON blob in a script tag;
        # grab everything between "=" and the first "," or ";" followed by a newline
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', selector.extract())
        logging.debug('PH info JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        duration = _ph_info_json.get('video_duration')
        phItem['video_duration'] = duration
        title = _ph_info_json.get('video_title')
        phItem['video_title'] = title
        image_urls = _ph_info_json.get('image_url')
        phItem['image_urls'] = image_urls
        link_url = _ph_info_json.get('link_url')
        phItem['link_url'] = link_url
        file_urls = _ph_info_json.get('quality_480p')
        phItem['file_urls'] = file_urls

        yield phItem
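parse_ph_info relies on that `flashvars_<id>` JSON blob embedded in a page script. The extraction can be checked offline; the HTML fragment below is made up (the real page carries many more keys):

```python
import json
import re

# A made-up fragment in the shape the spider expects.
html = (
    '<script>var flashvars_12345 = {"video_duration": "600",'
    ' "video_title": "demo", "link_url": "https://example.com/v"};\n</script>'
)

# Same pattern the spider uses: lazily capture up to the first "," or ";"
# that is immediately followed by a newline.
matches = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', html)
info = json.loads(matches[0])
print(info['video_title'], info['video_duration'])
```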

4. Creating the Item Pipelines

       Once the Spider has parsed a Response, the Item is handed to the Item Pipeline, where the configured pipeline components are called in sequence to perform a chain of processing steps:

  • clean the HTML data
  • validate the scraped data and check the extracted fields
  • detect and drop duplicates
  • store the results in a database

       

import pymongo
from pymongo import IndexModel, ASCENDING
from pornhubBot import items 
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

# store items in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # create unique indexes (a single index would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # the one method every pipeline must implement
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        # check the item type, then store it in MongoDB
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # the '$set' operator replaces the given fields, i.e. updates the record
                self.PhRes.update_one({'video_title': item['video_title']}, {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item

# thumbnail downloader built on Scrapy's media pipelines
# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):

    # customize the thumbnail path (and name); note it is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name  # one directory per thumbnail

    # after the download completes, record the local path (IMAGES_STORE + relative
    # path) in the item's image_paths field
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and download it
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})

# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segment URL list from the item and download the files
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local segment path (and name); note it is relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)  # one directory per segment

    # after the download completes, record the local path list (FILES_STORE +
    # relative paths) in the item's file_paths field
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
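item_completed receives `results` as a list of `(success, info)` 2-tuples, and the list comprehension keeps only the paths of successful downloads. A stdlib-only illustration with fabricated results (in Scrapy the failure entry is a Twisted Failure; a plain exception stands in for it here):

```python
# Fabricated results in the shape the media pipelines pass in.
results = [
    (True,  {'url': 'https://example.com/seg1.mp4', 'path': 'seg1/seg1.mp4'}),
    (False, Exception('download failed')),
    (True,  {'url': 'https://example.com/seg2.mp4', 'path': 'seg2/seg2.mp4'}),
]

# Keep only the local paths of downloads that succeeded.
file_paths = [x['path'] for ok, x in results if ok]
print(file_paths)
```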

         Of course, the call order of the pipelines still has to be defined in Settings; the lower the number, the earlier a pipeline is called.

# Scrapy has built-in Feed exports supporting several serialization formats
# UTF-8 keeps Chinese readable in the exported file
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'CSV'
# download directories for files and images
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs
FILES_URLS_FIELD = 'file_urls'   # item field holding the file URLs
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# filter out images that are too small
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DOWNLOADER_MIDDLEWARES = {
    "pornhubBot.middlewares.UserAgentMiddleware": 401,
    "pornhubBot.middlewares.CookiesMiddleware": 402,
    "pornhubBot.middlewares.ProxyMiddleware": 403,
}
ITEM_PIPELINES = {
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
}
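The numeric values determine call order (lower runs first), so the thumbnail and video pipelines run before the MongoDB pipeline. A quick way to see the effective order:

```python
ITEM_PIPELINES = {
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
    "pornhubBot.pipelines.VideoThumbPipeline": 1,
    "pornhubBot.pipelines.VideoFilesPipeline": 1,
}

# Sort by priority, as Scrapy does when it builds the pipeline chain;
# ties keep their insertion order.
order = [name for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
```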

5. Creating the Middlewares

           pornhub's anti-crawling measures are not especially strict, but before I set up a proxy my IP was banned after only a few crawls, so I use a proxy pool that I maintain myself. The usual crawler counter-measures (User-Agent, cookie, and proxy rotation) all live in the middlewares.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random
import json
import logging
import requests


class UserAgentMiddleware(object):
    """ 換User-Agent """

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # print "**************************" + random.choice(self.agents)
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class CookiesMiddleware(object):
    """ 換Cookie """
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)

class ProxyMiddleware(object):
    # the proxy pool serves a random working proxy at http://localhost:5555/random
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )

    def process_request(self, request, spider):
        # request.meta is a plain dict; 'retry_times' is set by Scrapy's retry
        # middleware, so a proxy is only attached once a request has already failed
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
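CookiesMiddleware rotates the `bs` field by generating a random 32-character lowercase string and splicing it into the cookie template via `%` formatting on the dumped JSON. Checked in isolation (a trimmed-down template; note this trick would break if the template contained a literal `%` anywhere else):

```python
import json
import random

cookie_template = {'platform': 'pc', 'bs': '%s'}

# 32 random lowercase letters ('a' is 97, 'z' is 122 in ASCII).
bs = ''.join(chr(random.randint(97, 122)) for _ in range(32))
cookie = json.loads(json.dumps(cookie_template) % bs)
print(cookie['bs'])
```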

    I keep the common User-Agent strings in Settings:

USER_AGENTS = [
          "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
          "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
          "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
          "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
          "Mozilla/2.02E (Win95; U)",
          "Mozilla/3.01Gold (Win95; I)",
          "Mozilla/4.8 [en] (Windows NT 5.1; U)",
          "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
          "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
          "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          ]

6. Building the Settings

       The remaining configuration is defined here:

# -*- coding: utf-8 -*-

# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'


DOWNLOAD_DELAY = 1  # delay between requests, in seconds
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # defaults to 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# endpoint that returns a random proxy
PROXY_URL = 'http://localhost:5555/random'

# ... feed-export, download-directory, DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES
# settings as shown in section 4 ...
# By default Scrapy stores pending requests in a LIFO queue, i.e. crawls depth-first.
# The following settings switch to FIFO queues for a breadth-first crawl:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
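The difference between the default LIFO queue (depth-first) and the FIFO queues configured above can be seen with a few pending requests (stdlib-only sketch; the page names are made up):

```python
from collections import deque

requests = ['page1', 'page2', 'page3']  # enqueued in this order

lifo = list(requests)
lifo_order = [lifo.pop() for _ in requests]      # default: last in, first out

fifo = deque(requests)
fifo_order = [fifo.popleft() for _ in requests]  # with the FIFO settings above

print(lifo_order, fifo_order)
```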

USER_AGENTS = [
    # ... the full User-Agent list shown in section 5 ...
]

7. Defining a Quick-Start Script

from __future__ import absolute_import
from scrapy import cmdline

cmdline.execute("scrapy crawl pornhub".split())

Running this script starts the crawler.


           And with that, you can study your advanced math offline!
