Scraping Youku Video List Pages (Movies/TV) with Scrapy

Original

libbit702

2019-01-10 20:34

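The full code is available in the Knowsmore project.

"List page" here means a category entry page on the desktop (PC) site, such as the Movies page.

The scraped data looks like this: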

{
    "link" : "//v.youku.com/v_show/id_XMzMyMzE2MTMxNg==.html",
    "thumb_img" : "http://r1.ykimg.com/051600005AD944F0859B5E040E03BD62",
    "title" : "大毛狗",
    "tag" : [
        "VIP"
    ],
    "actors" : [
        "何明翰",
        "張璇"
    ],
    "play_times" : " 歷史 2,236萬次播放 "
}
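
The `play_times` field in the sample record is stored as raw display text, padding and all. If a numeric count is needed later, something along these lines could convert it. This is only a sketch: `parse_play_times` and `UNIT_MULTIPLIERS` are hypothetical names that are not part of Knowsmore, and it assumes the count always appears as digits followed by an optional 萬/億 suffix (the Simplified forms 万/亿 are accepted too, since the live site may use them).

```python
import re

# Multipliers for the Chinese magnitude suffix that may follow the number.
# Assumption: only these suffixes occur; extend the table if others show up.
UNIT_MULTIPLIERS = {
    "萬": 10_000, "万": 10_000,
    "億": 100_000_000, "亿": 100_000_000,
}

# Digits with optional thousands separators and decimals, then an optional suffix.
PLAY_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([萬万億亿]?)")


def parse_play_times(raw):
    """Return the play count found in `raw` as an int, or None if there is none."""
    if raw is None:
        return None
    match = PLAY_COUNT_RE.search(raw)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = UNIT_MULTIPLIERS.get(match.group(2), 1)
    return int(round(number * multiplier))


if __name__ == "__main__":
    print(parse_play_times(" 歷史 2,236萬次播放 "))
```

Keeping the raw string in the item and converting it in a later step means the spider itself stays unchanged if the display format shifts.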

# -*- coding: utf-8 -*-
import scrapy
import re
import json
from scrapy import Selector, Request
from knowsmore.items import YoukuListItem
from ..common import *
from ..model.mongodb import *

class YoukuListSpider(scrapy.Spider):
    name = "youku_list"

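    # Per-spider settings. The empty mapping replaces any project-level
    # DOWNLOADER_MIDDLEWARES value for this spider.
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES' : {
        }
    }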

    start_urls = [
        'https://list.youku.com/category/show/c_96_s_1_d_4_p_29.html'
    ]

    def parse(self, response):
        GRID_SELECTOR = '.panel .mr1'
        for grid in response.css(GRID_SELECTOR):
            THUMB_IMG_SELECTOR = '.p-thumb img::attr(_src)'
            LINK_SELECTOR = '.info-list .title a::attr(href)'
            TITLE_SELECTOR = '.info-list .title a::text'
            ACTORS_SELECTOR = '.info-list .actor a::text'
            TAG_SELECTOR = '.p-thumb .p-thumb-tagrt span::text'
            PLAY_TIMES_SELECTOR = '.info-list li:nth-child(3)::text'

            item_thumb_img = grid.css(
                THUMB_IMG_SELECTOR).extract_first()
            item_link = grid.css(
                LINK_SELECTOR).extract_first()
            item_title = grid.css(
                TITLE_SELECTOR).extract_first()
            item_actors = grid.css(
                ACTORS_SELECTOR).extract()
            item_tag = grid.css(
                TAG_SELECTOR).extract()
            item_play_times = grid.css(
                PLAY_TIMES_SELECTOR).extract_first()

            # Build Scrapy Item
            youku_item = YoukuListItem(
                thumb_img = item_thumb_img,
                link =  item_link,
                title = item_title,
                actors = item_actors,
                play_times = item_play_times,
                tag = item_tag
            )

            # Send to Pipelines
            yield youku_item


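        # Follow the "next page" link until the last page is reached
        NEXT_PAGE_SELECTOR = '.yk-pages .next a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page is not None:
            self.logger.info('Next page: %s', next_page)
            yield response.follow(next_page, callback=self.parse)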
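
The spider hands each item to the pipelines as scraped, so `link` is still protocol-relative (`//v.youku.com/...`) and `play_times` still carries its surrounding whitespace. A small cleanup pipeline could tidy both before storage. The sketch below is an illustration, not the Knowsmore pipeline: `YoukuListCleanupPipeline` and `BASE_URL` are hypothetical, and it assumes video links should resolve over HTTPS like the start URL.

```python
from urllib.parse import urljoin

# Assumption: links are resolved against the HTTPS list host used in start_urls.
BASE_URL = "https://list.youku.com/"


class YoukuListCleanupPipeline:
    """Hypothetical pipeline that tidies item fields before they are stored."""

    def process_item(self, item, spider):
        # A protocol-relative link ("//host/path") inherits the scheme of BASE_URL.
        if item.get("link"):
            item["link"] = urljoin(BASE_URL, item["link"])
        # Strip the padding whitespace picked up by the ::text selector.
        if item.get("play_times"):
            item["play_times"] = item["play_times"].strip()
        return item
```

To take effect, a class like this would have to be registered in the project's `ITEM_PIPELINES` setting under whatever module path it is saved to.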
