《Python網絡爬蟲實戰》 Reading Notes 2

More Powerful Crawlers

Website Anti-Crawling Measures

Websites commonly use the following anti-crawler strategies:

  • 1. Inspecting request headers (see the minimal sketch after this list)
  • 2. Using AJAX and dynamic loading
  • 3. CAPTCHAs
  • 4. Altering the information returned by the server
  • 5. Rate-limiting or banning IPs
  • 6. Modifying page or URL content
  • 7. Account restrictions
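
For the first point, the simplest counter-measure on the crawler's side is to send browser-like request headers so the request does not look like the default python-requests client. A minimal sketch (the header values mirror the ones used later in this post; the URL is just a placeholder):

import requests

# headers copied from a real browser session; User-Agent matters most
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/36.0.1985.125 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}
res = requests.get('https://example.com/', headers=headers)  # placeholder URL
print(res.status_code)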

The example below fetches public proxy IPs from 西刺代理 (xicidaili), then uses them with requests to visit CSDN blog links; the most reusable part is the proxy-harvesting code.

# Increase blog page views
import re, random, requests, logging
from lxml import html
from multiprocessing.dummy import Pool as ThreadPool

logging.basicConfig(level=logging.DEBUG)
TIME_OUT = 6  # request timeout in seconds
count = 0
proxies = []
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, sdch, br',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           'Connection': 'keep-alive',
           'Cache-Control': 'max-age=0',
           'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/36.0.1985.125 Safari/537.36',
           }
PROXY_URL = 'http://www.xicidaili.com/'


def GetProxies():
   global proxies
   try:
      res = requests.get(PROXY_URL, headers=headers)
   except requests.RequestException:
      logging.error('Visit failed')
      return

   ht = html.fromstring(res.text)
   raw_proxy_list = ht.xpath('//*[@id="ip_list"]/tr[@class="odd"]')
   for item in raw_proxy_list:
      if item.xpath('./td[6]/text()')[0] == 'HTTP':
         proxies.append(
            dict(
               http='{}:{}'.format(
                  item.xpath('./td[2]/text()')[0], item.xpath('./td[3]/text()')[0])
            )
         )


# Get the list of blog article links
def GetArticles(url):
   res = GetRequest(url, prox=None)
   page = res.content.decode('utf-8')
   rgx = '<li class="blog-unit">[ \n\t]*<a href="(.+?)" target="_blank">'
   ptn = re.compile(rgx)
   blog_list = re.findall(ptn, page)
   return blog_list

def GetRequest(url, prox):
   req = requests.get(url, headers=headers, proxies=prox, timeout=TIME_OUT)
   return req

# Visit a blog post through a proxy
def VisitWithProxy(url):
   proxy = random.choice(proxies)  # pick a random proxy
   GetRequest(url, proxy)

# Visit repeatedly
def VisitLoop(url):
   for i in range(count):
      logging.debug('Visiting:\t{}\tfor {} times'.format(url, i))
      VisitWithProxy(url)


if __name__ == '__main__':
   GetProxies()  # fetch the proxy list
   logging.debug('We got {} proxies'.format(len(proxies)))
   BlogUrl = input('Blog Address:').strip(' ')
   logging.debug('Gonna visit {}'.format(BlogUrl))
   try:
      count = int(input('Visiting Count:'))
   except ValueError:
      logging.error('Arg error!')
      quit()
   if count == 0 or count > 200:
      logging.error('Count illegal')
      quit()

   article_list = GetArticles(BlogUrl)
   if len(article_list) == 0:
      logging.error('No articles, error!')
      quit()

   # normalize the links into a new list (appending to article_list while
   # iterating over it would loop forever)
   full_links = []
   for each_link in article_list:
      if 'https://blog.csdn.net' not in each_link:
         each_link = 'https://blog.csdn.net' + each_link
      full_links.append(each_link)
   # multithreaded visiting
   pool = ThreadPool(max(1, len(full_links) // 4))
   results = pool.map(VisitLoop, full_links)
   pool.close()
   pool.join()
   logging.debug('Task Done')

Multiprocess Programming and Asynchronous Crawling

The code above used ThreadPool, which is actually not as efficient as multiprocessing; the example below compares single-process and multi-process crawling.

import requests
import datetime
import multiprocessing as mp

def crawl(url, data):  # fetch one page
  text = requests.get(url=url, params=data).text
  return text

def func(page):  # crawl one page of comments
  url = "https://book.douban.com/subject/4117922/comments/hot"
  data = {
    "p": page
  }
  text = crawl(url, data)
  print("Crawling : page No.{}".format(page))

if __name__ == '__main__':

  start = datetime.datetime.now()
  start_page = 1
  end_page = 50

  # Multi-process crawling
  # pages = [i for i in range(start_page, end_page)]
  # p = mp.Pool()
  # p.map_async(func, pages)
  # p.close()
  # p.join()


  # Single-process crawling
  page = start_page

  for page in range(start_page, end_page):
    url = "https://book.douban.com/subject/4117922/comments/hot"
    # GET parameters
    data = {
      "p": page
    }
    content = crawl(url, data)
    print("Crawling : page No.{}".format(page))

  end = datetime.datetime.now()
  print("Time\t: ", end - start)

Single-process timing:

Time	:  0:00:02.607331

Timing after switching to multiprocessing:

Time	:  0:00:01.261326

The more pages there are, the larger the gap becomes.

Fetching Data Asynchronously

Another way to improve crawling performance is to introduce asynchronous I/O (implemented, for example, with the asyncio and aiohttp libraries). Because the code is asynchronous, the program does not have to wait for one HTTP request to finish before moving on to the next task, which is especially important for crawler performance when fetching pages in bulk.

import aiohttp
import asyncio
# Example of fetching a page with aiohttp
async def fetch(session, url):
  # analogous to requests.get
  async with session.get(url) as response:
    return await response.text()

# single-threaded concurrent I/O with async/await
async def main():
  # analogous to a Session object in requests
  async with aiohttp.ClientSession() as session:
    html = await fetch(session, 'http://httpbin.org/headers')
    print(html)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
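
The real benefit shows up when many pages are fetched concurrently, for example with asyncio.gather. Below is a minimal sketch reusing the same fetch coroutine; the URL list is only a placeholder:

import aiohttp
import asyncio

async def fetch(session, url):
  async with session.get(url) as response:
    return await response.text()

async def crawl_all(urls):
  async with aiohttp.ClientSession() as session:
    # schedule every request at once and wait for all of them together
    tasks = [fetch(session, url) for url in urls]
    return await asyncio.gather(*tasks)

urls = ['http://httpbin.org/delay/1'] * 5  # placeholder URLs for illustration
pages = asyncio.get_event_loop().run_until_complete(crawl_all(urls))
print(len(pages))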

More Varied Crawlers

Writing a Scrapy Crawler

See the related blog post: 《用Python寫網絡爬蟲》讀書筆記3

The most commonly used commands are:

  • startproject: create a new project
  • genspider: generate a new spider from a template
  • crawl: run a spider
  • shell: start the interactive scraping console (a short example follows this list)
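
The shell command is handy for testing selectors interactively before writing a spider. A minimal session might look like this (the URL and XPath are only examples, matching the hupu spider shown further down):

scrapy shell "https://bbs.hupu.com/xuefu"
# inside the console a response object for the fetched page is available
response.status
response.xpath("//div[@class='titlelink box']/a/@href").extract()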

Create a new Scrapy project

scrapy startproject example

Create the spider

cd into the example folder and run the following command:

scrapy genspider hupu_link https://bbs.hupu.com/xuefu --template=crawl

A file named hupu_link.py appears in the directory; that is where our code lives. Fill it in with the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from ..items import HupuItem


class HupuLinkSpider(CrawlSpider):
    name = 'hupu_link'
    allowed_domains = ['bbs.hupu.com']  # domains only, not full URLs
    start_urls = ['https://bbs.hupu.com/xuefu-1/']

    def parse(self, response):
        # Ideally we would read the last page number from the page here; for simplicity it is hard-coded to 4
        page_max = 4
        append_urls = ['https://bbs.hupu.com/xuefu-%d' % (i + 1) for i in range(page_max)]
        for url in append_urls:
            yield Request(url, callback=self.parse_item,  dont_filter=True)

    # Extract the title and link of every post on the current page
    def parse_item(self, response):
        item = HupuItem()
        all_item_class = response.xpath("//div[@class='titlelink box']")
        for item_class in all_item_class:
            link = 'https://bbs.hupu.com' + item_class.xpath("a/@href").extract()[0]
            title = item_class.xpath("a/text()").extract()[0]
            item["title"] = title
            item["link"] = link
            yield item
            print(title, link)

Next, edit items.py:

import scrapy

class HupuItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

pipelines.py saves each item to a CSV file:

import pandas as pd

class HupuPipeline(object):
    def process_item(self, item, spider):
        data = pd.DataFrame(dict(item), index=[0])
        data.to_csv('hupu.csv', mode='a+', index=False, sep=',', header=False, encoding="gbk")
        return item

Add the following to settings.py:

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 1
ITEM_PIPELINES = {
   'example.pipelines.HupuPipeline': 300,
}

Run the spider from the example folder with:

scrapy crawl hupu_link

The console prints the scraped titles and links, and a hupu.csv file appears in the project directory.

Scrapyd

Install it with:

pip install scrapyd

To run it, simply execute scrapyd on the command line.

By default scrapyd listens on 0.0.0.0:6800; once it is running, open http://localhost:6800/ in a browser to see the projects that are currently available to run.

The API is driven by sending HTTP requests (for example with curl) to scrapyd; the official reference is: https://scrapyd.readthedocs.io/en/latest/api.html
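
Since the rest of this post is in Python, the same API can also be called with requests instead of curl; a sketch assuming the example project and the hupu_link spider from above:

import requests

# schedule a crawl job on the local scrapyd instance (schedule.json endpoint)
resp = requests.post('http://127.0.0.1:6800/schedule.json',
                     data={'project': 'example', 'spider': 'hupu_link'})
print(resp.json())  # contains the job id on success

# list pending/running/finished jobs of the project (listjobs.json endpoint)
print(requests.get('http://127.0.0.1:6800/listjobs.json',
                   params={'project': 'example'}).json())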

Edit the scrapy.cfg file in the project directory to set the port and the project, for example:

[deploy]
#url = http://localhost:6800/
#project = example
url = http://127.0.0.1:6800/
project = example

If there are several projects on the same machine, simply give each one a different port.
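
For example, a scrapy.cfg with two named deploy targets pointing at scrapyd instances on different ports could look like this (the target names and the second port/project are made up for illustration):

[deploy:local_6800]
url = http://127.0.0.1:6800/
project = example

[deploy:local_6801]
url = http://127.0.0.1:6801/
project = another_example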

The official scrapyd-client tool makes packaging (the project is packaged as an egg) and running commands more convenient, but Gerapy, used below, is more convenient still.

Deploying and Managing Crawlers with Gerapy

Install it with:

pip install gerapy

Initialize gerapy:

gerapy init

A gerapy folder is created.

Enter that folder and run the database initialization command, which generates a dbs folder:

gerapy migrate

Create a username and password:

gerapy createsuperuser

Start the service:

gerapy runserver

Then open it in a browser:

http://localhost:8000/

After logging in you land on the Gerapy dashboard.

Add a host (you can register several hosts on one machine; just change the port in each project's cfg file).

Click Create under the host management tab and fill in the scrapyd host details.

Check that the status is normal; if it is not, fix the configuration.

Copy the Scrapy project into gerapy's projects folder; it then shows up under project management.

Click deploy and build the package first; the build information appears on the left.

Then deploy it to the hosts in bulk.

You can see the corresponding egg file under the example project.

Back in host management, the schedule button at the end lets you start running tasks.

While the task is running you can watch its output; after it finishes, an end time is shown as well.

The content shown here is the same as the scrapyd log.

Crawler Practice: Downloading Novels and Shopping Reviews

Crawling a Novel Site

First create a Chrome WebDriver, open the novel's "all chapters" page and collect every chapter link into a list, then loop over the list, fetch the text of each chapter one by one, and save it to a txt file.

import selenium.webdriver, time, re
from selenium.common.exceptions import WebDriverException

class NovelSpider():
  def __init__(self, url):
    self.homepage = url
    self.driver = selenium.webdriver.Chrome(r'C:\Users\zeng\AppData\Local\Google\Chrome\Application\chromedriver.exe')
    self.page_list = []

  def __del__(self):
    self.driver.quit()

  def get_page_urls(self):  # collect the chapter page URLs
    homepage = self.homepage
    self.driver.get(homepage)
    self.driver.save_screenshot('screenshot.png')  # save a screenshot of the page

    self.driver.implicitly_wait(5)
    elements = self.driver.find_elements_by_tag_name('a')

    for one in elements:
      page_url = one.get_attribute('href')

      pattern = '^http:\/\/book\.zhulang\.com\/\d{6}\/\d+\.html'
      if page_url and re.match(pattern, page_url):  # skip anchors without an href
        print(page_url)
        self.page_list.append(page_url)

  def looping_crawl(self):
    homepage = self.homepage
    filename = self.get_novel_name(homepage) + '.txt'
    self.get_page_urls()  # collect all chapter links of the novel
    pages = self.page_list
    # print(pages)

    for page in pages:  # loop over the chapter links
      self.driver.get(page)
      print('Next page:')

      self.driver.implicitly_wait(3)
      title = self.driver.find_element_by_tag_name('h2').text
      res = self.driver.find_element_by_id('read-content')
      text = '\n' + title + '\n'
      for one in res.find_elements_by_xpath('./p'):
        text += one.text
        text += '\n'

      self.text_to_txt(text, filename)  # append the chapter text to the txt file
      time.sleep(1)
      print(page + '\t\t\tis Done!')

  def get_novel_name(self, homepage):  # get the book title and turn it into a txt filename
    self.driver.get(homepage)
    self.driver.implicitly_wait(2)  # implicit wait: sets a maximum wait, but continues as soon as the page has loaded

    res = self.driver.find_element_by_tag_name('strong').find_element_by_xpath('./a')
    if res is not None and len(res.text) > 0:
      return res.text
    else:
      return 'novel'

  def text_to_txt(self, text, filename):  # save the scraped chapter text
    if filename[-4:] != '.txt':
      print('Error, incorrect filename')
    else:
      with open(filename, 'a', encoding='utf-8') as fp:  # utf-8 so Chinese text is written safely
        fp.write(text)
        fp.write('\n')


if __name__ == '__main__':
  # hp_url = input('Enter the "all chapters" page URL of the novel:')
  hp_url = "http://book.zhulang.com/740053/"
  try:
    sp1 = NovelSpider(hp_url)
    sp1.looping_crawl()
    del sp1
  except WebDriverException as e:
    print(e.msg)

Downloading JD.com Shopping Reviews

Use Chrome DevTools to find the URL that JD.com uses to fetch reviews; by assembling that URL with the right parameters you can retrieve the reviews shown on the page.

Using requests.session keeps the previous session alive, and passing params to requests.get instead of concatenating them onto the URL is the cleaner approach, as in the following code:

p_data = {
  'callback': 'fetchJSON_comment98vv242411',
  'score': 0,
  'sortType': 3,
  'page': 0,
  'pageSize': 10,
  'isShadowSku': 0,
}
response = ses.get(comment_json_url, params=p_data)

After the reviews are extracted from the JSON, they are written to a CSV file, and every review is also appended to the content_sentences string; jieba.analyse.extract_tags then extracts the top 20 keywords. The full example code follows:

import requests, json, time, logging, random, csv, lxml.html, jieba.analyse
from pprint import pprint
from datetime import datetime


# JD.com product review crawler
class JDComment():
  _itemurl = ''

  def __init__(self, url, page):
    self._itemurl = url
    self._checkdate = None
    logging.basicConfig(
      # filename='app.log',
      level=logging.INFO,
    )
    self.content_sentences = ''
    self.max_page = page

  def go_on_check(self, date, page):
    go_on = self.date_check(date) and page <= self.max_page
    return go_on

  def set_checkdate(self, date):
    self._checkdate = datetime.strptime(date, '%Y-%m-%d')

  def get_comment_from_item_url(self):
    comment_json_url = 'https://sclub.jd.com/comment/productPageComments.action'
    p_data = {
      'callback': 'fetchJSON_comment98vv242411',
      'score': 0,
      'sortType': 3,
      'page': 0,
      'pageSize': 10,
      'isShadowSku': 0,
    }
    p_data['productId'] = self.item_id_extracter_from_url(self._itemurl)
    ses = requests.session()
    go_on = True
    while go_on:
      response = ses.get(comment_json_url, params=p_data)
      logging.info('-' * 10 + 'Next page!' + '-' * 10)
      if response.ok:

        r_text = response.text
        r_text = r_text[r_text.find('({') + 1:]
        r_text = r_text[:r_text.find(');')]
        # print(r_text)
        js1 = json.loads(r_text)

        # print(js1['comments'])
        for comment in js1['comments']:
          go_on = self.go_on_check(comment['referenceTime'], p_data['page'])
          logging.info('{}\t{}\t{}\t'.format(comment['content'], comment['referenceTime'],
                                               comment['nickname']))

          self.content_process(comment)
          self.content_sentences += comment['content']

      else:
        logging.error('Status NOT OK')
        break

      p_data['page'] += 1
      self.random_sleep()  # delay

  def item_id_extracter_from_url(self, url):
    item_id = 0

    prefix = 'item.jd.com/'
    index = str(url).find(prefix)
    if index != -1:
      item_id = url[index + len(prefix): url.find('.html')]

    if item_id != 0:
      return item_id

  def date_check(self, date_here):
    if self._checkdate is None:
      logging.warning('You have not set the checkdate')
      return True
    else:
      dt_tocheck = datetime.strptime(date_here, '%Y-%m-%d %H:%M:%S')
      if dt_tocheck > self._checkdate:
        return True
      else:
        logging.error('Date overflow')
        return False

  def content_process(self, comment):
    with open('jd-comments-res.csv', 'a') as csvfile:
      writer = csv.writer(csvfile, delimiter=',')
      writer.writerow([comment['content'], comment['referenceTime'],
                       comment['nickname']])

  def random_sleep(self, gap=1.0):
    # gap = 1.0
    bias = random.randint(-20, 20)
    gap += float(bias) / 100
    time.sleep(gap)

  def get_keywords(self):
    content = self.content_sentences
    kws = jieba.analyse.extract_tags(content, topK=20)
    return kws

if __name__ == '__main__':
  # url = input("Enter the product URL:")
  url = "https://item.jd.com/6088552.html"
  # date_str = input("Enter the cut-off date (YYYY-MM-DD):")
  date_str = "2019-10-10"
  # page_num = int(input("Enter the maximum number of pages to crawl:"))
  page_num = 4
  jd1 = JDComment(url, page_num)
  jd1.set_checkdate(date_str)
  print(jd1.get_comment_from_item_url())
  print(jd1.get_keywords())

The way the date comparison (newer or older) is done is also interesting; it uses the code below (if it were me, I might convert both sides to Unix timestamps first and compare the numbers).

dt_tocheck = datetime.strptime(date_here, '%Y-%m-%d %H:%M:%S')
if dt_tocheck > self._checkdate:
    return True
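
A rough sketch of that timestamp-based alternative (the dates are just sample values):

from datetime import datetime

date_here = '2019-10-12 08:30:00'   # sample review time
check_date = '2019-10-10'           # sample cut-off date

# convert both sides to Unix timestamps and compare plain numbers
ts_here = datetime.strptime(date_here, '%Y-%m-%d %H:%M:%S').timestamp()
ts_limit = datetime.strptime(check_date, '%Y-%m-%d').timestamp()
print(ts_here > ts_limit)  # True means the review is newer than the cut-off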

Crawler Practice: Saving Pictures You Are Interested In

This example logs in to douban, visits the personal movie-watching history and downloads the corresponding pictures (since I don't have such records myself, I skipped running it). The procedure is routine: request the login URL to sign in to douban; if a captcha appears, show_an_online_img is called to display it so the user can type it in.

def show_an_online_img(self, url):
    path = self.download_img(url, 'online_img')
    img = Image.open(path)
    img.show()
    os.remove(path)

The calling code is:

if len(response1.xpath('//*[@id="captcha_image"]')) > 0:
    self._captcha_url = response1.xpath('//*[@id="captcha_image"]/@src')[0]
    print(self._captcha_url)
    self.show_an_online_img(url=self._captcha_url)
    captcha_value = input("Enter the captcha shown in the image: ")
    login_data['captcha-solution'] = captcha_value

Crawler Practice: Analyzing Online Movie Reviews

The example below takes a movie on Douban, fetches the first 15 pages of comments and uses jieba to compute word frequencies.

import jieba, numpy, re, time, matplotlib, requests, logging, snownlp, threading
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from queue import Queue
from lxml.html import fromstring

matplotlib.rcParams['font.sans-serif'] = ['KaiTi']
matplotlib.rcParams['font.serif'] = ['KaiTi']

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           'Connection': 'keep-alive',
           'Cache-Control': 'max-age=0',
           'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           }
NOW_PLAYING_URL = 'https://movie.douban.com/nowplaying/beijing/'
logging.basicConfig(level=logging.DEBUG)


class MyThread(threading.Thread):
   CommentList = []
   Que = Queue()

   def __init__(self, i, MovieID):
      super(MyThread, self).__init__()
      self.name = '{}th thread'.format(i)
      self.movie = MovieID

   def run(self):
      logging.debug('Now running:\t{}'.format(self.name))
      while not MyThread.Que.empty():
         page = MyThread.Que.get()
         commentList_temp = GetCommentsByID(self.movie, page + 1)
         MyThread.CommentList.append(commentList_temp)
         MyThread.Que.task_done()


def MovieURLtoID(url):
   res = int(re.search('(\D+)(\d+)(\/)', url).group(2))
   return res

def GetCommentsByID(MovieID, PageNum):
   result_list = []
   if PageNum > 0:
      start = (PageNum - 1) * 20
   else:
      logging.error('PageNum illegal!')
      return False

   url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20'.format(MovieID, str(start))
   logging.debug('Handling :\t{}'.format(url))
   resp = requests.get(url,headers=HEADERS)
   resp.content.decode('utf-8')
   html_data = resp.text
   tree = fromstring(html_data)
   all_comments = tree.xpath("//span[@class='short']//text()")
   for comment in all_comments:
       result_list.append(comment)
   time.sleep(2) # Pause for several seconds
   return result_list

def DFGraphBar(df):
   df.plot(kind="bar", title='Words Freq', x='seg', y='freq')
   plt.savefig("output.jpg")
   plt.show()

def WordFrequence(MaxPage=15, ThreadNum=8, movie=None):
   # fetch the movie's comments page by page with worker threads
   if not movie:
      logging.error('No movie here')
      return
   else:
      MovieID = movie

   for page in range(MaxPage):
      MyThread.Que.put(page)

   threads = []
   for i in range(ThreadNum):
      work_thread = MyThread(i, MovieID)
      work_thread.setDaemon(True)
      threads.append(work_thread)
   for thread in threads:
      thread.start()

   MyThread.Que.join()
   CommentList = MyThread.CommentList

   # print(CommentList)
   comments = ''
   for one in range(len(CommentList)):
      new_comment = (str(CommentList[one])).strip()
      new_comment = re.sub('[-\\ \',\.n()#…/\n\[\]!~]', '', new_comment)
      # clean the text with a regex, mainly stripping punctuation
      comments = comments + new_comment

   pprint(SumOfComment(comments))  # print a short text summary (SnowNLP)
   # Chinese word segmentation
   segments = jieba.lcut(comments)
   WordDF = pd.DataFrame({'seg': segments})

   # remove stop words
   stopwords = pd.read_csv("stopwordsChinese.txt",
                           index_col=False,
                           names=['stopword'],
                           encoding='utf-8')

   WordDF = WordDF[~WordDF.seg.isin(stopwords.stopword)]  # keep only words NOT in the stop word list

   # count word frequencies (named aggregation; the older dict form fails on newer pandas)
   WordAnal = WordDF.groupby(by=['seg'])['seg'].agg(freq=numpy.size)
   WordAnal = WordAnal.reset_index().sort_values(by=['freq'], ascending=False)
   WordAnal = WordAnal[0:40]  # keep only the top 40 high-frequency words

   print(WordAnal)
   return WordAnal


def SumOfComment(comment):
   s = snownlp.SnowNLP(comment)
   sum = s.summary(5)
   return sum

# entry point
if __name__ == '__main__':
   DFGraphBar(WordFrequence(movie=MovieURLtoID('https://movie.douban.com/subject/1291575/')))

The word-frequency output is then printed and plotted (because the stop-word list is not great, some of the words are completely meaningless).

Crawler Practice: Using the PySpider Framework

pip install pyspider

installs it. After installation, run

pyspider all

to start the service. The web UI is then available at http://localhost:5000/. You may run into problems when starting the service; if so, see: 解決pyspider:ValueError: Invalid configuration

The following example collects post links from hupu:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-04-23 21:06:24
# Project: hupu

from pyspider.libs.base_handler import *
import re

class Handler(BaseHandler):
  crawl_config = {
  }

  @every(minutes=24 * 60)
  def on_start(self):
    self.crawl('https://bbs.hupu.com/xuefu', fetch_type='js', callback=self.index_page)

  @config(age=10 * 24 * 60 * 60)
  def index_page(self, response):
    for each in response.doc('a[href^=http]').items():
      url = each.attr.href
      if re.match(r'^http\S*://bbs.hupu.com/\d+.html$', url):
        self.crawl(url, fetch_type='js', callback=self.detail_page)

    next_page_url = response.doc(
      '#container > div > div.bbsHotPit > div.showpage > div.page.downpage > div > a.nextPage').attr.href

    if int(next_page_url[-1]) > 30:
      raise ValueError

    self.crawl(next_page_url,
               fetch_type='js',
               callback=self.index_page)

  @config(priority=2)
  def detail_page(self, response):
    return {
      "url": response.url,
      "title": response.doc('#j_data').text(),
    }

Copy the code into the right-hand panel and save it, then click run on the left, open follows at the bottom, and finally click the green arrow to continue.

You can then inspect the results: the detail_page method extracts each post's title and link, while index_page carries out the crawl and can also follow the next page, and so on.

Back on the dashboard, set the status to RUNNING and click Run, and the task starts.

After a short wait, click results on the page to see the scraped data.

Download link for all the code in this post:

https://download.csdn.net/download/zengraoli/12366948
