Background:
Python distribution: Anaconda3
Database: MongoDB
Crawler framework: Scrapy
IDE: PyCharm
Preface:
We have already installed and configured MongoDB; next, let's get to know it better from within our code.
MongoDB installation tutorial: Python Web Crawler (6): Installing and Using MongoDB (Windows)
I. Installing a GUI tool and connecting to the database:
Inspecting MongoDB data from the command line is not very convenient, so it is best to pick a GUI client; here we choose RoboMongo.
The RoboMongo download link:
https://robomongo.org/download
Installation is simple: click Next all the way through, or customize the installation path if you prefer:
Once installed, first start MongoDB, then open RoboMongo.
Connecting RoboMongo to the database:
(1) Create a new MongoDB connection:
(2) Customize the connection name:
Name is the connection name and can be anything you like. Address is the IP address of the database host, here localhost since the database runs locally, followed by the port number; MongoDB's default port is 27017.
I have not set a username or password for my MongoDB database; if you have, configure them on the Authentication tab:
When finished, the left-hand panel shows our database's information.
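If you did protect your database with a username and password, the same credentials must appear in the connection URI used from Python later on. Below is a minimal sketch of building such a URI; `build_mongo_uri` is a hypothetical helper of my own (not part of pymongo), and the user name, password, and `admin` auth database are made-up placeholders:

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host="localhost", port=27017, auth_db="admin"):
    # quote_plus escapes characters such as '@' or ':' that would
    # otherwise break the URI
    return "mongodb://%s:%s@%s:%d/%s" % (
        quote_plus(user), quote_plus(password), host, port, auth_db)

uri = build_mongo_uri("crawler", "p@ss:word")
print(uri)  # mongodb://crawler:p%40ss%3Aword@localhost:27017/admin
```

The resulting string can be passed straight to `pymongo.MongoClient`.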
II. A first test run:
We previously used Scrapy to crawl the novel Zhe Tian from the Biqukan site:
Python Web Crawler (5): Installing and Introducing the Scrapy Framework, with a Hands-on Example
Last time we saved the novel to a txt file; this time we will store the downloaded novel in MongoDB.
1. Connecting to MongoDB from code:
from pymongo import MongoClient

# get a client connection
client = MongoClient("mongodb://127.0.0.1:27017")
# select the sdust database
db = client.sdust
# select the zhetian collection
zhetian = db.zhetian
Here sdust is a database I created in MongoDB, and zhetian is a collection inside the sdust database. Once connected, you can perform any operation on the database.
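With the collection handle in hand, inserting and querying documents is straightforward. The sketch below shows the idea; `save_chapter` is a hypothetical helper of my own, and the document shape matches what the pipeline stores later. The live part assumes a local mongod on the default port, so it is gated behind a flag:

```python
def save_chapter(collection, dir_name, link_url, content):
    """Insert one chapter document into the collection and return it."""
    doc = {"dir_name": dir_name, "dir_url": link_url, "content": content}
    collection.insert_one(doc)
    return doc

RUN_LIVE = False  # set to True to try it against a local mongod
if RUN_LIVE:
    from pymongo import MongoClient  # pip install pymongo
    zhetian = MongoClient("mongodb://127.0.0.1:27017").sdust.zhetian
    save_chapter(zhetian, "Chapter 1", "http://example.com/1.html", "text...")
    print(zhetian.find_one({"dir_name": "Chapter 1"})["dir_url"])
```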
2. The complete project:
Project layout:
The complete project code is listed below:
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhetianItem(scrapy.Item):
    # define the fields for your item here, like:
    # name of each chapter
    dir_name = scrapy.Field()
    # URL of each chapter
    link_url = scrapy.Field()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for zhetian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'zhetian'
SPIDER_MODULES = ['zhetian.spiders']
NEWSPIDER_MODULE = 'zhetian.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zhetian (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'zhetian.pipelines.ZhetianPipeline': 1,
}
TEXT_STORE='F:/爬取的文件/遮天'
# Disable cookies
COOKIES_ENABLED = False
# Delay between requests: 250 ms
DOWNLOAD_DELAY = 0.25
pipelines.py
# -*- coding: utf-8 -*-
from urllib import request

from bs4 import BeautifulSoup
from pymongo import MongoClient


class ZhetianPipeline(object):
    def process_item(self, item, spider):
        # only process items that carry a chapter link
        if "link_url" in item:
            # connect to MongoDB
            client = MongoClient("mongodb://127.0.0.1:27017")
            # select the sdust database
            db = client.sdust
            # select the zhetian collection
            zhetian = db.zhetian
            # download the chapter page
            req = request.Request(url=item['link_url'])
            download_response = request.urlopen(req)
            download_html = download_response.read().decode('gbk', 'ignore')
            soup_texts = BeautifulSoup(download_html, 'lxml')
            texts = soup_texts.find_all(id='content', class_='showtxt')
            soup_text = BeautifulSoup(str(texts), 'lxml')
            write_flag = True
            string = ''
            # filter the chapter text character by character
            for each in soup_text.div.text.replace('\xa0', ''):
                if each == 'h':
                    # the page's trailing link starts with 'h'; stop copying
                    write_flag = False
                if write_flag and each != ' ':
                    string += each
                if write_flag and each == '\r':
                    string += '\n'
            # store the chapter in MongoDB
            zhetian.insert_one({"dir_name": item['dir_name'],
                                "dir_url": item['link_url'],
                                "content": string})
        return item
3. Running:
Open the Anaconda Prompt, change into the project directory, and run the project with the following command:
scrapy crawl biqukan
The project starts running:
The downloaded novel is now visible in RoboMongo:
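The result can also be checked without RoboMongo, straight from a Python shell. A minimal sketch (`summarize` is a hypothetical helper of my own; the live part assumes mongod is still running and the crawl has finished, so it is gated behind a flag):

```python
def summarize(collection):
    """Return (number of stored chapters, name of one stored chapter)."""
    count = collection.count_documents({})
    first = collection.find_one()
    return count, (first["dir_name"] if first else None)

RUN_LIVE = False  # set to True with a local mongod on the default port
if RUN_LIVE:
    from pymongo import MongoClient  # pip install pymongo
    print(summarize(MongoClient("mongodb://127.0.0.1:27017").sdust.zhetian))
```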