Background:
Python distribution: Anaconda3
Database: MongoDB
Crawler framework: Scrapy
IDE: PyCharm
Preface:
We have already installed and configured MongoDB; next, let's get to know it better from within our code.
MongoDB installation tutorial: Python Web Crawler (6): Installing and Using MongoDB (Windows)
I. Installing a GUI tool and connecting to the database:
Inspecting MongoDB data from the command line is not very convenient, so it is best to pick a GUI client; here we choose RoboMongo.
The RoboMongo download link:
https://robomongo.org/download
Installation is simple: click Next all the way through, or customize the installation path if you prefer:
Once installed, first start MongoDB, then open RoboMongo.
Connecting RoboMongo to the database:
(1) Create a new MongoDB connection:
(2) Customize the connection name:
Name is the connection name and can be anything you like. Address is the IP address of the database host, here localhost since the database runs locally, followed by the port number; MongoDB's default port is 27017.
I have not set a username or password for my MongoDB database; if you have, configure them on the Authentication tab:
When finished, the left-hand panel shows our database's information.
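If you did protect your database with a username and password, the same credentials must appear in the connection URI used from Python later on. Below is a minimal sketch of building such a URI; `build_mongo_uri` is a hypothetical helper of my own (not part of pymongo), and the user name, password, and `admin` auth database are made-up placeholders:

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host="localhost", port=27017, auth_db="admin"):
    # quote_plus escapes characters such as '@' or ':' that would
    # otherwise break the URI
    return "mongodb://%s:%s@%s:%d/%s" % (
        quote_plus(user), quote_plus(password), host, port, auth_db)

uri = build_mongo_uri("crawler", "p@ss:word")
print(uri)  # mongodb://crawler:p%40ss%3Aword@localhost:27017/admin
```

The resulting string can be passed straight to `pymongo.MongoClient`.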
II. A first test run:
We previously used Scrapy to crawl the novel Zhe Tian from the Biqukan site:
Python Web Crawler (5): Installing and Introducing the Scrapy Framework, with a Hands-on Example
Last time we saved the novel to a txt file; this time we will store the downloaded novel in MongoDB.
1. Connecting to MongoDB from code:
from pymongo import MongoClient

# get a client connection
client = MongoClient("mongodb://127.0.0.1:27017")
# select the sdust database
db = client.sdust
# select the zhetian collection
zhetian = db.zhetian
Here sdust is a database I created in MongoDB, and zhetian is a collection inside the sdust database. Once connected, you can perform any operation on the database.
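With the collection handle in hand, inserting and querying documents is straightforward. The sketch below shows the idea; `save_chapter` is a hypothetical helper of my own, and the document shape matches what the pipeline stores later. The live part assumes a local mongod on the default port, so it is gated behind a flag:

```python
def save_chapter(collection, dir_name, link_url, content):
    """Insert one chapter document into the collection and return it."""
    doc = {"dir_name": dir_name, "dir_url": link_url, "content": content}
    collection.insert_one(doc)
    return doc

RUN_LIVE = False  # set to True to try it against a local mongod
if RUN_LIVE:
    from pymongo import MongoClient  # pip install pymongo
    zhetian = MongoClient("mongodb://127.0.0.1:27017").sdust.zhetian
    save_chapter(zhetian, "Chapter 1", "http://example.com/1.html", "text...")
    print(zhetian.find_one({"dir_name": "Chapter 1"})["dir_url"])
```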
2. The complete project:
Project layout:
The complete project code is listed below:
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhetianItem(scrapy.Item):
    # define the fields for your item here, like:
    # name of each chapter
    dir_name = scrapy.Field()
    # URL of each chapter
    link_url = scrapy.Field()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for zhetian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'zhetian'
SPIDER_MODULES = ['zhetian.spiders']
NEWSPIDER_MODULE = 'zhetian.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zhetian (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
'zhetian.pipelines.ZhetianPipeline': 1,
}
TEXT_STORE='F:/爬取的文件/遮天'
# Disable cookies
COOKIES_ENABLED = False
# Delay between requests: 250 ms
DOWNLOAD_DELAY = 0.25
pipelines.py
# -*- coding: utf-8 -*-
from urllib import request

from bs4 import BeautifulSoup
from pymongo import MongoClient


class ZhetianPipeline(object):
    def process_item(self, item, spider):
        # only process items that carry a chapter link
        if "link_url" in item:
            # connect to MongoDB
            client = MongoClient("mongodb://127.0.0.1:27017")
            # select the sdust database
            db = client.sdust
            # select the zhetian collection
            zhetian = db.zhetian
            # download the chapter page
            req = request.Request(url=item['link_url'])
            download_response = request.urlopen(req)
            download_html = download_response.read().decode('gbk', 'ignore')
            soup_texts = BeautifulSoup(download_html, 'lxml')
            texts = soup_texts.find_all(id='content', class_='showtxt')
            soup_text = BeautifulSoup(str(texts), 'lxml')
            write_flag = True
            string = ''
            # filter the chapter text character by character
            for each in soup_text.div.text.replace('\xa0', ''):
                if each == 'h':
                    # the page's trailing link starts with 'h'; stop copying
                    write_flag = False
                if write_flag and each != ' ':
                    string += each
                if write_flag and each == '\r':
                    string += '\n'
            # store the chapter in MongoDB
            zhetian.insert_one({"dir_name": item['dir_name'],
                                "dir_url": item['link_url'],
                                "content": string})
        return item
3. Running:
Open the Anaconda Prompt, change into the project directory, and run the project with the following command:
scrapy crawl biqukan
The project starts running:
The downloaded novel is now visible in RoboMongo:
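The result can also be checked without RoboMongo, straight from a Python shell. A minimal sketch (`summarize` is a hypothetical helper of my own; the live part assumes mongod is still running and the crawl has finished, so it is gated behind a flag):

```python
def summarize(collection):
    """Return (number of stored chapters, name of one stored chapter)."""
    count = collection.count_documents({})
    first = collection.find_one()
    return count, (first["dir_name"] if first else None)

RUN_LIVE = False  # set to True with a local mongod on the default port
if RUN_LIVE:
    from pymongo import MongoClient  # pip install pymongo
    print(summarize(MongoClient("mongodb://127.0.0.1:27017").sdust.zhetian))
```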