To start, here are the official definitions of the software covered in this article:

Scrapy: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

Scrapyd: Scrapy comes with a built-in service, called "Scrapyd", which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.

ScrapydWeb: A full-featured web UI for Scrapyd cluster management, with Scrapy log analysis & visualization supported.

Docker Container: A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.
The Dockerfile below is adapted from aciobanu/scrapy:
FROM alpine:latest
# NOTE: the py-* packages below target Python 2 and were removed from newer
# Alpine releases; you may need to pin an older tag (e.g. alpine:3.8).
# Use the TUNA mirror (mainland China) to speed up apk downloads
RUN echo "https://mirror.tuna.tsinghua.edu.cn/alpine/latest-stable/main/" > /etc/apk/repositories
RUN apk update \
&& apk upgrade
RUN apk -U add \
gcc \
bash \
bash-doc \
bash-completion \
libffi-dev \
libxml2-dev \
libxslt-dev \
musl-dev \
openssl-dev \
python-dev \
py-imaging \
py-pip \
curl ca-certificates \
&& update-ca-certificates \
&& rm -rf /var/cache/apk/* \
&& pip install --upgrade pip \
&& pip install Scrapy
WORKDIR /runtime/app
# Extra components installed on top of the base image
RUN pip install scrapyd \
&& pip install scrapyd-client \
&& pip install scrapydweb
# Components needed by my specific spider projects; safe to skip
RUN pip install fake_useragent \
&& pip install scrapy_proxies \
&& pip install sqlalchemy \
&& pip install mongoengine
# Expose the port used by scrapydweb inside the container: 5000
EXPOSE 5000
The directory layout is:
root directory (host /usr/local/src/scrapy-d-web, mounted at /runtime/app)
Dockerfile - after editing, run [docker build -t scrapy-d-web:v1 .] to build the image
scrapyd - scrapyd's configuration file and related directories
scrapydweb - scrapydweb's configuration file
knowsmore - project directory 1, created with scrapy startproject
pxn - project directory 2, created with scrapy startproject
Now let's start each service by hand and walk through it step by step. First start the container and enter a shell:
docker network create --subnet=192.168.0.0/16 mynetwork # create a user-defined network
docker run -it --rm --net mynetwork --ip 192.168.1.100 --name scrapyd -p 5000:5000 -v /usr/local/src/scrapy-d-web/:/runtime/app scrapy-d-web:v1 /bin/sh # fix the container's IP and name; mount the project directory and map port 5000
Change to the directory containing scrapyd.conf (/runtime/app/scrapyd) and start scrapyd. See the official scrapyd documentation for the order in which configuration files take effect; the official example configuration follows. Note that with the default bind_address = 127.0.0.1, scrapyd is only reachable via loopback; since ScrapydWeb below addresses it as 192.168.1.100:6800, set bind_address = 0.0.0.0 in your own scrapyd.conf.
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
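Once scrapyd is listening, spiders are controlled through the JSON endpoints listed under [services]. As a sketch, a request to schedule.json can be built like this (the spider name myspider is hypothetical; knowsmore is one of the project directories above):

```python
from urllib.parse import urlencode
from urllib.request import Request

def schedule_request(host, project, spider, port=6800):
    # schedule.json expects a POST with form-encoded 'project' and 'spider'
    url = "http://{}:{}/schedule.json".format(host, port)
    body = urlencode({"project": project, "spider": spider}).encode()
    return Request(url, data=body, method="POST")

req = schedule_request("192.168.1.100", "knowsmore", "myspider")
```

Sending it with urllib.request.urlopen(req) against a running scrapyd returns a JSON body containing the jobid of the newly started crawl.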
Open another terminal, attach to the running container, change to the directory containing ScrapydWeb's configuration file (/runtime/app/scrapydweb), and start scrapydweb:
docker exec -it scrapyd /bin/bash
See the project's GitHub page for details; my configuration follows:
############################## ScrapydWeb #####################################
# Setting SCRAPYDWEB_BIND to '0.0.0.0' or IP-OF-CURRENT-HOST would make
# ScrapydWeb server visible externally, otherwise, set it to '127.0.0.1'.
# The default is '0.0.0.0'.
SCRAPYDWEB_BIND = '0.0.0.0'
# Accept connections on the specified port, the default is 5000.
SCRAPYDWEB_PORT = 5000
# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = 'user'
PASSWORD = 'pass'
############################## Scrapy #########################################
# ScrapydWeb is able to locate projects in the SCRAPY_PROJECTS_DIR,
# so that you can simply select a project to deploy, instead of eggifying it in advance.
# e.g., 'C:/Users/username/myprojects/' or '/home/username/myprojects/'
SCRAPY_PROJECTS_DIR = '/runtime/app/'
############################## Scrapyd ########################################
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed
# and started on all of your hosts.
# Note that for remote access, you have to manually set 'bind_address = 0.0.0.0'
# in the configuration file of Scrapyd and restart Scrapyd to make it visible externally.
# Check out 'https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file' for more info.
# ------------------------------ Chinese --------------------------------------
# 請先確保所有主機都已經安裝和啓動 [Scrapyd](https://github.com/scrapy/scrapyd)。
# 如需遠程訪問 Scrapyd,則需在 Scrapyd 配置文件中設置 'bind_address = 0.0.0.0',然後重啓 Scrapyd。
# 詳見 https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file
# - the string format: username:password@ip:port#group
# - The default port would be 6800 if not provided,
# - Both basic auth and group are optional.
# - e.g., '127.0.0.1' or 'username:[email protected]:6801#group'
# - the tuple format: (username, password, ip, port, group)
# - When the username, password, or group is too complicated (e.g., contains ':@#'),
# - or if ScrapydWeb fails to parse the string format passed in,
# - it's recommended to pass in a tuple of 5 elements.
# - e.g., ('', '', '127.0.0.1', '', '') or ('username', 'password', '192.168.123.123', '6801', 'group')
SCRAPYD_SERVERS = [
'192.168.1.100:6800',
# 'username:password@localhost:6801#group',
# ('username', 'password', 'localhost', '6801', 'group'),
]
# If the IP part of a Scrapyd server is added as '127.0.0.1' in the SCRAPYD_SERVERS above,
# ScrapydWeb would try to read Scrapy logs directly from disk, instead of making a request
# to the Scrapyd server.
# Check out this link to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
# e.g., 'C:/Users/username/logs/' or '/home/username/logs/'
SCRAPYD_LOGS_DIR = '/runtime/app/scrapyd/logs/'
Then visit http://[YOUR IP ADDRESS]:5000 and you are done.
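Since ENABLE_AUTH is True in the configuration above, any scripted request to the ScrapydWeb UI needs HTTP Basic credentials. A minimal sketch of building the header from the configured USERNAME and PASSWORD:

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic auth: base64-encode "username:password"
    token = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    return {"Authorization": "Basic " + token}

headers = basic_auth_header("user", "pass")
```

Pass these headers to your HTTP client when scripting against the UI, e.g. urllib.request.Request("http://127.0.0.1:5000/", headers=headers).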