RabbitMQ Crawler

Exchange Types

RabbitMQ provides four exchange types: fanout, direct, topic, and headers. The three in common use are fanout, direct, and topic.

Direct

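  • Every message is published with a routing_key. With the default (nameless) exchange, it is simply the name of the destination queue.
  • The default exchange needs no binding, because every queue is bound to it automatically under its own name. A named direct exchange such as direct_logs below does need one: a queue receives a message only if it was bound with a key equal to the message's routing_key.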
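
The exact-match rule of a named direct exchange can be illustrated without a broker. What follows is a minimal in-memory sketch, not RabbitMQ or pika code: the `DirectExchange` class and its `bind` and `publish` methods are hypothetical names made up for this illustration.

```python
from collections import defaultdict


class DirectExchange:
    """Toy in-memory model of direct-exchange routing (illustration only)."""

    def __init__(self):
        # binding key -> names of the queues bound with that key
        self.bindings = defaultdict(set)
        # queue name -> messages delivered to that queue
        self.queues = defaultdict(list)

    def bind(self, queue, binding_key):
        self.bindings[binding_key].add(queue)

    def publish(self, routing_key, body):
        # Deliver only to queues whose binding key equals the routing key.
        targets = self.bindings.get(routing_key, set())
        for queue in targets:
            self.queues[queue].append(body)
        return sorted(targets)


exchange = DirectExchange()
exchange.bind("direct_key", "k1")
exchange.bind("direct_key", "k2")

routed_k1 = exchange.publish("k1", "22222222")
routed_k3 = exchange.publish("k3", "dropped")  # no queue is bound with k3
```

In this model a message whose routing key has no matching binding reaches no queue at all, which is why the binding step matters for a named exchange.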

Receiver

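# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()

channel.exchange_declare(exchange='direct_logs', exchange_type='direct')
result = channel.queue_declare(queue="direct_key", durable=True)
# A named direct exchange only routes to queues bound with a matching key.
channel.queue_bind(exchange='direct_logs', queue=result.method.queue, routing_key='k1')
channel.queue_bind(exchange='direct_logs', queue=result.method.queue, routing_key='k2')


def callback(ch, method, properties, body):
    # body arrives as bytes; decode it before printing
    print(" [x] Received %s routing_key %s" % (body.decode("utf-8"), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()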

Sender

# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='direct_logs', exchange_type='direct')
# delivery_mode=2 marks the message as persistent
channel.basic_publish(exchange='direct_logs',
                      routing_key='k1',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
channel.basic_publish(exchange='direct_logs',
                      routing_key='k2',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
# Close the connection so buffered frames are flushed before exit
connection.close()

Fanout

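  • The routing_key is ignored in this mode; any value supplied by the publisher has no effect on routing.
  • The Exchange must be bound to its Queues in advance. One Exchange can be bound to many Queues, and one Queue can be bound to many Exchanges.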
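
The many-to-many relationship between exchanges and queues can be sketched with a small in-memory model. This is an illustration only, not pika code: the `bind` and `publish` helpers, the second exchange `fanout_audit`, and the queue names `queue_a` and `queue_b` are all hypothetical.

```python
from collections import defaultdict

bindings = defaultdict(set)  # exchange name -> names of bound queues
queues = defaultdict(list)   # queue name -> delivered messages


def bind(exchange, queue):
    bindings[exchange].add(queue)


def publish(exchange, routing_key, body):
    # Fanout: the routing key is accepted but never consulted.
    for queue in bindings[exchange]:
        queues[queue].append(body)


# One exchange bound to two queues, and one queue bound to two exchanges.
bind("fanout_logs", "queue_a")
bind("fanout_logs", "queue_b")
bind("fanout_audit", "queue_b")

publish("fanout_logs", "k1", "from logs")
publish("fanout_audit", "k2", "from audit")
```

Here `queue_a` receives only the message sent to `fanout_logs`, while `queue_b` receives both, regardless of the routing keys used.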

Receiver

# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='fanout_logs', exchange_type='fanout')
# An empty queue name asks the broker to generate one
result = channel.queue_declare(queue='', durable=True)
channel.queue_bind(exchange='fanout_logs', queue=result.method.queue)


def callback(ch, method, properties, body):
    # body arrives as bytes; decode it before printing
    print(" [x] Received %s routing_key %s" % (body.decode("utf-8"), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()

Sender

# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='fanout_logs', exchange_type='fanout')
# The routing keys below are ignored by the fanout exchange
channel.basic_publish(exchange='fanout_logs',
                      routing_key='k1',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
channel.basic_publish(exchange='fanout_logs',
                      routing_key='k2',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
channel.basic_publish(exchange='fanout_logs',
                      routing_key='k3',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
# Close the connection so buffered frames are flushed before exit
connection.close()

Result

 [x] Received 22222222 routing_key k1
 [x] Received 22222222 routing_key k2
 [x] Received 22222222 routing_key k3

Topic

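  • This mode requires a routing_key, and the Exchange must also be bound to the Queue in advance.
  • Each binding supplies a pattern for the topics the queue cares about. Words are separated by dots; * matches exactly one word and # matches zero or more words. For example, *.log.* matches three-word keys whose middle word is log, so a message with routing_key "a.log.error" is forwarded to the queue.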
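
The wildcard rules are easy to get wrong, so here is a small pure-Python sketch of topic matching. It is an illustration of the rules (`*` for exactly one word, `#` for zero or more words), not the broker's actual implementation; the function name `topic_matches` is made up for this example.

```python
def topic_matches(pattern, routing_key):
    """Return True if routing_key matches an AMQP-style topic pattern.

    Words are separated by dots. '*' stands for exactly one word,
    '#' for zero or more words.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: succeed only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb anywhere from 0 to len(k) words.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


checks = {
    "a.log.error": topic_matches("*.log.*", "a.log.error"),
    "ad.db.cc": topic_matches("*.db.cc", "ad.db.cc"),
    "log.error": topic_matches("*.log.*", "log.error"),
    "a.b.log.error": topic_matches("*.log.*", "a.b.log.error"),
}
```

Note that `*.log.*` does not match `log.error` or `a.b.log.error`, because each `*` must consume exactly one word; a pattern such as `#.log.#` would accept both.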

Receiver

# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange="topic_logs", exchange_type='topic')
# An empty queue name asks the broker to generate one
result = channel.queue_declare(queue='', durable=True)
# One queue may carry several binding patterns
channel.queue_bind(exchange="topic_logs", queue=result.method.queue, routing_key="*.log.*")
channel.queue_bind(exchange="topic_logs", queue=result.method.queue, routing_key="*.db.cc")


def callback(ch, method, properties, body):
    # body arrives as bytes; decode it before printing
    print(" [x] Received %s routing_key %s" % (body.decode("utf-8"), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()

Sender

# -*- coding: utf-8 -*-
# Python 3, pika 1.x API
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='topic_logs', exchange_type='topic')
# The first two keys match *.log.*, the third matches *.db.cc
channel.basic_publish(exchange='topic_logs',
                      routing_key='user.log.error',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
channel.basic_publish(exchange='topic_logs',
                      routing_key='user.log.success',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
channel.basic_publish(exchange='topic_logs',
                      routing_key='ad.db.cc',
                      body="22222222",
                      properties=pika.BasicProperties(
                          delivery_mode=2,
                      ))
# Close the connection so buffered frames are flushed before exit
connection.close()

A Simple Distributed Crawler Built on RabbitMQ

Architecture

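  1. The Download process is responsible for downloading pages.
  2. ParseBase listens for the download-complete messages published by Download and parses the pages (URL, EMAIL, ...).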
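
The two stages can be sketched offline as follows. This is not the code from the linked repository: a `queue.Queue` stands in for the RabbitMQ queue between the stages, and the `download`, `parse_next`, and `fake_fetch` functions, the JSON message layout, and the deliberately simple regular expressions are all assumptions made for illustration.

```python
import json
import queue
import re

# Deliberately simple patterns; real-world URLs and addresses are messier.
URL_RE = re.compile(r'href=["\'](https?://[^"\'\s>]+)["\']', re.IGNORECASE)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

# Stand-in for the RabbitMQ queue that connects the two stages.
downloaded = queue.Queue()


def download(url, fetch):
    """Download stage: fetch a page and announce it as a JSON message."""
    html = fetch(url)
    downloaded.put(json.dumps({"url": url, "html": html}))


def parse_next():
    """Parse stage: take one download-complete message and extract data."""
    message = json.loads(downloaded.get())
    html = message["html"]
    return {
        "url": message["url"],
        "urls": sorted(set(URL_RE.findall(html))),
        "emails": sorted(set(EMAIL_RE.findall(html))),
    }


def fake_fetch(url):
    # Canned page so the sketch runs without network access.
    return '<a href="http://example.com/next">next</a> mail: admin@example.com'


download("http://example.com/", fake_fetch)
result = parse_next()
```

In a real deployment the two functions would run in separate processes, with `downloaded.put` and `downloaded.get` replaced by publishing to and consuming from a broker queue, and the extracted URLs fed back to the download stage.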

Processes are managed with supervisor.
Code is deployed with a fabfile.

Simple version of the code

https://github.com/neo-hu/rabbitmq-crawler

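Full version

Download: adjustable request rate, proxy settings (for getting around the firewall)
Page parsing: keywords, word-segmentation statistics, and so on
Web admin page and other features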
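
As one possible reading of the "adjustable request rate" item, the sketch below enforces a minimum interval between downloads. The full version's code is not shown in this post, so everything here is an assumption: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments (added so the example runs instantly with a fake clock) are hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval, in seconds, between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, None before the first

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Called too soon: sleep for the rest of the interval.
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Demonstration with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()     # first call passes straight through
fake_now[0] += 0.5  # only half a second elapses
throttle.wait()     # must wait out the remaining 1.5 seconds
```

A download worker would call `throttle.wait()` before each request; changing `min_interval` adjusts the crawl rate.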
