Architecture diagram
Messages produced by the producer are pushed to a topic in the Kafka cluster. KafkaSpout subscribes to that topic and passes each message it receives to WordSplitBolt. WordSplitBolt processes the message and passes it on to WordCountBolt, which in turn can either pass its results further downstream or push them back to the Kafka cluster for consumers to process. Data can flow in both directions between the Storm and Kafka clusters. (There are plenty of detailed guides online for setting up Kafka and Storm clusters, so that is not covered here.)
Source code
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.com.dimensoft</groupId>
<artifactId>storm-kafka</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>storm-kafka</name>
<url>http://maven.apache.org</url>
<repositories>
<repository>
<id>clojars.org</id>
<url>http://clojars.org/repo</url>
</repository>
</repositories>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<!-- storm -->
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>0.9.5</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>3.0.3</version>
</dependency>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>13.0</version>
</dependency>
<!-- kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
<!-- This exclusion is required; otherwise a jar conflict prevents the topology from running -->
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- storm-kafka -->
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka</artifactId>
<version>0.9.5</version>
</dependency>
</dependencies>
</project>
Note: be sure to exclude the slf4j-log4j12 dependency; otherwise the topology will fail both locally and on the cluster. The local error looks like this (the error when submitting to the cluster is similar):
java.lang.NoClassDefFoundError: Could not initialize class org.apache.log4j.Log4jLoggerFactory
at org.apache.log4j.Logger.getLogger(Logger.java:39) ~[log4j-over-slf4j-1.6.6.jar:1.6.6]
at kafka.utils.Logging$class.logger(Logging.scala:24) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.logger$lzycompute(SimpleConsumer.scala:30) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.logger(SimpleConsumer.scala:30) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.utils.Logging$class.info(Logging.scala:67) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.info(SimpleConsumer.scala:30) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:74) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:127) ~[kafka_2.10-0.8.2.1.jar:na]
at kafka.javaapi.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:79) ~[kafka_2.10-0.8.2.1.jar:na]
at storm.kafka.KafkaUtils.getOffset(KafkaUtils.java:77) ~[storm-kafka-0.9.5.jar:0.9.5]
at storm.kafka.KafkaUtils.getOffset(KafkaUtils.java:67) ~[storm-kafka-0.9.5.jar:0.9.5]
at storm.kafka.PartitionManager.<init>(PartitionManager.java:83) ~[storm-kafka-0.9.5.jar:0.9.5]
at storm.kafka.ZkCoordinator.refresh(ZkCoordinator.java:98) ~[storm-kafka-0.9.5.jar:0.9.5]
at storm.kafka.ZkCoordinator.getMyManagedPartitions(ZkCoordinator.java:69) ~[storm-kafka-0.9.5.jar:0.9.5]
at storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:135) ~[storm-kafka-0.9.5.jar:0.9.5]
at backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:565) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
8789 [Thread-13-kafkaSpout] ERROR backtype.storm.util - Halting process: ("Worker died")
java.lang.RuntimeException: ("Worker died")
at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:325) [storm-core-0.9.5.jar:0.9.5]
at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.worker$fn__4694$fn__4695.invoke(worker.clj:493) [storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.executor$mk_executor_data$fn__3272$fn__3273.invoke(executor.clj:240) [storm-core-0.9.5.jar:0.9.5]
at backtype.storm.util$async_loop$fn__460.invoke(util.clj:473) [storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
WordCountTopology
WordCountTopology is the program entry point. It defines the complete topology and how it runs, either locally or on a cluster:
package cn.com.dimensoft.storm;
import cn.com.dimensoft.constant.Constant;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
/**
*
* class: WordCountTopology
* package: cn.com.dimensoft.storm
* author:zxh
* time: 2015-10-08 14:06:55
* description:
*/
public class WordCountTopology {
/**
*
* name:main
* author:zxh
* time:2015-10-08 14:07:02
* description:
* @param args
* @throws AlreadyAliveException
* @throws InvalidTopologyException
* @throws InterruptedException
*/
public static void main(String[] args) throws AlreadyAliveException,
InvalidTopologyException, InterruptedException {
TopologyBuilder builder = new TopologyBuilder();
// The BrokerHosts interface has two implementations, StaticHosts and ZkHosts.
// ZkHosts periodically (every 60 seconds by default) refreshes broker information from ZooKeeper; StaticHosts does not.
// Note that the second argument, brokerZkPath, must match zookeeper.connect in Kafka's server.properties,
// because the brokers znode has to be found in ZooKeeper.
// By default Kafka stores the brokers znode under the ZooKeeper root.
BrokerHosts brokerHosts = new ZkHosts(Constant.ZOOKEEPER_STRING,
Constant.ZOOKEEPER_PTAH);
// Define the SpoutConfig:
// 1st argument hosts is the brokerHosts defined above
// 2nd argument topic is the name of the topic this spout subscribes to
// 3rd argument zkRoot is where the consumed offset is stored (in ZooKeeper); after the topology
// restarts from a failure, messages not consumed during the outage are consumed instead of lost (configurable)
// 4th argument id is the unique identifier of this spout
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, //
Constant.TOPIC, //
"/" + Constant.TOPIC, //
"wc");
// Define how kafkaSpout deserializes data: each message sent by the Kafka producer
// is emitted as a String field named "str", the field name defined by StringScheme
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
// Set the spout
builder.setSpout("kafkaSpout", new KafkaSpout(spoutConfig));
// Set a bolt
builder.setBolt("WordSplitBolt", //
new WordSplitBolt()).//
shuffleGrouping("kafkaSpout");
// Set a bolt
builder.setBolt("WordCountBolt", //
new WordCountBolt()).//
fieldsGrouping("WordSplitBolt", new Fields("word"));
// Run locally or submit to the cluster
if (args != null && args.length == 1) {
// Run on the cluster
StormSubmitter.submitTopology(args[0], //
new Config(), //
builder.createTopology());
} else {
// Run locally
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("local", //
new Config(),//
builder.createTopology());
// For easier testing the cluster is not shut down here
Thread.sleep(10000000);
// cluster.shutdown();
}
}
}
WordSplitBolt
WordSplitBolt first takes the message passed along by KafkaSpout, splits it on spaces into individual words, and emits each word:
/**
* project:storm-test
* file:WordSplitBolt.java
* author:zxh
* time:2015-09-23 14:29:12
* description:
*/
package cn.com.dimensoft.storm;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
/**
* class: WordSplitBolt
* package: cn.com.dimensoft.storm
* author:zxh
* time: 2015-09-23 14:29:12
* description:
*/
public class WordSplitBolt extends BaseBasicBolt {
/**
* long:serialVersionUID
* description:
*/
private static final long serialVersionUID = -1904854284180350750L;
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
// Get the value passed from the spout by field name; "str" is the field name defined in the spout
String line = input.getStringByField("str");
// Split the line into words
for (String word : line.split(" ")) {
// Pass each word to the next component, WordCountBolt
collector.emit(new Values(word));
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// Declare the name of the field emitted by this bolt
declarer.declare(new Fields("word"));
}
}
WordCountBolt
WordCountBolt receives the words passed along by WordSplitBolt and counts word frequencies:
/**
* project:storm-test
* file:WordCountBolt.java
* author:zxh
* time:2015-09-23 14:29:39
* description:
*/
package cn.com.dimensoft.storm;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
/**
* class: WordCountBolt
* package: cn.com.dimensoft.storm
* author:zxh
* time: 2015-09-23 14:29:39
* description:
*/
public class WordCountBolt extends BaseBasicBolt {
public Logger log = LoggerFactory.getLogger(WordCountBolt.class);
/**
* long:serialVersionUID
* description:
*/
private static final long serialVersionUID = 7683600247870291231L;
private static Map<String, Integer> map = new HashMap<String, Integer>();
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
// Get the data passed from the previous bolt by field name
String word = input.getStringByField("word");
Integer count = map.get(word);
if (count == null) {
map.put(word, 1);
} else {
count ++;
map.put(word, count);
}
StringBuilder msg = new StringBuilder();
for(Entry<String, Integer> entry : map.entrySet()){
msg.append(entry.getKey() + " = " + entry.getValue()).append(", ");
}
log.info(msg.toString());
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
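Outside of Storm, the combined effect of the two bolts can be sketched in plain Java (a standalone illustration that reuses the same three test lines sent later through the console producer; `WordCountDemo` is a hypothetical class, not part of the topology):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountDemo {
    public static void main(String[] args) {
        // The same lines the console producer sends in the test section
        String[] lines = {
            "hadoop is a good technology",
            "hadoop and hbase",
            "today is a good day"
        };
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            // WordSplitBolt: split each message on spaces
            for (String word : line.split(" ")) {
                // WordCountBolt: increment the running count per word
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
            }
        }
        System.out.println(counts); // e.g. hadoop=2, is=2, a=2, good=2, day=1, ...
    }
}
```

The final counts match the last log line shown in the test section below (hadoop = 2, is = 2, a = 2, good = 2, and one each for the remaining words). Note that `line.split(" ")` yields empty strings for consecutive spaces; splitting on `"\\s+"` would be more robust.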
Constant
The Constant class defines the constants used above:
/**
* project:storm-kafka
* file:Constant.java
* author:zxh
* time:2015-10-08 14:06:29
* description:
*/
package cn.com.dimensoft.constant;
/**
* class: Constant
* package: cn.com.dimensoft.constant
* author:zxh
* time: 2015-10-08 14:06:29
* description:
*/
public class Constant {
// Topic name
public static final String TOPIC = "storm-kafka-test";
// ZooKeeper connection string
public static final String ZOOKEEPER_STRING = "hadoop-main.dimensoft.com.cn:2181,"
+ "hadoop-slave1.dimensoft.com.cn:2181,"
+ "hadoop-slave2.dimensoft.com.cn:2181";
// Location of the brokers in ZooKeeper
public static final String ZOOKEEPER_PTAH = "/kafka/brokers";
}
Testing
Create the topic
$ bin/kafka-topics.sh --create --zookeeper hadoop-main.dimensoft.com.cn:2181,hadoop-slave1.dimensoft.com.cn:2181,hadoop-slave2.dimensoft.com.cn:2181/kafka --partitions 2 --replication-factor 3 --topic storm-kafka-test
Inspect the topic
$ bin/kafka-topics.sh --describe --zookeeper hadoop-main.dimensoft.com.cn:2181,hadoop-slave1.dimensoft.com.cn:2181,hadoop-slave2.dimensoft.com.cn:2181/kafka --topic storm-kafka-test
// Result
Topic:storm-kafka-test PartitionCount:2 ReplicationFactor:3 Configs:
Topic: storm-kafka-test Partition: 0 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Topic: storm-kafka-test Partition: 1 Leader: 2 Replicas: 2,1,0 Isr: 2,1,0
Run the WordCountTopology class directly in Eclipse, which makes debugging easier. Once it is running, use Kafka's built-in console producer to push data to the storm-kafka-test topic:
bin/kafka-console-producer.sh --broker-list hadoop-main.dimensoft.com.cn:9092,hadoop-slave1.dimensoft.com.cn:9092,hadoop-slave2.dimensoft.com.cn:9092 --topic storm-kafka-test
Send the following lines to Kafka (press Enter after each line):
hadoop is a good technology
hadoop and hbase
today is a good day
Observe the Eclipse console output:
223538 [Thread-17-WordCountBolt] INFO cn.com.dimensoft.storm.WordCountBolt - hadoop = 2, is = 1, technology = 1, hbase = 1, a = 1, today = 1, good = 1, and = 1,
223538 [Thread-17-WordCountBolt] INFO cn.com.dimensoft.storm.WordCountBolt - hadoop = 2, is = 2, technology = 1, hbase = 1, a = 1, today = 1, good = 1, and = 1,
223538 [Thread-17-WordCountBolt] INFO cn.com.dimensoft.storm.WordCountBolt - hadoop = 2, is = 2, technology = 1, hbase = 1, a = 2, today = 1, good = 1, and = 1,
223538 [Thread-17-WordCountBolt] INFO cn.com.dimensoft.storm.WordCountBolt - hadoop = 2, is = 2, technology = 1, hbase = 1, a = 2, today = 1, good = 2, and = 1,
223539 [Thread-17-WordCountBolt] INFO cn.com.dimensoft.storm.WordCountBolt - hadoop = 2, is = 2, technology = 1, hbase = 1, a = 2, today = 1, good = 2, day = 1, and = 1,
Note: when the topology runs locally it does not store the consumed storm-kafka-test offset in ZooKeeper; the offset is only stored in ZooKeeper once the topology is submitted to a Storm cluster. So if the locally running topology dies while the producer keeps pushing messages to storm-kafka-test, the messages pushed during that window are lost when the topology restarts. You can verify this yourself, then submit the topology to the Storm cluster, kill it, keep pushing data with the producer, and start it again: the messages pushed during the outage are consumed after the restart instead of being lost.
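After submitting to the cluster, the committed offsets can be inspected directly in ZooKeeper. A sketch with zkCli.sh, assuming the storm-kafka layout of zkRoot plus spout id as configured in SpoutConfig ("/storm-kafka-test" and "wc"); exact znode names may differ by version:

```shell
# Connect to one of the ZooKeeper servers
bin/zkCli.sh -server hadoop-main.dimensoft.com.cn:2181

# KafkaSpout commits one child znode per partition under <zkRoot>/<spout id>
ls /storm-kafka-test/wc

# Each znode holds JSON including the last committed offset for that partition
get /storm-kafka-test/wc/partition_0
```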
Custom producer
The tests above used Kafka's built-in producer, but in real business scenarios we write our own producer for the actual use case. The overall flow is the same as above; only the built-in producer is replaced with a custom one. The custom producer pushes messages to the storm-kafka-test topic, and as soon as a message arrives KafkaSpout receives and processes it.
SampleProducer
SampleProducer simply reads user input from the console and pushes it to storm-kafka-test in the Kafka cluster; Storm then processes the messages by subscribing to the storm-kafka-test topic:
/**
* project:kafka-study
* file:SampleProducer.java
* author:zxh
* time:2015-09-25 16:05:51
* description:
*/
package cn.com.dimensoft.kafka;
import java.util.Properties;
import java.util.Scanner;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import cn.com.dimensoft.constant.Constant;
/**
* class: SampleProducer
* package: cn.com.dimensoft.kafka
* author:zxh
* time: 2015-09-25 16:05:51
* description:
* step1 : create a Properties object holding the configuration
* step2 : wrap the Properties in a ProducerConfig
* step3 : create the Producer object
* step4 : send the data
*/
public class SampleProducer {
@SuppressWarnings("resource")
public static void main(String[] args) throws InterruptedException {
// step1 : create a Properties object holding the configuration
Properties props = new Properties();
// Specify the broker cluster
props.put("metadata.broker.list", //
"hadoop-main.dimensoft.com.cn:9092,"
+ "hadoop-slave1.dimensoft.com.cn:9092,"
+ "hadoop-slave2.dimensoft.com.cn:9092");
/**
* The ack mechanism:
* 0 which means that the producer never waits for an acknowledgement from the broker
* 1 which means that the producer gets an acknowledgement after the leader replica has received the data
* -1 The producer gets an acknowledgement after all in-sync replicas have received the data
*/
props.put("request.required.acks", "1");
// Send mode: sync or async
props.put("producer.type", "sync");
// Message serializer class; the default is kafka.serializer.DefaultEncoder
props.put("serializer.class", "kafka.serializer.StringEncoder");
// Set the custom partitioner that decides how messages are partitioned when the topic has multiple partitions
props.put("partitioner.class", "cn.com.dimensoft.kafka.SamplePartition");
// step2 : wrap the Properties in a ProducerConfig
ProducerConfig config = new ProducerConfig(props);
// step3 : create the Producer object
Producer<String, String> producer = new Producer<String, String>(config);
Scanner sc = new Scanner(System.in);
for (int i = 1; i <= 10; i++) {
// step4 : send the data
// producer.send(new KeyedMessage<String, String>(Constant.TOPIC, //
// i + "", //
// String.valueOf("我是 " + i + " 號")));
Thread.sleep(1000);
producer.send(new KeyedMessage<String, String>(Constant.TOPIC, sc.next()));
}
}
}
SamplePartition
SamplePartition is a custom partitioner used to assign messages to partitions:
/**
* project:kafka-study
* file:SamplePartition.java
* author:zxh
* time:2015-09-28 17:37:19
* description:
*/
package cn.com.dimensoft.kafka;
import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;
/**
* class: SamplePartition
* package: cn.com.dimensoft.kafka
* author:zxh
* time: 2015-09-28 17:37:19
* description: a custom partitioner that specifies how messages are assigned when the topic has multiple partitions
*/
public class SamplePartition implements Partitioner {
/**
* constructor
* author:zxh
* @param verifiableProperties
* description: removing this constructor makes the producer fail at startup with a NoSuchMethodException
*/
public SamplePartition(VerifiableProperties verifiableProperties) {
}
@Override
/**
* The partitioning here simply takes the key (the K in Producer[K,V]) modulo the number of partitions
*/
public int partition(Object obj, int partitions) {
// Key modulo the number of partitions
return Integer.parseInt(obj.toString()) % partitions;
}
}
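The effect of this partitioner can be checked with a small standalone snippet (a sketch of the same modulo logic, independent of the Kafka API; `PartitionDemo` is a hypothetical class). It also shows why this scheme assumes numeric string keys: a non-numeric key makes Integer.parseInt throw a NumberFormatException.

```java
public class PartitionDemo {
    // Same logic as SamplePartition.partition: key modulo partition count
    static int partition(Object key, int partitions) {
        return Integer.parseInt(key.toString()) % partitions;
    }

    public static void main(String[] args) {
        int partitions = 2; // storm-kafka-test was created with 2 partitions
        for (int i = 1; i <= 4; i++) {
            System.out.println("key " + i + " -> partition " + partition(String.valueOf(i), partitions));
        }
        // key 1 -> partition 1
        // key 2 -> partition 0
        // key 3 -> partition 1
        // key 4 -> partition 0
    }
}
```

Note that the console-input send in SampleProducer passes no key, and the old Scala producer only invokes the partitioner for keyed messages (unkeyed messages go to a partition chosen by the producer itself), so this partitioner really takes effect with keyed sends like the commented-out KeyedMessage call that passes i as the key.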
To test, package WordCountTopology and submit it to the Storm cluster, open the Storm worker log, then run the SampleProducer program in Eclipse and feed data through the Eclipse console while watching the Storm worker log output.
After the topology is submitted to the Storm cluster, you can see that ZooKeeper stores a znode with the consumed offset of storm-kafka-test, so messages are not lost even after the topology restarts from a failure.
Storm worker log output