Spark Streaming + Kafka + Logstash log analysis

I. Install ZooKeeper

1. Download ZooKeeper

Copy the installation package to the target directory, extract it, and rename the extracted directory to zookeeper. My directory is /home/metaq:
tar -zxvf zookeeper-3.4.6.tar.gz
mv zookeeper-3.4.6 zookeeper

2. Modify the ZooKeeper configuration file

On one of the machines (bigdatasvr01), after extracting zookeeper-3.4.6.tar.gz, edit the configuration file conf/zoo.cfg as follows:
tickTime=2000  
dataDir=/home/metaq/zookeeper/data
dataLogDir=/home/metaq/zookeeper/log
clientPort=2181  
initLimit=5  
syncLimit=2  
server.1=bigdatasvr01:2888:3888  
server.2=bigdatasvr02:2888:3888  
server.3=bigdatasvr03:2888:3888  
Once the configuration on bigdatasvr01 is done, copy the installation directory to the corresponding location on the other two ZooKeeper servers:
cd /home/metaq
scp -r zookeeper metaq@bigdatasvr02:/home/metaq
scp -r zookeeper metaq@bigdatasvr03:/home/metaq
Set myid: create a file named myid under the path specified by dataDir; it contains a single number that identifies the current host. The Num in the server.Num entries of conf/zoo.cfg must equal the number in that server's myid file. For example, for server.1=bigdatasvr01:2888:3888, the myid file on bigdatasvr01 contains 1:
[metaq@bigdatasvr01 data]$ echo "1" > /home/metaq/zookeeper/data/myid
[metaq@bigdatasvr02 data]$ echo "2" > /home/metaq/zookeeper/data/myid
[metaq@bigdatasvr03 data]$ echo "3" > /home/metaq/zookeeper/data/myid

3. Start ZooKeeper

On every node of the ZooKeeper cluster, run the script that starts the ZooKeeper service:
[metaq@bigdatasvr01 zookeeper]$ bin/zkServer.sh start
[metaq@bigdatasvr02 zookeeper]$ bin/zkServer.sh start
[metaq@bigdatasvr03 zookeeper]$ bin/zkServer.sh start

4. Verify the installation

You can check the cluster status with the ZooKeeper script; each node in the cluster takes one of the roles leader or follower. The query result on each node is shown below:
[metaq@bigdatasvr01 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[metaq@bigdatasvr02 zookeeper]$ bin/zkServer.sh  status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[metaq@bigdatasvr03 zookeeper]$ bin/zkServer.sh  status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: leader
Problems encountered with ZooKeeper: if the cluster fails to start because the nodes cannot reach each other, add the hostnames and IPs of the other cluster nodes to each host's /etc/hosts; a firewall that has not been disabled can cause the same problem.
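A minimal sketch of the fix, assuming bigdatasvr01/02/03 map to 192.168.1.107/108/109 (the IPs used in the Kafka section below); adjust to your actual addresses and OS:

# append to /etc/hosts on every node (hypothetical IP mapping)
192.168.1.107 bigdatasvr01
192.168.1.108 bigdatasvr02
192.168.1.109 bigdatasvr03

# disable the firewall (CentOS 6 style; on CentOS 7 use systemctl stop firewalld)
service iptables stop
chkconfig iptables off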

II. Install Kafka

Kafka download: https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.1/kafka_2.10-0.8.1.tgz
Install Kafka on each of the three servers:
tar zxvf kafka_2.10-0.8.1.tgz

1. Modify the configuration file

Modify config/server.properties on every server (a complete single-broker example follows this list):
broker.id: must be unique across brokers; use a number
host.name: the server's own IP, unique per broker
zookeeper.connect=192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181
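A minimal server.properties sketch for the broker on 192.168.1.107; the other two brokers differ only in broker.id and host.name. The port and log.dirs values shown are the defaults, not something mandated by this setup:

broker.id=0
host.name=192.168.1.107
port=9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181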

2. Start Kafka

Then run on every machine: bin/kafka-server-start.sh -daemon config/server.properties
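As a quick check (not part of the original steps), you can confirm that all three brokers registered themselves in ZooKeeper by listing /brokers/ids with the ZooKeeper client installed earlier:

[metaq@bigdatasvr01 zookeeper]$ bin/zkCli.sh -server 192.168.1.107:2181 ls /brokers/ids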

3. Create a topic

bin/kafka-topics.sh --create --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181 --replication-factor 1 --partitions 3 --topic mykafka_test

4. List topics

bin/kafka-topics.sh --list --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181

5. Show topic details

bin/kafka-topics.sh --describe --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181

6. Send messages

bin/kafka-console-producer.sh --broker-list 192.168.1.107:9092 --topic mykafka_test

7. Receive messages

bin/kafka-console-consumer.sh --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181 --topic mykafka_test --from-beginning

III. Install Logstash

Download the installation package logstash-5.2.1.tar.gz.

Install Logstash on the designated server:

tar -zxvf logstash-5.2.1.tar.gz

Watch the Tomcat logs and ship them to Kafka

Add the configuration file tomcat_log_to_kafka.conf:

input {
  file {
    type => "apache"
    path => "/home/zeus/apache-tomcat-7.0.72/logs/*"
    exclude => ["*.gz", "*.log", "*.out"]
    sincedb_path => "/dev/null"
  }
}
filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}
output {
  kafka {
    topic_id => "logstash_topic"
    bootstrap_servers => "192.168.1.107:9092,192.168.1.108:9092,192.168.1.109:9092"
    codec => plain {
      format => "%{message}"
    }
  }
}

Start it: bin/logstash -f tomcat_log_to_kafka.conf --config.reload.automatic

--config.reload.automatic reloads the configuration automatically, so Logstash does not need to be stopped and restarted after every configuration change.

The grok filter parses each raw log line into structured fields using regular expressions; here the built-in %{COMBINEDAPACHELOG} pattern matches the Apache/Tomcat combined access-log format.
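Before wiring the output to Kafka, it can help to verify the grok parsing locally. A minimal sketch: keep the same input and filter sections but swap the output for the standard stdout plugin (the file name tomcat_log_to_stdout.conf is just a suggestion):

output {
  stdout { codec => rubydebug }
}

Run it with bin/logstash -f tomcat_log_to_stdout.conf and watch the parsed clientip, verb, request, response and bytes fields printed for each new log line.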

IV. Reading Kafka messages with Spark Streaming

The code is as follows:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object ApacheLogAnalysis {
  // Regex for the Apache combined log prefix (same fields as %{COMBINEDAPACHELOG} up to the bytes field)
  val LOG_ENTRY_PATTERN = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\w+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)".r

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }
    val sparkConf = new SparkConf().setMaster(masterUrl).setAppName("ApacheLogAnalysis")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    //ssc.checkpoint(".") // checkpointing is mandatory when updateStateByKey is used
    // topic
    val topics = Set(ResourcesUtil.getValue(Constants.KAFKA_TOPIC_NAME))
    // Kafka broker list
    val brokerList = ResourcesUtil.getValue(Constants.KAFKA_HOST_PORT)
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerList,
      "serializer.class" -> "kafka.serializer.StringEncoder"
    )
    // connect to Kafka and create the direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    val events = kafkaStream.flatMap { case (_, line) =>
      // parse the Apache log line with the regex, e.g.
      // 192.168.1.249 - - [23/Jun/2017:12:48:43 +0800] "POST /zeus/zeus_platform/user.rpc HTTP/1.1" 200 99
      line match {
        case LOG_ENTRY_PATTERN(clientip, ident, auth, timestamp, verb, request, httpversion, response, bytes) =>
          val logEntryMap = mutable.Map.empty[String, String]
          logEntryMap("clientip") = clientip
          logEntryMap("ident") = ident
          logEntryMap("auth") = auth
          logEntryMap("timestamp") = timestamp
          logEntryMap("verb") = verb
          logEntryMap("request") = request
          logEntryMap("httpversion") = httpversion
          logEntryMap("response") = response
          logEntryMap("bytes") = bytes
          Some(logEntryMap)
        case _ => None // skip lines that do not match the expected log format
      }
    }
    events.print()
    // count requests per URL within each batch
    val requestUrls = events.map(x => (x("request"), 1L)).reduceByKey(_ + _)
    requestUrls.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val requestUrl = pair._1
          val clickCount = pair._2
          println(s"=================requestUrl count====================  requestUrl:${requestUrl} clickCount:${clickCount}.")
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
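To run the job, the Kafka integration for Spark Streaming must be on the classpath. A minimal sketch of the build, assuming Spark 1.6 with Scala 2.10 (matching the kafka_2.10-0.8.1 download above); the jar name apache-log-analysis.jar is a placeholder:

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3"
)

Then submit with, for example:
bin/spark-submit --class ApacheLogAnalysis apache-log-analysis.jar
With no argument the code falls back to local[2]; pass a master URL as the first argument to run on a cluster.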

