This article walks through, with an example, the steps for writing data from Spark Streaming into a Kerberos-authenticated HBase cluster.
1. Add the required dependencies
Component versions in the production environment:
- hadoop 3.1.4
- spark 2.4.4
- hbase 2.1.6
The corresponding pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.qingzhongli</groupId>
<artifactId>kafka-to-hbase-sparkstreaming</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<hbase.version>2.1.6</hbase.version>
<hadoop.version>3.1.4</hadoop.version>
<spark.version>2.4.4</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<artifactId>hadoop-common</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
<exclusion>
<artifactId>hadoop-auth</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<artifactId>commons-cli</artifactId>
<groupId>commons-cli</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<artifactId>hadoop-client</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Compile the Scala code into the jar -->
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Copy the dependency jars -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>prepare-package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<!-- ${project.build.directory} is a built-in Maven variable; it defaults to target -->
<outputDirectory>${project.build.directory}/dest/${project.artifactId}/lib</outputDirectory>
<!--
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
-->
<!-- Whether to exclude transitive dependencies -->
<excludeTransitive>false</excludeTransitive>
<!-- Whether to strip the version from the copied jar file names -->
<stripVersion>false</stripVersion>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- addClasspath tells maven-jar-plugin whether to add a Class-Path entry to MANIFEST.MF listing all dependencies -->
<addClasspath>false</addClasspath>
<classpathPrefix></classpathPrefix>
</manifest>
<!-- Do not include the project's pom.xml under META-INF inside the jar -->
<addMavenDescriptor>false</addMavenDescriptor>
</archive>
<outputDirectory>${project.build.directory}/dest/${project.artifactId}/lib</outputDirectory>
</configuration>
</plugin>
</plugins>
</build>
<!-- To build, cd to the directory containing this pom and run: mvn clean package -->
</project>
Note: spark-core_2.11 itself depends on hadoop-client, but on version 2.6.5, which conflicts with hadoop 3.1.4. It must be excluded (as done in the spark-core dependency above); otherwise the Spark job will fail at runtime.
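To confirm the exclusion took effect, you can inspect the resolved dependency tree with standard Maven (the -Dincludes filter narrows the output to hadoop-client):
mvn dependency:tree -Dincludes=org.apache.hadoop:hadoop-client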
2. Prepare the configuration files
Obtain the HBase-related configuration files from the production cluster, including:
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
and the Kerberos authentication file:
- qingzhongli.keytab
At submit time, ship all five files (core-site.xml, hbase-site.xml, hdfs-site.xml, qingzhongli.keytab, and jaas.conf) with the --files option. They are placed in each YARN container's working directory, which is on the classpath, so the program can load them directly by name, for example:
val coreSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("core-site.xml")
The jaas.conf file used for consuming Kafka (keyTab is a bare file name because it resolves against the same working directory):
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
serviceName="kafka"
keyTab="qingzhongli.keytab"
principal="[email protected]";
};
3. Implementation
The program entry point, Main:
package com.qingzhongli.kafka2hbase.sparkstreaming
import com.qingzhongli.kafka2hbase.sparkstreaming.util.HBaseUtil
import org.apache.commons.logging.LogFactory
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main {
val logger = LogFactory.getLog(getClass())
def main(args: Array[String]): Unit = {
val brokers = "192.168.37.100:9092,192.168.37.101:9092,192.168.37.102:9092"
val topicsSet = "topic1".split(",").toSet
val groupId = "group1"
val autoOffsetReset = "latest"
val keytabPrincipal = "[email protected]"
val keytabFilePath = "qingzhongli.keytab"
val ss = SparkSession
.builder()
.appName("kafka-to-hbase-sparkstreaming")
.getOrCreate()
val sc = ss.sparkContext
val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
"auto.offset.reset" -> autoOffsetReset,
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean), // offsets are committed manually in storeKafkaOffset
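// Kafka SASL/Kerberos settings; the driver and executor JVMs locate jaas.conf via -Djava.security.auth.login.config (set in the submit script, section 4)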
"security.protocol" -> "SASL_PLAINTEXT",
"sasl.kerberos.service.name" -> "kafka"
)
// check offset range
import collection.JavaConverters._
KafkaUtil.offsetCheck(kafkaParams.asJava, topicsSet.asJava)
try {
// create direct stream
val stream = KafkaUtils.createDirectStream[String, String](
ssc, PreferConsistent, Subscribe[String, String](topicsSet, kafkaParams)
)
// write to HBase
writeHbase(keytabPrincipal, keytabFilePath, stream)
// commit the offsets back to Kafka
storeKafkaOffset(stream)
} catch {
case ex: Exception => logger.error(ex.getMessage, ex)
}
ssc.start()
ssc.awaitTermination()
}
private def writeHbase(keytabPrincipal: String, keytabFilePath: String, stream: InputDStream[ConsumerRecord[String, String]]) = {
stream.map(_.value())
.foreachRDD(rdd => {
rdd.foreachPartition(partitionRecords => {
val connection = HBaseUtil.getHBaseConn(keytabPrincipal, keytabFilePath)
var table: Table = null
try {
val tableName = TableName.valueOf("qingzhongli:test")
table = connection.getTable(tableName)
partitionRecords.foreach(record => {
val cf = "cf1"
// assumes the Kafka message value is a comma-delimited string: "rowkey,f1,f2"
val fields = record.split(",")
val rowKey = fields(0)
val f1 = fields(1)
val f2 = fields(2)
val put: Put = new Put(Bytes.toBytes(rowKey))
//put.setDurability(Durability.SKIP_WAL)
put.addColumn(Bytes.toBytes(cf), Bytes.toBytes("f1"), Bytes.toBytes(f1))
put.addColumn(Bytes.toBytes(cf), Bytes.toBytes("f2"), Bytes.toBytes(f2))
table.put(put)
})
} catch {
case ex: Exception => logger.error(ex.getMessage, ex)
} finally {
if (table != null) {
table.close()
}
if (connection != null) {
connection.close()
}
}
})
})
}
private def storeKafkaOffset(messages: InputDStream[ConsumerRecord[String, String]]) = {
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// some time later, after outputs have completed
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
}
}
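In writeHbase, table.put(put) issues one RPC per record. Below is a minimal batching variant of the partition loop, assuming the same comma-delimited "rowkey,f1,f2" record layout (Table.put also accepts a java.util.List[Put], so the whole partition goes out in far fewer round trips):
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

// sketch: collect the partition's Puts and write them in one batch
val puts = new ListBuffer[Put]()
partitionRecords.foreach { record =>
  val fields = record.split(",") // assumed delimiter
  val put = new Put(Bytes.toBytes(fields(0)))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("f1"), Bytes.toBytes(fields(1)))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("f2"), Bytes.toBytes(fields(2)))
  puts += put
}
table.put(puts.asJava) // one batched call instead of one RPC per record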
KafkaUtil, which checks whether each partition's committed offset still falls inside the range Kafka currently retains and, if not, re-commits an adjusted offset (otherwise the job can fail with an out-of-range offset after retention deletes old segments):
package com.qingzhongli.kafka2hbase.sparkstreaming;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.*;
public class KafkaUtil {
private static final Logger LOG = LoggerFactory.getLogger(KafkaUtil.class);
public static void offsetCheck(Map<String, Object> kafkaParams, Set<String> topics) {
final OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(kafkaParams.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toString().toUpperCase(Locale.ROOT));
if (OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy) || OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
LOG.info("Going to reset consumer offsets");
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaParams);
LOG.info("Fetching current state");
final List<TopicPartition> parts = new LinkedList<>();
final Map<TopicPartition, OffsetAndMetadata> currentCommitted = new HashMap<>();
for (String topic : topics) {
List<PartitionInfo> info = consumer.partitionsFor(topic);
for (PartitionInfo i : info) {
final TopicPartition p = new TopicPartition(topic, i.partition());
final OffsetAndMetadata m = consumer.committed(p);
parts.add(p);
if (m != null) {
LOG.info("consumer[topic:{}-{}, offset:{}]", topic, i.partition(), m.offset());
}
currentCommitted.put(p, m);
}
}
Map<TopicPartition, Long> beginning = new HashMap<>();
Map<TopicPartition, Long> ending = new HashMap<>();
for (String topic : topics) {
List<PartitionInfo> list = consumer.partitionsFor(topic);
for (PartitionInfo pi : list) {
TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
List<TopicPartition> tpList = new ArrayList<>(1);
tpList.add(tp);
consumer.assign(tpList);
consumer.seekToBeginning(tpList);
Long offset = consumer.position(tp);
beginning.put(tp, offset);
LOG.info("producer[topic:{}-{}, beginning offset: {}]", pi.topic(), pi.partition(), offset);
}
}
for (String topic : topics) {
List<PartitionInfo> list = consumer.partitionsFor(topic);
for (PartitionInfo pi : list) {
TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
List<TopicPartition> tpList = new ArrayList<>(1);
tpList.add(tp);
consumer.assign(tpList);
consumer.seekToEnd(tpList);
Long offset = consumer.position(tp);
ending.put(tp, offset);
LOG.info("producer[topic:{}-{}, ending offset: {}]", pi.topic(), pi.partition(), offset);
}
}
LOG.info("Finding what offsets need to be adjusted");
final Map<TopicPartition, OffsetAndMetadata> newCommit = new HashMap<>();
for (TopicPartition part : parts) {
final OffsetAndMetadata m = currentCommitted.get(part);
final Long begin = beginning.get(part);
final Long end = ending.get(part);
if (m == null || m.offset() < begin) {
LOG.info("Adjusting partition {}-{}; OffsetAndMeta={} Beginning={} End={}", part.topic(), part.partition(), m, begin, end);
final OffsetAndMetadata newMeta;
if (OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(begin);
} else if (OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(end);
} else {
newMeta = null;
}
LOG.info("New offset to be {}", newMeta);
if (newMeta != null) {
newCommit.put(part, newMeta);
}
}
}
consumer.commitSync(newCommit);
consumer.close();
}
}
}
HBaseUtil:
package com.qingzhongli.kafka2hbase.sparkstreaming.util
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
import org.apache.hadoop.security.UserGroupInformation
import java.security.PrivilegedAction
/**
* @author qingzhongli.com
*/
object HBaseUtil {
/**
* Performs a Kerberos login from the given keytab and returns an HBase Connection
* created as the logged-in user.
*
* @param principal  the Kerberos principal, e.g. [email protected]
* @param keytabPath path to the keytab file (a bare file name resolves against the container working directory)
* @return an HBase Connection
*/
def getHBaseConn(principal: String, keytabPath: String): Connection = {
val configuration = HBaseConfiguration.create
val coreSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("core-site.xml")
val hdfsSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("hdfs-site.xml")
val hbaseSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("hbase-site.xml")
configuration.addResource(coreSiteIn)
configuration.addResource(hdfsSiteIn)
configuration.addResource(hbaseSiteIn)
UserGroupInformation.setConfiguration(configuration)
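// loginUserFromKeytab performs a process-wide (static) login; calling it again on later batches re-logs in from the keytab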
UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
val loginUser = UserGroupInformation.getLoginUser
loginUser.doAs(new PrivilegedAction[Connection] {
override def run(): Connection = ConnectionFactory.createConnection(configuration)
})
}
}
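getHBaseConn is invoked once per partition per batch, and an HBase Connection is expensive to create. Here is a minimal sketch of a per-executor cache, under the assumption that a JVM-wide singleton is acceptable (HBaseConnectionPool is a hypothetical helper, not part of the original code; with it, writeHbase would stop closing the connection per partition):
package com.qingzhongli.kafka2hbase.sparkstreaming.util
import org.apache.hadoop.hbase.client.Connection

// hypothetical helper: cache one Connection per executor JVM
object HBaseConnectionPool {
  @volatile private var conn: Connection = _

  def get(principal: String, keytabPath: String): Connection = {
    if (conn == null || conn.isClosed) {
      synchronized {
        if (conn == null || conn.isClosed) {
          conn = HBaseUtil.getHBaseConn(principal, keytabPath)
          // close the cached connection when the executor JVM shuts down
          sys.addShutdownHook { if (conn != null && !conn.isClosed) conn.close() }
        }
      }
    }
    conn
  }
}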
4. Submitting the job
#!/bin/sh
APP_LIB='/kafka-to-hbase-sparkstreaming/lib'
for JAR in `ls $APP_LIB/*.jar |grep -v kafka-to-hbase-sparkstreaming`
do
LIBJARS=$JAR,$LIBJARS
done
APP_JAR=`ls $APP_LIB/*.jar | grep kafka-to-hbase-sparkstreaming`
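# Note: per the Spark-on-YARN docs (see References), the file passed to --keytab must not
# share a name with the keytab shipped via --files (the same file name cannot be
# distributed twice), hence qingzhongli-copy.keytab below.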
spark-submit --class com.qingzhongli.kafka2hbase.sparkstreaming.Main \
--queue qingzhongli \
--master yarn \
--deploy-mode cluster \
--executor-memory 150G \
--num-executors 14 \
--executor-cores 30 \
--driver-cores 3 \
--driver-memory 8g \
--conf spark.scheduler.mode=FIFO \
--conf spark.executor.memoryOverhead=8192 \
--conf spark.yarn.maxAppAttempts=5 \
--conf spark.locality.wait=50 \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.streaming.kafka.maxRatePerPartition=6000 \
--conf spark.network.timeout=300000 \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--files ../conf/jaas.conf,../conf/qingzhongli.keytab,../conf/core-site.xml,../conf/hbase-site.xml,../conf/hdfs-site.xml,../conf/SystemConfig.properties \
--principal "[email protected]" \
--keytab /kafka-to-hbase-sparkstreaming/conf/qingzhongli-copy.keytab \
--jars $LIBJARS \
$APP_JAR
References:
- Spark2Streaming reading Kerberos-secured Kafka and writing data to HBase
- https://spark.apache.org/docs/2.4.4/running-on-yarn.html#important-notes
- https://www.cnblogs.com/itboys/p/9962840.html