This article walks through, with an example, the steps for writing data from Spark Streaming into a Kerberos-authenticated HBase cluster.
1. Add the required dependencies
Component versions in the production environment:
- hadoop 3.1.4
- spark 2.4.4
- hbase 2.1.6
The corresponding pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.qingzhongli</groupId>
<artifactId>kafka-to-hbase-sparkstreaming</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<hbase.version>2.1.6</hbase.version>
<hadoop.version>3.1.4</hadoop.version>
<spark.version>2.4.4</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<artifactId>hadoop-common</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
<exclusion>
<artifactId>hadoop-auth</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<artifactId>commons-cli</artifactId>
<groupId>commons-cli</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<artifactId>hadoop-client</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Compile the Scala code into the jar -->
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Copy the dependency jars -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>prepare-package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<!-- ${project.build.directory} is a built-in Maven variable; it defaults to target -->
<outputDirectory>${project.build.directory}/dest/${project.artifactId}/lib</outputDirectory>
<!--
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>false</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
-->
<!-- Whether to exclude transitive dependencies -->
<excludeTransitive>false</excludeTransitive>
<!-- Whether to strip the version from the copied jar file names -->
<stripVersion>false</stripVersion>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- addClasspath tells maven-jar-plugin whether to add a Class-Path entry to MANIFEST.MF listing all dependencies -->
<addClasspath>false</addClasspath>
<classpathPrefix></classpathPrefix>
</manifest>
<!-- Do not include the project's pom.xml under META-INF inside the jar -->
<addMavenDescriptor>false</addMavenDescriptor>
</archive>
<outputDirectory>${project.build.directory}/dest/${project.artifactId}/lib</outputDirectory>
</configuration>
</plugin>
</plugins>
</build>
<!-- To build, cd to the directory containing this pom and run: mvn clean package -->
</project>
Note: spark-core_2.11 itself depends on hadoop-client, but on version 2.6.5, which conflicts with hadoop 3.1.4. It must be excluded (as done in the spark-core dependency above); otherwise the Spark job will fail at runtime.
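To confirm the exclusion took effect, you can inspect the resolved dependency tree with standard Maven (the -Dincludes filter narrows the output to hadoop-client):
mvn dependency:tree -Dincludes=org.apache.hadoop:hadoop-client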
2. Prepare the configuration files
Obtain the HBase-related configuration files from the production cluster, including:
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
and the Kerberos authentication file:
- qingzhongli.keytab
At submit time, ship all five files (core-site.xml, hbase-site.xml, hdfs-site.xml, qingzhongli.keytab, and jaas.conf) with the --files option. They are placed in each YARN container's working directory, which is on the classpath, so the program can load them directly by name, for example:
val coreSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("core-site.xml")
The jaas.conf file used for consuming Kafka (keyTab is a bare file name because it resolves against the same working directory):
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=true
serviceName="kafka"
keyTab="qingzhongli.keytab"
principal="[email protected]";
};
3. Implementation
The program entry point, Main:
package com.qingzhongli.kafka2hbase.sparkstreaming
import com.qingzhongli.kafka2hbase.sparkstreaming.util.HBaseUtil
import org.apache.commons.logging.LogFactory
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main {
val logger = LogFactory.getLog(getClass())
def main(args: Array[String]): Unit = {
val brokers = "192.168.37.100:9092,192.168.37.101:9092,192.168.37.102:9092"
val topicsSet = "topic1".split(",").toSet
val groupId = "group1"
val autoOffsetReset = "latest"
val keytabPrincipal = "[email protected]"
val keytabFilePath = "qingzhongli.keytab"
val ss = SparkSession
.builder()
.appName("kafka-to-hbase-sparkstreaming")
.getOrCreate()
val sc = ss.sparkContext
val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
"auto.offset.reset" -> autoOffsetReset,
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean), // offsets are committed manually in storeKafkaOffset
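// Kafka SASL/Kerberos settings; the driver and executor JVMs locate jaas.conf via -Djava.security.auth.login.config (set in the submit script, section 4)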
"security.protocol" -> "SASL_PLAINTEXT",
"sasl.kerberos.service.name" -> "kafka"
)
// check offset range
import collection.JavaConverters._
KafkaUtil.offsetCheck(kafkaParams.asJava, topicsSet.asJava)
try {
// create direct stream
val stream = KafkaUtils.createDirectStream[String, String](
ssc, PreferConsistent, Subscribe[String, String](topicsSet, kafkaParams)
)
// write to HBase
writeHbase(keytabPrincipal, keytabFilePath, stream)
// commit the offsets back to Kafka
storeKafkaOffset(stream)
} catch {
case ex: Exception => logger.error(ex.getMessage, ex)
}
ssc.start()
ssc.awaitTermination()
}
private def writeHbase(keytabPrincipal: String, keytabFilePath: String, stream: InputDStream[ConsumerRecord[String, String]]) = {
stream.map(_.value())
.foreachRDD(rdd => {
rdd.foreachPartition(partitionRecords => {
val connection = HBaseUtil.getHBaseConn(keytabPrincipal, keytabFilePath)
var table: Table = null
try {
val tableName = TableName.valueOf("qingzhongli:test")
table = connection.getTable(tableName)
partitionRecords.foreach(record => {
val cf = "cf1"
// assumes the Kafka message value is a comma-delimited string: "rowkey,f1,f2"
val fields = record.split(",")
val rowKey = fields(0)
val f1 = fields(1)
val f2 = fields(2)
val put: Put = new Put(Bytes.toBytes(rowKey))
//put.setDurability(Durability.SKIP_WAL)
put.addColumn(Bytes.toBytes(cf), Bytes.toBytes("f1"), Bytes.toBytes(f1))
put.addColumn(Bytes.toBytes(cf), Bytes.toBytes("f2"), Bytes.toBytes(f2))
table.put(put)
})
} catch {
case ex: Exception => logger.error(ex.getMessage, ex)
} finally {
if (table != null) {
table.close()
}
if (connection != null) {
connection.close()
}
}
})
})
}
private def storeKafkaOffset(messages: InputDStream[ConsumerRecord[String, String]]) = {
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// some time later, after outputs have completed
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
}
}
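In writeHbase, table.put(put) issues one RPC per record. Below is a minimal batching variant of the partition loop, assuming the same comma-delimited "rowkey,f1,f2" record layout (Table.put also accepts a java.util.List[Put], so the whole partition goes out in far fewer round trips):
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

// sketch: collect the partition's Puts and write them in one batch
val puts = new ListBuffer[Put]()
partitionRecords.foreach { record =>
  val fields = record.split(",") // assumed delimiter
  val put = new Put(Bytes.toBytes(fields(0)))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("f1"), Bytes.toBytes(fields(1)))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("f2"), Bytes.toBytes(fields(2)))
  puts += put
}
table.put(puts.asJava) // one batched call instead of one RPC per record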
KafkaUtil, which checks whether each partition's committed offset still falls inside the range Kafka currently retains and, if not, re-commits an adjusted offset (otherwise the job can fail with an out-of-range offset after retention deletes old segments):
package com.qingzhongli.kafka2hbase.sparkstreaming;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.*;
public class KafkaUtil {
private static final Logger LOG = LoggerFactory.getLogger(KafkaUtil.class);
public static void offsetCheck(Map<String, Object> kafkaParams, Set<String> topics) {
final OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(kafkaParams.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toString().toUpperCase(Locale.ROOT));
if (OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy) || OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
LOG.info("Going to reset consumer offsets");
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaParams);
LOG.info("Fetching current state");
final List<TopicPartition> parts = new LinkedList<>();
final Map<TopicPartition, OffsetAndMetadata> currentCommitted = new HashMap<>();
for (String topic : topics) {
List<PartitionInfo> info = consumer.partitionsFor(topic);
for (PartitionInfo i : info) {
final TopicPartition p = new TopicPartition(topic, i.partition());
final OffsetAndMetadata m = consumer.committed(p);
parts.add(p);
if (m != null) {
LOG.info("consumer[topic:{}-{}, offset:{}]", topic, i.partition(), m.offset());
}
currentCommitted.put(p, m);
}
}
Map<TopicPartition, Long> beginning = new HashMap<>();
Map<TopicPartition, Long> ending = new HashMap<>();
for (String topic : topics) {
List<PartitionInfo> list = consumer.partitionsFor(topic);
for (PartitionInfo pi : list) {
TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
List<TopicPartition> tpList = new ArrayList<>(1);
tpList.add(tp);
consumer.assign(tpList);
consumer.seekToBeginning(tpList);
Long offset = consumer.position(tp);
beginning.put(tp, offset);
LOG.info("producer[topic:{}-{}, beginning offset: {}]", pi.topic(), pi.partition(), offset);
}
}
for (String topic : topics) {
List<PartitionInfo> list = consumer.partitionsFor(topic);
for (PartitionInfo pi : list) {
TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
List<TopicPartition> tpList = new ArrayList<>(1);
tpList.add(tp);
consumer.assign(tpList);
consumer.seekToEnd(tpList);
Long offset = consumer.position(tp);
ending.put(tp, offset);
LOG.info("producer[topic:{}-{}, ending offset: {}]", pi.topic(), pi.partition(), offset);
}
}
LOG.info("Finding what offsets need to be adjusted");
final Map<TopicPartition, OffsetAndMetadata> newCommit = new HashMap<>();
for (TopicPartition part : parts) {
final OffsetAndMetadata m = currentCommitted.get(part);
final Long begin = beginning.get(part);
final Long end = ending.get(part);
if (m == null || m.offset() < begin) {
LOG.info("Adjusting partition {}-{}; OffsetAndMeta={} Beginning={} End={}", part.topic(), part.partition(), m, begin, end);
final OffsetAndMetadata newMeta;
if (OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(begin);
} else if (OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(end);
} else {
newMeta = null;
}
LOG.info("New offset to be {}", newMeta);
if (newMeta != null) {
newCommit.put(part, newMeta);
}
}
}
consumer.commitSync(newCommit);
consumer.close();
}
}
}
HBaseUtil:
package com.qingzhongli.kafka2hbase.sparkstreaming.util
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
import org.apache.hadoop.security.UserGroupInformation
import java.security.PrivilegedAction
/**
* @author qingzhongli.com
*/
object HBaseUtil {
/**
* Performs a Kerberos login from the given keytab and returns an HBase Connection
* created as the logged-in user.
*
* @param principal  the Kerberos principal, e.g. [email protected]
* @param keytabPath path to the keytab file (a bare file name resolves against the container working directory)
* @return an HBase Connection
*/
def getHBaseConn(principal: String, keytabPath: String): Connection = {
val configuration = HBaseConfiguration.create
val coreSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("core-site.xml")
val hdfsSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("hdfs-site.xml")
val hbaseSiteIn = HBaseUtil.getClass.getClassLoader.getResourceAsStream("hbase-site.xml")
configuration.addResource(coreSiteIn)
configuration.addResource(hdfsSiteIn)
configuration.addResource(hbaseSiteIn)
UserGroupInformation.setConfiguration(configuration)
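// loginUserFromKeytab performs a process-wide (static) login; calling it again on later batches re-logs in from the keytab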
UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
val loginUser = UserGroupInformation.getLoginUser
loginUser.doAs(new PrivilegedAction[Connection] {
override def run(): Connection = ConnectionFactory.createConnection(configuration)
})
}
}
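getHBaseConn is invoked once per partition per batch, and an HBase Connection is expensive to create. Here is a minimal sketch of a per-executor cache, under the assumption that a JVM-wide singleton is acceptable (HBaseConnectionPool is a hypothetical helper, not part of the original code; with it, writeHbase would stop closing the connection per partition):
package com.qingzhongli.kafka2hbase.sparkstreaming.util
import org.apache.hadoop.hbase.client.Connection

// hypothetical helper: cache one Connection per executor JVM
object HBaseConnectionPool {
  @volatile private var conn: Connection = _

  def get(principal: String, keytabPath: String): Connection = {
    if (conn == null || conn.isClosed) {
      synchronized {
        if (conn == null || conn.isClosed) {
          conn = HBaseUtil.getHBaseConn(principal, keytabPath)
          // close the cached connection when the executor JVM shuts down
          sys.addShutdownHook { if (conn != null && !conn.isClosed) conn.close() }
        }
      }
    }
    conn
  }
}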
4. Submitting the job
#!/bin/sh
APP_LIB='/kafka-to-hbase-sparkstreaming/lib'
for JAR in `ls $APP_LIB/*.jar |grep -v kafka-to-hbase-sparkstreaming`
do
LIBJARS=$JAR,$LIBJARS
done
APP_JAR=`ls $APP_LIB/*.jar | grep kafka-to-hbase-sparkstreaming`
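# Note: per the Spark-on-YARN docs (see References), the file passed to --keytab must not
# share a name with the keytab shipped via --files (the same file name cannot be
# distributed twice), hence qingzhongli-copy.keytab below.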
spark-submit --class com.qingzhongli.kafka2hbase.sparkstreaming.Main \
--queue qingzhongli \
--master yarn \
--deploy-mode cluster \
--executor-memory 150G \
--num-executors 14 \
--executor-cores 30 \
--driver-cores 3 \
--driver-memory 8g \
--conf spark.scheduler.mode=FIFO \
--conf spark.executor.memoryOverhead=8192 \
--conf spark.yarn.maxAppAttempts=5 \
--conf spark.locality.wait=50 \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.streaming.kafka.maxRatePerPartition=6000 \
--conf spark.network.timeout=300000 \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--files ../conf/jaas.conf,../conf/qingzhongli.keytab,../conf/core-site.xml,../conf/hbase-site.xml,../conf/hdfs-site.xml,../conf/SystemConfig.properties \
--principal "[email protected]" \
--keytab /kafka-to-hbase-sparkstreaming/conf/qingzhongli-copy.keytab \
--jars $LIBJARS \
$APP_JAR
References:
- Spark2Streaming reading Kerberos-secured Kafka and writing data to HBase
- https://spark.apache.org/docs/2.4.4/running-on-yarn.html#important-notes
- https://www.cnblogs.com/itboys/p/9962840.html