1 Version requirements
Spark version: spark-2.3.0-bin-hadoop2.7
Phoenix version: apache-phoenix-4.14.1-HBase-1.4-bin
HBase version: hbase-1.4.2
The versions above must match each other; otherwise errors will occur.
2 Phoenix + HBase + Spark integration
A: Install HBase. This step is omitted here; it is assumed you already know how.
B: Integrate Phoenix with HBase. See: https://blog.csdn.net/tototuzuoquan/article/details/81506285. Note that the supported version combination is apache-phoenix-4.14.1-HBase-1.4-bin + spark-2.3.0-bin-hadoop2.7 + hadoop-3.0.1; older Phoenix versions will cause Spark + Phoenix programs to fail. You can download the apache-phoenix-4.14.1-HBase-1.4-bin release (updated mid-November 2018) from the official site.
C: When integrating Phoenix with Spark, do not forget to add the following to $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraClassPath /data/installed/apache-phoenix-4.14.1-HBase-1.4-bin/phoenix-spark-4.14.1-HBase-1.4.jar:/data/installed/apache-phoenix-4.14.1-HBase-1.4-bin/phoenix-4.14.1-HBase-1.4-client.jar
spark.executor.extraClassPath /data/installed/apache-phoenix-4.14.1-HBase-1.4-bin/phoenix-spark-4.14.1-HBase-1.4.jar:/data/installed/apache-phoenix-4.14.1-HBase-1.4-bin/phoenix-4.14.1-HBase-1.4-client.jar
Without the configuration above, Spark + Phoenix jobs will fail at runtime.
3 Spark + Phoenix project
3.1 Project structure
3.2 Configuring the pom file
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>${joda-time.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala-library.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark-core_2.11.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark-streaming_2.11.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop-hdfs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop-client.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop-common.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>${fastjson.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark-sql_2.11.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark-hive_2.11.version}</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<!-- The following dependencies are needed when operating on HBase directly -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>com.lmax</groupId>
<artifactId>disruptor</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>${phoenix.version}</version>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>${phoenix.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<!--<scope>test</scope>-->
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--<arg>-make:transitive</arg>-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>xxx.xxx.bigdata.xxxx.member.MemberGenerator</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
The following dependencies are the key to integrating Phoenix:
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>${phoenix.version}</version>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>${phoenix.version}</version>
<scope>provided</scope>
</dependency>
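The ${phoenix.version} placeholder used above must be declared in the pom's <properties> section. For the versions required in section 1 it would look like the following sketch (property names here simply match the placeholders in the dependency snippets):

```xml
<properties>
<!-- Maven version string of the apache-phoenix-4.14.1-HBase-1.4 release -->
<phoenix.version>4.14.1-HBase-1.4</phoenix.version>
<hbase.version>1.4.2</hbase.version>
</properties>
```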
3.3 Creating the Phoenix table
DROP TABLE DATA_CENTER_MEMBER;
CREATE TABLE IF NOT EXISTS DATA_CENTER_MEMBER(
PK VARCHAR PRIMARY KEY,
AREA_CODE VARCHAR(10),
AREA_NAME VARCHAR(30),
AGENT_ID VARCHAR(30),
AGENT_NAME VARCHAR(20),
SHOP_ID VARCHAR(40),
USER_ID VARCHAR(40),
BUYBACK_CYCLE DECIMAL(15,4),
PURCHASE_NUM BIGINT,
FIRST_PAY_TIME BIGINT,
LAST_PAY_TIME BIGINT,
UNPURCHASE_TIME BIGINT,
COMMON TINYINT,
STORED TINYINT,
POPULARIZE TINYINT,
VERSION_DATE INTEGER,
ADD_TIME BIGINT,
CREATE_DATE INTEGER
) COMPRESSION='GZ', DATA_BLOCK_ENCODING='NONE' SPLIT ON ('0|','1|','2|','3|','4|','5|','6|','7|','8|','9|','10|','11|','12|','13|','14|', '15|','16|','17|','18|','19|','20|','21|','22|','23|','24|','25|','26|','27|','28|','29|','30|','31|','32|','33|','34|','35|','36|','37|','38|','39|', '40|','41|','42|','43|','44|','45|','46|','47|','48|','49|');
-- Use a covering index
CREATE index idx_version_date on DATA_CENTER_MEMBER(VERSION_DATE) include(PK,LAST_PAY_TIME,FIRST_PAY_TIME,PURCHASE_NUM);
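For example, the following query touches only the indexed column and the included columns, so Phoenix can answer it entirely from the covering index without going back to the data table (the date value is illustrative):

```sql
SELECT PK, LAST_PAY_TIME, FIRST_PAY_TIME, PURCHASE_NUM
FROM DATA_CENTER_MEMBER
WHERE VERSION_DATE = 20181120;
```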
The rowkey format is as follows:
PK is AREA_CODE%50|AREA_CODE|AGENT_ID|ADD_TIME|SHOP_ID|USER_ID (50 pre-split regions)
Queries should be based on the rowkey whenever possible; rowkey lookups generally return in milliseconds.
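As an illustration, the salting scheme behind this rowkey can be sketched in plain Java (field values here are hypothetical; the salt prefix is AREA_CODE modulo the 50 pre-split buckets):

```java
public class RowKeyDemo {
    // Build the PK: AREA_CODE%50|AREA_CODE|AGENT_ID|ADD_TIME|SHOP_ID|USER_ID
    public static String buildPk(String areaCode, String agentId, long addTime,
                                 String shopId, String userId) {
        int salt = Integer.parseInt(areaCode) % 50; // matches the 50 SPLIT ON buckets
        return salt + "|" + areaCode + "|" + agentId + "|" + addTime
                + "|" + shopId + "|" + userId;
    }

    public static void main(String[] args) {
        // 110 % 50 = 10, so this row lands in the '10|' pre-split region
        System.out.println(buildPk("110", "A001", 1542240000L, "S01", "U01"));
    }
}
```

Prefixing the key with the salt spreads writes evenly across the 50 regions, while keeping all rows of one area/agent contiguous for range scans.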
3.4 Spark Scala + Phoenix code
package xxx.xxx.bigdata.xxxx.member
import java.util.Date
import xxx.xxx.bigdata.common.utils.DateUtils
import org.apache.spark.sql.SparkSession
import org.apache.phoenix.spark._ // provides the saveToPhoenix method on RDDs
import scala.collection.mutable.ListBuffer
object MemberGenerator {
/**
* If an argument is supplied, return it directly; otherwise default to the previous day
* @param args :program arguments
* @param pattern :date format pattern
* @return
*/
def gainDayByArgsOrSysCreate(args: Array[String],pattern: String):String = {
//If an argument is supplied, return it directly; otherwise default to the previous day
if(args.length > 0) {
args(0)
} else {
val previousDay = DateUtils.addOrMinusDay(new Date(), -1);
DateUtils.dateFormat(previousDay, pattern);
}
}
/**
* Process member data
* @param args :arguments. args(0): which day's data to process; args(1): the Phoenix ZooKeeper connection string
*/
def main(args: Array[String]): Unit = {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("MemberGenerator")
//.master("local[*]")
.master("spark://bigdata1:7077")
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
//Memory used by the driver process
.config("spark.driver.memory","2g")
//Memory used by each executor process; same format as JVM memory strings (e.g. 512m, 2g)
.config("spark.executor.memory","2g")
.enableHiveSupport()
.getOrCreate();
val previousDayStr = gainDayByArgsOrSysCreate(args,"yyyy-MM-dd")
//Yesterday's date as a yyyyMMdd number minus 1 (initially the data is yesterday's; after processing it becomes the processing date). Note that plain numeric subtraction is wrong across month boundaries (e.g. 20181101 - 1 = 20181100)
val nowDate = DateUtils.getNowDateSimple().toLong - 1
//Read the member data for the day being processed (args(0) holds the date, not a path, so the path is built from previousDayStr)
val df1 = spark.read.json("/xxxx/data-center/member/" + previousDayStr + "/member.json");
spark.sql("use data_center");
//Only perform the operations below when there is data
if(df1.count() > 0) {
df1.createOrReplaceTempView("member_temp")
//Put the small table first to improve execution efficiency
val df2 = spark.sql(
"SELECT " +
" ts.areacode as AREA_CODE, " +
" ts.areaname as AREA_NAME, " +
" ts.agentid as AGENT_ID, " +
" ts.agentname as AGENT_NAME, " +
" mt.shopId as SHOP_ID, " +
" mt.userId as USER_ID, " +
" mt.commonMember as COMMON, " +
" mt.storedMember as STORED, " +
" mt.popularizeMember as POPULARIZE," +
" mt.addTime as ADD_TIME " +
"FROM " +
" tb_shop ts," +
" member_temp mt " +
"WHERE " +
" ts.shopId = mt.shopId ")
df2.createOrReplaceTempView("member_temp2")
//df2.show()
//Note: to modify data through Spark, first perform the load below, and only then call saveToPhoenix
val df = spark.read
.format("org.apache.phoenix.spark")
.options(
Map("table" -> "DATA_CENTER_MEMBER",
"zkUrl" -> args(1))
//"zkUrl" -> "jdbc:phoenix:bigdata3:2181")
).load
df.show()
var list = new ListBuffer[(Any,Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any,Any, Any,Any,Any)]
//df4 is assumed to come from joining member_temp2 with the Phoenix DataFrame loaded above; that join step is omitted from this listing
df4.collect().foreach(
x => {
//Get the time the user became a member, as an integer in yyyyMMdd format
val CREATE_DATE = Integer.parseInt(DateUtils.getLongToString(x.get(10).asInstanceOf[Long] * 1000,DateUtils.PATTERN_DATE_SIMPLE))
//println(">>>>>>>>>>" + x.get(5).toString + " " + x.get(6).toString + " " + x.get(9))
// PK is AREA_CODE%50|AREA_CODE|AGENT_ID|ADD_TIME|SHOP_ID|USER_ID (50 pre-split regions)
val PK = (x.get(0).toString.toInt % 50) + "|" + x.get(4).toString + "|" + x.get(5).toString
val temp = (
PK,
x.get(0),
x.get(1),
x.get(2),
x.get(3),
x.get(4),
x.get(5),
x.get(6),
x.get(7),
x.get(8),
x.get(9),
x.get(10),
x.get(11),
x.get(12),
nowDate,
CREATE_DATE
)
list.+=(temp)
//After every 10,000 records, save/update the batch
if (list.length >= 10000) {
spark.sparkContext.parallelize(list)
.saveToPhoenix(
"DATA_CENTER_MEMBER",
Seq("PK","AREA_CODE", "AREA_NAME", "AGENT_ID", "AGENT_NAME", "SHOP_ID", "USER_ID", "COMMON", "STORED", "POPULARIZE", "PURCHASE_NUM" , "ADD_TIME", "FIRST_PAY_TIME", "LAST_PAY_TIME","VERSION_DATE","CREATE_DATE"),
zkUrl = Some(args(1)))
//Clear the list once the batch has been written
list.clear()
//Sleep briefly to ease the write load
println("Sleeping for 200 ms")
Thread.sleep(200)
}
})
//If records remain in the list at the end
if(!list.isEmpty) {
spark.sparkContext.parallelize(list)
.saveToPhoenix(
"DATA_CENTER_MEMBER",
Seq("PK","AREA_CODE", "AREA_NAME", "AGENT_ID", "AGENT_NAME", "SHOP_ID", "USER_ID", "COMMON", "STORED", "POPULARIZE", "PURCHASE_NUM" , "ADD_TIME", "FIRST_PAY_TIME", "LAST_PAY_TIME","VERSION_DATE","CREATE_DATE"),
zkUrl = Some(args(1)))
list.clear()
}
}
spark.stop()
//Exit normally
System.exit(0)
}
}
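The batch-and-flush pattern used in the loop above (accumulate rows, write every 10,000, clear the buffer, flush the remainder at the end) can be sketched independently of Spark and Phoenix; the save callback below stands in for the saveToPhoenix call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchWriter {
    public static final int BATCH_SIZE = 10_000;

    // Feed rows into a buffer and hand full batches (plus the final remainder) to save;
    // returns the number of flushes performed
    public static int writeAll(Iterable<String> rows, Consumer<List<String>> save) {
        List<String> buffer = new ArrayList<>();
        int flushes = 0;
        for (String row : rows) {
            buffer.add(row);
            if (buffer.size() >= BATCH_SIZE) { // full batch: write and clear
                save.accept(buffer);
                buffer.clear();
                flushes++;
            }
        }
        if (!buffer.isEmpty()) { // remainder left after the loop
            save.accept(buffer);
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) rows.add("row-" + i);
        // 25,000 rows -> two full batches of 10,000 plus a remainder of 5,000
        System.out.println(writeAll(rows, batch -> {})); // prints 3
    }
}
```

Flushing fixed-size batches keeps driver memory bounded; the final flush ensures the last partial batch is not silently dropped.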
4 Spring Cloud + Phoenix code
The Spring Cloud configuration code is not listed here; only the Phoenix-related code is shown.
4.1 Project structure
4.2 pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>xxx.xxx.bigdata.xxxx.datacenter</groupId>
<artifactId>project-pom</artifactId>
<version>1.0.1-SNAPSHOT</version>
<relativePath>../project-pom/pom.xml</relativePath>
</parent>
<artifactId>member</artifactId>
<packaging>jar</packaging>
<name>member</name>
<description>Member data module</description>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>xxx.xxx.frame3</groupId>
<artifactId>common-utils</artifactId>
<version>${youx-frame3-common.version}</version>
</dependency>
<dependency>
<groupId>xxx.xxx.frame3</groupId>
<artifactId>common-dao</artifactId>
<version>${youx-frame3-common.version}</version>
</dependency>
<dependency>
<groupId>xxx.xxx.frame3</groupId>
<artifactId>common-service</artifactId>
<version>${youx-frame3-common.version}</version>
</dependency>
<dependency>
<groupId>xxx.xxx.frame3</groupId>
<artifactId>common-web</artifactId>
<version>${youx-frame3-common.version}</version>
</dependency>
<!-- Kylin JDBC driver for queries -->
<!--<dependency>
<groupId>org.apache.kylin</groupId>
<artifactId>kylin-jdbc</artifactId>
<version>${kylin-jdbc.version}</version>
</dependency>-->
<!-- Phoenix HBase plugin -->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>${phoenix-core.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Hot-deployment support (devtools) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-devtools</artifactId>
<optional>true</optional>
</dependency>
</dependencies>
<!-- Without the plugin below, the packaged jar cannot be run directly via java -jar xxx.jar -->
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<!-- When packaging the jar, configure the manifest and add the lib jars to the classpath -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<classesDirectory>${project.build.directory}/classes</classesDirectory>
<archive>
<manifest>
<mainClass>xxx.xxx.bigdata.xxxx.datacenter.member.MemberApplication</mainClass>
<!-- Do not record timestamped versions in MANIFEST.MF when packaging -->
<useUniqueVersions>false</useUniqueVersions>
<addClasspath>true</addClasspath>
<classpathPrefix>lib/</classpathPrefix>
</manifest>
<manifestEntries>
<Class-Path></Class-Path>
</manifestEntries>
</archive>
<excludes>
<!-- Files to exclude when packaging; regular expressions are supported -->
<exclude>bootstrap.properties</exclude>
</excludes>
</configuration>
</plugin>
<!-- Copy the dependency jars into a lib folder -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<type>jar</type>
<includeTypes>jar</includeTypes>
<outputDirectory>
${project.build.directory}/lib
</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
4.3 Contents of bootstrap.properties
server.port=6022
spring.application.name=member
#=================== Service registry (Spring Cloud Eureka) =============================
eureka.client.service-url.defaultZone=http://127.0.0.1:5000/eureka
#=================== Enable Feign (Spring Cloud Feign) =================================
#Enable the circuit breaker inside Feign
feign.hystrix.enabled=true
#=================== Config client (Spring Cloud Config) ===========================
#In high-availability mode the config-server group is discovered automatically by service id
spring.cloud.config.discovery.enabled=true
spring.cloud.config.discovery.service-id=config-server
# The naming rule is {spring.cloud.config.name}-{spring.cloud.config.profile}.properties; this sets the config file prefix (all projects' config files must start with application-dev)
spring.cloud.config.name=application
spring.cloud.config.profile=dev-sfsadas
# Branch configuration
spring.cloud.config.label=trunk
# Config server address; in high-availability mode service-id is configured, so the uri does not need to be specified here
#spring.cloud.config.uri=http://127.0.0.1:5002/
#============= Message bus parameters (Spring Cloud Bus) =================================
#Spring Cloud Bus links the distributed nodes with a lightweight message broker. It can broadcast configuration changes, handle inter-service communication,
#or be used for monitoring.
spring.rabbitmq.host=xxx.xxx.xxx.xxx
spring.rabbitmq.port=5672
spring.rabbitmq.username=guest
spring.rabbitmq.password=guest
spring.cloud.bus.enabled=true
spring.cloud.bus.trace.enabled=true
management.endpoints.web.exposure.include=bus-refresh
#=================== Distributed tracing (Spring Cloud Sleuth) =============================
spring.sleuth.web.client.enabled=true
#Set the sampling rate to 1.0 (sample everything); the default is 0.1
spring.sleuth.sampler.probability=1.0
spring.zipkin.base-url=http://localhost:9411
#=================== Database connection =============================
spring.datasource.name=member
#Use the Druid data source
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
#Data source connection url
spring.datasource.url=jdbc:phoenix:xxx.xxx.xxx.xxx:2181
spring.datasource.username=root
spring.datasource.password=123456
spring.datasource.driver-class-name=org.apache.phoenix.jdbc.PhoenixDriver
spring.datasource.filters=stat
spring.datasource.maxActive=60
spring.datasource.initialSize=10
spring.datasource.maxWait=60000
spring.datasource.minIdle=10
spring.datasource.timeBetweenEvictionRunsMillis=60000
spring.datasource.minEvictableIdleTimeMillis=300000
spring.datasource.validationQuery=SELECT 1
spring.datasource.testWhileIdle=true
spring.datasource.testOnBorrow=false
spring.datasource.testOnReturn=false
spring.datasource.poolPreparedStatements=true
spring.datasource.maxOpenPreparedStatements=10
#MyBatis configuration; see: https://www.cnblogs.com/lianggp/p/7573653.html
#Location of the XML files corresponding to MyBatis Mappers. If a Mapper has custom methods
# (with custom implementations in XML), this must be configured so the Mapper can find its XML file
mybatis-plus.mapper-locations=classpath*:/mapper/*.xml
# MyBatis type-alias package scan path. Classes in this package are registered with aliases,
# so the Mapper XML files can use the bare class name instead of the fully qualified name.
mybatis-plus.type-aliases-package=xxx.xxx.bigdata.xxx.datacenter.member.entity
#Whether to check for the MyBatis XML config file at startup; not checked by default
mybatis-plus.check-config-location=false
# Database type; defaults to an unknown type. If set to OTHER, the type is derived from the connection url at startup; otherwise it is not auto-detected
mybatis-plus.global-config.db-config.db-type=mysql
# Print the executed SQL with the setting below
mybatis-plus.configuration.log-impl=org.apache.ibatis.logging.stdout.StdOutImpl
#Return results in camel case by default
#mybatis-plus.configuration.map-underscore-to-camel-case=true
#PageHelper pagination plugin
pagehelper.helper-dialect=mysql
pagehelper.reasonable=true
pagehelper.support-methods-arguments=true
pagehelper.params=count=countSql
The key point above is the datasource (connection pool) configuration.
4.4 Contents of MemberApplication
package xxx.xxx.bigdata.xxxx.datacenter.member;
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.cloud.netflix.eureka.EnableEurekaClient;
import org.springframework.cloud.netflix.hystrix.EnableHystrix;
import org.springframework.cloud.openfeign.EnableFeignClients;
@SpringBootApplication
@MapperScan("xxx.xxx.bigdata.xxxx.datacenter.member.mapper")
@EnableAutoConfiguration
@EnableEurekaClient
@EnableDiscoveryClient
@EnableFeignClients
@EnableHystrix
@RefreshScope
public class MemberApplication {
public static void main(String[] args) {
SpringApplication.run(MemberApplication.class, args);
}
}
4.5 Other code
The rest of the project code is omitted; it follows the usual conventions.
Note that Phoenix does not support transactions, so do not use annotations such as @Transactional; otherwise errors will occur.