I. Introduction to Griffin

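A data quality module is an indispensable component of any big data platform. Apache Griffin (hereafter Griffin) is an open-source data quality solution for big data. It supports both batch and streaming modes of data quality checking and can measure data assets along different dimensions (for example, checking after an offline job finishes whether the source and target record counts match, or counting the null values in a source table), which improves the accuracy and trustworthiness of the data.

Griffin's architecture is divided into three main parts, Define, Measure and Analyze, as shown in the figure below: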

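The responsibilities of each part are as follows:

Define: defines the dimensions of the data quality statistics, such as the time span to cover and the targets to measure (whether the source and target record counts match; for a given field in a data source, the number of non-null values, the number of distinct values, the maximum, the minimum, the counts of the top 5 values, and so on)
Measure: runs the statistics jobs and produces the results
Analyze: stores and displays the results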
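
To make these responsibilities concrete, here is a minimal Python sketch of the two kinds of checks mentioned above: a source-versus-target match rate and a null count. It only illustrates the idea on in-memory tuples. The function names `match_rate` and `null_count` are made up for this example and are not Griffin APIs; Griffin itself runs such measures as Spark jobs.

```python
def match_rate(source_rows, target_rows):
    """Count source rows that also appear in the target.

    Returns (matched, total, ratio). Rows are compared as whole tuples.
    """
    target_set = set(target_rows)
    matched = sum(1 for row in source_rows if row in target_set)
    total = len(source_rows)
    # An empty source has nothing to mismatch, so treat it as fully matched.
    ratio = matched / total if total else 1.0
    return matched, total, ratio


def null_count(rows, column_index):
    """Count rows whose value in the given column is None."""
    return sum(1 for row in rows if row[column_index] is None)


# Hypothetical sample data: (id, value) pairs.
source = [(1, "a"), (2, "b"), (3, None)]
target = [(1, "a"), (3, None)]

matched, total, ratio = match_rate(source, target)
print(matched, total, round(ratio, 3))
print(null_count(source, 1))
```

In Griffin terms, the rule (what to compare) belongs to Define, running the comparison belongs to Measure, and storing and charting the resulting numbers belongs to Analyze.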

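Based on these capabilities, our big data platform plans to adopt Griffin as its data quality solution for data consistency checks, null-value statistics and similar functions. The installation steps are summarized below (see also the Chinese-language Quick Start).

II. Installation and Deployment

2.1 Dependencies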

JDK (1.8 or later versions)
MySQL (version 5.6 or later)
Hadoop (2.6.0 or later)
Hive (version 2.x)
Spark (version 2.2.1)
Livy (livy-0.5.0-incubating)
ElasticSearch (5.0 or later versions)
Scala (2.x or later versions)

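Griffin is being installed on an Ambari-managed cluster, so for now we only need to install ES and Scala and initialize the Griffin metadata database.

1. Initialization

For details on initialization, see the Apache Griffin Deployment Guide. The Hadoop cluster and Hive installation steps are omitted here.

Create the database quartz in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables.

create database quartz character set utf8;
CREATE USER 'quartz'@'%' IDENTIFIED BY 'L1234567';
GRANT ALL PRIVILEGES ON *.* TO 'quartz'@'%';
FLUSH PRIVILEGES;
use quartz;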

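2. Hadoop and Hive

Create the /home/spark_conf directory on the Hadoop server (on HDFS, since sparkProperties.json later refers to hdfs:///home/spark_conf/hive-site.xml) and upload Hive's configuration file hive-site.xml to it.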

3. Installing Scala

Download the package: https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

## Extract it to /usr/scala/

export SCALA_HOME=/usr/scala/scala-2.11.8

export CLASSPATH=$SCALA_HOME/lib/

export PATH=$PATH:$SCALA_HOME/bin

## Set the environment variables

export SPARK_HOME=/usr/hdp/3.1.4.0-315/spark2

export LIVY_HOME=/usr/hdp/3.1.4.0-315/livy2/bin

export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
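
Before moving on, it can help to confirm that these variables are set and point at real directories. The following is a small sketch for that, not part of Griffin or Ambari; the helper name `missing_dirs` is an assumption made for this example.

```python
import os


def missing_dirs(names, env=None):
    """Return the variable names that are unset, empty, or not a directory."""
    env = os.environ if env is None else env
    problems = []
    for name in names:
        value = env.get(name)
        if not value or not os.path.isdir(value):
            problems.append(name)
    return problems


required = ["SCALA_HOME", "SPARK_HOME", "LIVY_HOME", "HADOOP_CONF_DIR"]
print("Missing or invalid:", missing_dirs(required))
```

Run it in the same shell session where the `export` lines were applied; an empty list means every variable resolves to an existing directory.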

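4. Installing and Starting ES

Tarball download link: ElasticSearch5.5.2.tar

1. Create a dedicated es user on both machines (required, because Elasticsearch refuses to start as root)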

useradd es

passwd es

Password: es

2. On both machines, run visudo as root and grant privileges to the es user

root ALL=(ALL) ALL

es ALL=(ALL) ALL

3. Extract ES to /usr/, copy it to the second node, and give the es user ownership

scp -r /usr/elasticsearch-5.5.2/ hdp02:/usr/

chown -R es /usr/elasticsearch-5.5.2/

Disconnect your SSH client, then reconnect to both Linux servers as the es user.

4. Edit the configuration file

cd /usr/elasticsearch-5.5.2
mkdir data
mkdir log
cd /usr/elasticsearch-5.5.2/config/
rm -rf elasticsearch.yml
vim elasticsearch.yml

hdp01

cluster.name: hdp_es

node.name: hdp01

path.data: /usr/elasticsearch-5.5.2/data

path.logs: /usr/elasticsearch-5.5.2/log

network.host: 10.168.138.188

http.port: 9200

discovery.zen.ping.unicast.hosts: ["hdp01", "hdp02"]

bootstrap.system_call_filter: false

bootstrap.memory_lock: false

http.cors.enabled: true

http.cors.allow-origin: "*"

hdp02

cluster.name: hdp_es

node.name: hdp02

path.data: /usr/elasticsearch-5.5.2/data

path.logs: /usr/elasticsearch-5.5.2/log

network.host: 10.174.96.212

http.port: 9200

discovery.zen.ping.unicast.hosts: ["hdp01", "hdp02"]

bootstrap.system_call_filter: false

bootstrap.memory_lock: false

http.cors.enabled: true

http.cors.allow-origin: "*"
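
The two files differ only in node.name and network.host, so they can be generated from one template instead of being edited by hand. Below is a minimal sketch of that idea; the function `render_es_config` and its parameters are invented for this example, and the values are the ones used in this post.

```python
def render_es_config(node_name, network_host, cluster_name="hdp_es",
                     seed_hosts=("hdp01", "hdp02"),
                     home="/usr/elasticsearch-5.5.2"):
    """Render the elasticsearch.yml text for one node of the cluster."""
    hosts = ", ".join('"%s"' % host for host in seed_hosts)
    lines = [
        "cluster.name: %s" % cluster_name,
        "node.name: %s" % node_name,
        "path.data: %s/data" % home,
        "path.logs: %s/log" % home,
        "network.host: %s" % network_host,
        "http.port: 9200",
        "discovery.zen.ping.unicast.hosts: [%s]" % hosts,
        "bootstrap.system_call_filter: false",
        "bootstrap.memory_lock: false",
        "http.cors.enabled: true",
        'http.cors.allow-origin: "*"',
    ]
    return "\n".join(lines) + "\n"


# Node name -> IP address, as used in this post.
nodes = {"hdp01": "10.168.138.188", "hdp02": "10.174.96.212"}
for name, address in nodes.items():
    print("# ---- %s ----" % name)
    print(render_es_config(name, address))
```

Each rendered text can then be written to /usr/elasticsearch-5.5.2/config/elasticsearch.yml on the matching machine.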

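Start ES in the background. You can also start it in the foreground first to see any error messages.

nohup /usr/elasticsearch-5.5.2/bin/elasticsearch 2>&1 &

Error: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Run the following command on both machines. A value set with sysctl -w does not survive a reboot, so run it again before starting ES whenever the machine has been restarted.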

sudo sysctl -w vm.max_map_count=262144

[es@hdp01 bin]$ sudo sysctl -w vm.max_map_count=262144

vm.max_map_count = 262144

[es@hdp01 bin]$ nohup /usr/elasticsearch-5.5.2/bin/elasticsearch 2>&1 &

Open port 9200 on both machines. If the information below appears, the installation succeeded.
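
The same check can be scripted. The sketch below queries Elasticsearch's `_cluster/health` endpoint and tests the `status` and `number_of_nodes` fields of the response. The helper names `fetch_health` and `cluster_ok` are assumptions for this example, and accepting "yellow" as healthy is a choice made here (a yellow cluster is serving requests but has unassigned replica shards).

```python
import json
from urllib.request import urlopen


def cluster_ok(health, expected_nodes=2):
    """Decide whether a parsed _cluster/health response looks healthy."""
    return (health.get("status") in ("green", "yellow")
            and health.get("number_of_nodes") == expected_nodes)


def fetch_health(base_url, timeout=5):
    """GET <base_url>/_cluster/health and return the parsed JSON body."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against the two nodes from this post (needs network access):
#   for host in ("10.168.138.188", "10.174.96.212"):
#       health = fetch_health("http://%s:9200" % host)
#       print(host, health["status"], cluster_ok(health))
```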

5. Create the griffin index in ES

curl -H "Content-Type: application/json" -XPUT http://10.168.138.188:9200/griffin? -d '
{
  "aliases": {},
  "mappings": {
    "accuracy": {
      "properties": {
        "name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "tmst": {
          "type": "date"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}'

If the response highlighted in the red box below appears, the index was created successfully.

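2.2 Building and Deploying from Source

Here I deploy Griffin by compiling and packaging it from source. The Griffin source is at https://github.com/apache/griffin.git, and the tag I use is griffin-0.6.0.

After downloading, import it into IDEA. The expanded source tree looks like this:

Griffin's source layout is clear. It consists of four main modules: griffin-doc, measure, service and ui. griffin-doc holds Griffin's documentation. measure interacts with Spark and runs the statistics jobs. service is implemented with Spring Boot; it provides the RESTful APIs that the ui module needs, stores the statistics jobs and serves the results.

1. service/src/main/resources/application.properties

Note: application.properties changes Spring Boot's default port from 8080 to 8090, because our Ambari also uses port 8080.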

server.port=8090
spring.application.name=griffin_service

spring.datasource.url=jdbc:mysql://10.168.138.188:3306/quartz?useSSL=false
spring.datasource.username=quartz
spring.datasource.password=L1234567
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore
hive.metastore.uris=thrift://10.168.138.188:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://10.168.138.188:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# elasticsearch
elasticsearch.host=10.168.138.188
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://10.168.138.188:8998/batches
# yarn url
yarn.uri=http://10.168.138.188:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook

logging.file=logs/griffin-service.log
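
Several of the time-related values above are easier to check with a little arithmetic. The sketch below parses the `predicate.job.interval` format described in the comments (a number followed by s, m, h or d) and confirms that 604800000 milliseconds is 7 days. The helper `interval_seconds` is written for this example only and is not how Griffin parses the value internally.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_seconds(text):
    """Convert a value such as '5m' into seconds (units: s, m, h, d)."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError("unsupported interval: %r" % text)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


interval = interval_seconds("5m")   # predicate.job.interval
repeat_count = 12                   # predicate.job.repeat.count
print("seconds between predicate checks:", interval)
print("interval x repeat count, in minutes:", interval * repeat_count // 60)

# jobInstance.expired.milliseconds: 7 days expressed in milliseconds
print(7 * interval_seconds("1d") * 1000 == 604800000)
```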

Because we use MySQL as the database, edit service/pom.xml and uncomment the MySQL driver dependency (around line 140):

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql.java.version}</version>
</dependency>

2. service/src/main/resources/quartz.properties

org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# With MySQL use the standard JDBC delegate; PostgreSQLDelegate is only for PostgreSQL
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000

3. service/src/main/resources/sparkProperties.json

{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
  },
  "files": []
}
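
The keys in this file correspond to fields of Livy's POST /batches REST API, which is the endpoint configured as livy.uri in application.properties. As a rough illustration of what that means, the sketch below builds such a request with the standard library without sending it. The function `build_batch_request` is hypothetical, and the `args` parameter is only a placeholder: Griffin's service assembles the real request, including its own arguments, when a job runs.

```python
import json
from urllib.request import Request

# Values copied from sparkProperties.json above.
spark_properties = {
    "file": "hdfs:///griffin/griffin-measure.jar",
    "className": "org.apache.griffin.measure.Application",
    "queue": "default",
    "numExecutors": 2,
    "executorCores": 1,
    "driverMemory": "1g",
    "executorMemory": "1g",
    "conf": {
        "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
    },
    "files": [],
}


def build_batch_request(livy_uri, properties, args=()):
    """Build (but do not send) a POST request for Livy's /batches endpoint."""
    payload = dict(properties)
    if args:
        payload["args"] = list(args)
    body = json.dumps(payload).encode("utf-8")
    return Request(livy_uri, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


request = build_batch_request("http://10.168.138.188:8998/batches",
                              spark_properties)
print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data), indent=2))
```

Passing the request to `urllib.request.urlopen` would submit the batch; that step is left out here on purpose.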

4. service/src/main/resources/env/env_batch.json

{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://10.168.138.188:8020/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://10.168.138.188:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}
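
The ELASTICSEARCH sink's api URL has to point at the index and mapping type created earlier (griffin and accuracy). A quick way to catch a mismatch is to extract both from the URL, as in this sketch; `es_sink_targets` is a name made up for this example, and the dictionary is a trimmed copy of the file above.

```python
from urllib.parse import urlparse


def es_sink_targets(env):
    """Return (index, type) pairs taken from each ELASTICSEARCH sink's api URL."""
    targets = []
    for sink in env.get("sinks", []):
        if sink.get("type", "").upper() == "ELASTICSEARCH":
            path = urlparse(sink["config"]["api"]).path
            parts = [part for part in path.split("/") if part]
            targets.append(tuple(parts[:2]))
    return targets


# Trimmed copy of env_batch.json; only the fields used here are kept.
env_batch = {
    "sinks": [
        {"type": "CONSOLE", "config": {"max.log.lines": 10}},
        {"type": "ELASTICSEARCH",
         "config": {"method": "post",
                    "api": "http://10.168.138.188:9200/griffin/accuracy"}},
    ]
}

print(es_sink_targets(env_batch))
```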

Once the configuration files have been edited, run the following Maven command in IDEA's terminal to compile and package the project:

`mvn -Dmaven.test.skip=true clean install`

Problem: a common error at this step is that some dependency JARs cannot be downloaded.

If the Griffin build fails on kafka-schema-registry-client-3.2.0.jar, download it from:

https://github.com/Xiwu1994/griffin-kafka-schema-registry-client

mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=3.2.0 -Dpackaging=jar -Dfile=kafka-schema-registry-client-3.2.0.jar

If the spark-sql 2.2.1 dependency cannot be downloaded, download the Spark tarball and install the JAR manually:

https://archive.apache.org/dist/spark/

mvn install:install-file -DgroupId=org.apache.spark -DartifactId=spark-sql_2.11 -Dversion=2.2.1 -Dpackaging=jar -Dfile="G:\software\Ambari安裝包\spark2.21\spark-2.2.1-bin-hadoop2.7\jars\spark-sql_2.11-2.2.1.jar"

Note: any JAR that cannot be downloaded can be handled the same way.
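
If several JARs need this treatment, the command can be assembled programmatically so that no flag is forgotten. The sketch below only builds and prints the command line; `install_file_command` is a helper invented for this example, while the Maven flags are the ones used in this post.

```python
import shlex


def install_file_command(group_id, artifact_id, version, jar_path):
    """Build the argument list for mvn install:install-file."""
    return [
        "mvn", "install:install-file",
        "-DgroupId=" + group_id,
        "-DartifactId=" + artifact_id,
        "-Dversion=" + version,
        "-Dpackaging=jar",
        "-Dfile=" + jar_path,
    ]


command = install_file_command(
    "io.confluent", "kafka-schema-registry-client", "3.2.0",
    "kafka-schema-registry-client-3.2.0.jar")

# Quote each part so the printed line can be pasted into a POSIX shell.
print(" ".join(shlex.quote(part) for part in command))
```

The list form can also be handed directly to `subprocess.run(command, check=True)` on a machine where Maven is installed.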

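5. Rename measure-0.6.0.jar to griffin-measure.jar (the name referenced in sparkProperties.json) and upload it to the /griffin directory on HDFS

# Directories required on HDFS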

hadoop fs -mkdir -p /griffin/persist

hadoop fs -mkdir /griffin/checkpoint

6. Run griffin-service.jar to start the Griffin management backend

sysctl -w vm.max_map_count=262144

nohup java -jar griffin-service.jar>service.out 2>&1 &

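Check the log for error messages.

Open the default Apache Griffin UI (Spring Boot's default port is 8080, but we changed it to 8090 earlier). If the page loads, the installation succeeded.

References:

Apache Griffin Getting Started Guide: http://griffin.apache.org/docs/quickstart-cn.html

Apache Griffin Getting Started Guide: https://www.jianshu.com/p/9e4067b3e2dd

Apache Griffin 5.0 Compilation, Installation and Usage: https://blog.csdn.net/github_39577257/article/details/90607081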
