Install and configure the JDK, hosts, and SSH
Install and configure the JDK
Upload the tar.gz package
[root@JD /]# cd /usr/java
[root@JD java]# ll
total 355840
drwxr-xr-x 8 root root 4096 Nov 17 00:24 jdk1.8.0_121
-rw-r--r-- 1 root root 191100510 Nov 14 01:25 jdk1.8.0_121.zip
-rw-r--r-- 1 root root 173271626 Nov 28 14:13 jdk-8u45-linux-x64.gz
[root@JD java]# pwd
/usr/java
Extract it:
[root@JD java]# tar -zxvf jdk-8u45-linux-x64.gz
The JDK directory's owner and group must be corrected (note the raw UID/GID 10:143 below):
[root@JD java]# ll
total 355844
drwxr-xr-x 8 10 143 4096 Apr 11 2015 jdk1.8.0_45
-rw-r--r-- 1 root root 173271626 Nov 28 14:13 jdk-8u45-linux-x64.gz
Fix the owner and group:
[root@JD java]# chown -R root:root jdk1.8.0_45
[root@JD java]# ll
total 355844
drwxr-xr-x 8 root root 4096 Apr 11 2015 jdk1.8.0_45
-rw-r--r-- 1 root root 173271626 Nov 28 14:13 jdk-8u45-linux-x64.gz
Configure the environment variables globally in /etc/profile:
export JAVA_HOME=/usr/java/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$PATH
Apply the environment variables:
[root@JD java]# source /etc/profile
Check that they took effect:
[root@JD java]# which java
/usr/java/jdk1.8.0_45/bin/java
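A tiny sanity check can complement `which java`. The helper below is a sketch (the function name `java_home_first` is made up here): it confirms that a PATH value really puts $JAVA_HOME/bin first, which is what the export lines above intend.

```shell
# Sketch: check that JAVA_HOME/bin leads the PATH (helper name is made up).
java_home_first() {
  jh="$1"; path="$2"
  case "$path" in
    "$jh/bin":*) echo "ok: $jh/bin leads PATH" ;;
    *) echo "warning: $jh/bin is not first on PATH" ;;
  esac
}
# Example: java_home_first /usr/java/jdk1.8.0_45 "$PATH"
```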
Configure hosts
Important: never delete the two system-provided lines at the top.
[root@JD java]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.3 JD
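To guard against accidentally breaking this file later, a quick check can be scripted. This is a sketch (the helper name `check_hosts_file` is made up here); it verifies a hosts file still has the two stock localhost lines plus the cluster hostname entry.

```shell
# Sketch: validate a hosts file (path and hostname are parameters).
check_hosts_file() {
  f="$1"; host="$2"
  grep -q "^127\.0\.0\.1[[:space:]].*localhost" "$f" || { echo "missing IPv4 localhost line"; return 1; }
  grep -q "^::1[[:space:]].*localhost" "$f" || { echo "missing IPv6 localhost line"; return 1; }
  grep -qw "$host" "$f" || { echo "missing $host entry"; return 1; }
  echo "hosts file looks ok"
}
# Example: check_hosts_file /etc/hosts JD
```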
Configure SSH
When configuring SSH as root, there is no need to change the permissions on authorized_keys; as a non-root user you must chmod authorized_keys to 0600, otherwise the SSH setup will not take effect.
Create the user
[root@JD /]# useradd hadoop
[root@JD /]# id hadoop
uid=1002(hadoop) gid=1003(hadoop) groups=1003(hadoop)
Switch to that user:
[root@JD ~]# su - hadoop
Last failed login: Thu Nov 28 12:17:14 CST 2019 from 40.76.65.78 on ssh:notty
There were 114 failed login attempts since the last successful login.
[hadoop@JD ~]$ pwd
/home/hadoop
Generate the key pair:
[hadoop@JD ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:saXNfI0OAlf988yUl7lv0mwHjOaOzFpIdUjo6KOyXz8 hadoop@JD
The key's randomart image is:
+---[RSA 2048]----+
| .o. |
| .o .. |
| .oo + .. +|
| .o.X . oo+o|
| . S = oo.*o|
| o. o +o o.+|
| .... .o. +.|
| . .. .E+ .. . B|
| .+. oo+.. +.|
+----[SHA256]-----+
List the hidden files:
[hadoop@JD ~]$ ll -a
total 12
drwx------ 3 hadoop hadoop 70 Nov 28 15:30 .
drwxr-xr-x. 5 root root 43 Nov 28 15:08 ..
-rw-r--r-- 1 hadoop hadoop 18 Apr 11 2018 .bash_logout
-rw-r--r-- 1 hadoop hadoop 193 Apr 11 2018 .bash_profile
-rw-r--r-- 1 hadoop hadoop 231 Apr 11 2018 .bashrc
drwx------ 2 hadoop hadoop 36 Nov 28 15:30 .ssh
Enter the .ssh directory:
[hadoop@JD ~]$ cd .ssh
Inspect the generated keys:
[hadoop@JD .ssh]$ ll
total 8
-rw------- 1 hadoop hadoop 1675 Nov 28 15:30 id_rsa
-rw-r--r-- 1 hadoop hadoop 391 Nov 28 15:30 id_rsa.pub
Place the public key under the target user's home on the destination machine (here the destination is this same machine, so we just append it locally):
[hadoop@JD .ssh]$ cat id_rsa.pub >> authorized_keys
[hadoop@JD .ssh]$ ll
total 12
-rw-rw-r-- 1 hadoop hadoop 391 Nov 28 15:32 authorized_keys
-rw------- 1 hadoop hadoop 1675 Nov 28 15:30 id_rsa
-rw-r--r-- 1 hadoop hadoop 391 Nov 28 15:30 id_rsa.pub
Test whether SSH works. We still get a password prompt, so it has not taken effect: because this is a non-root user, authorized_keys needs 0600 permissions.
[hadoop@JD .ssh]$ ssh JD date
The authenticity of host 'jd (192.168.0.3)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'jd,192.168.0.3' (ECDSA) to the list of known hosts.
hadoop@jd's password:
Permission denied, please try again.
hadoop@jd's password:
Permission denied, please try again.
hadoop@jd's password:
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Grant 0600 permissions:
[hadoop@JD .ssh]$ chmod 0600 authorized_keys
Test SSH again; it succeeds:
[hadoop@JD .ssh]$ ssh JD date
Thu Nov 28 15:37:31 CST 2019
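The key-setup steps above can be condensed into one script. This is a sketch to run as the user that needs passwordless SSH; `-N ""` creates a key with no passphrase, so adjust if you want one.

```shell
#!/bin/sh
# Condensed passwordless-SSH setup (sketch; run as the target user).
set -e
SSH_DIR="${HOME}/.ssh"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"
# Generate a key pair only if one does not exist yet
[ -f "$SSH_DIR/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$SSH_DIR/id_rsa"
# Authorize the key for logins to this same machine
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
# Mandatory for non-root users, otherwise sshd ignores the file
chmod 600 "$SSH_DIR/authorized_keys"
```

Afterwards `ssh JD date` should run without a password prompt, as shown above.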
Hadoop introduction
Overview
Addresses
hadoop.apache.org
Installation guide: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuration reference: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Apache projects all live at xxx.apache.org, for example spark.apache.org and kafka.apache.org.
What is Hadoop?
Broad sense: the ecosystem centered on the Apache Hadoop software (Hive, Sqoop, Flume, Spark, Flink, HBase, ...). Narrow sense: the open-source Apache Hadoop software itself.
The Apache Hadoop software:
1.x: basically unused. 2.x: the enterprise mainstream, matching the CDH 5.x series. 3.x: experimental adoption, matching the CDH 6.x series.
Tarball location
http://archive.cloudera.com/cdh5/cdh/5/
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2.tar.gz
The hadoop-2.6.0-cdh5.16.2.tar.gz tarball is not just Apache Hadoop 2.6.0: it is Apache Hadoop 2.6.0 plus later patches, roughly equivalent to Apache Hadoop 2.9.
CDH upgrade changelogs:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2-changes.log
http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.16.2-changes.log
The benefit of choosing CDH: version compatibility.
The Hadoop software
HDFS
Responsible for storage.
MapReduce
Responsible for computation: the jobs that mine value from the data. But because development is hard, the complexity is high, the code volume is large, maintenance is difficult, and execution is slow, everyone uses Hive SQL, Spark, or Flink instead.
YARN
Resource scheduling and job scheduling.
Pseudo-distributed HDFS deployment
Upload the tarball and fix its owner and group
As root:
[root@JD /]# ll
total 522100
-rw-r--r-- 1 root root 434354462 Nov 28 14:15 hadoop-2.6.0-cdh5.16.2.tar.gz
[root@JD /]# mv hadoop-2.6.0-cdh5.16.2.tar.gz /home/hadoop/software/
Change the owner and group:
[root@JD /]# chown hadoop:hadoop /home/hadoop/software/*
Check the change:
[root@JD /]# ll /home/hadoop/software
total 424176
-rw-r--r-- 1 hadoop hadoop 434354462 Nov 28 14:15 hadoop-2.6.0-cdh5.16.2.tar.gz
Create the directories and extract the tarball
Switch to the hadoop user:
[root@JD ~]# su - hadoop
Last login: Thu Nov 28 15:27:14 CST 2019 on pts/1
Last failed login: Thu Nov 28 15:36:02 CST 2019 from 192.168.0.3 on ssh:notty
There were 4 failed login attempts since the last successful login.
[hadoop@JD ~]$ ll
total 0
[hadoop@JD ~]$ pwd
/home/hadoop
Create the directories:
[hadoop@JD ~]$ mkdir app software data sourcecode log tmp lib
[hadoop@JD ~]$ ll
total 0
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 app        # extracted software, accessed via symlinks
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 data       # data files
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 lib        # third-party jars
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 log        # log files
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 software   # tarballs
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 sourcecode # source builds
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 tmp        # temporary files
Extract the tarball:
[hadoop@JD software]$ tar -zxvf hadoop-2.6.0-cdh5.16.2.tar.gz -C ../app/
[hadoop@JD software]$ cd ../app
[hadoop@JD app]$ ll
total 4
drwxr-xr-x 14 hadoop hadoop 4096 Jun 3 19:11 hadoop-2.6.0-cdh5.16.2
Create a symlink:
[hadoop@JD app]$ ln -s hadoop-2.6.0-cdh5.16.2 hadoop
[hadoop@JD app]$ ll
total 4
lrwxrwxrwx 1 hadoop hadoop 22 Nov 28 17:32 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 14 hadoop hadoop 4096 Jun 3 19:11 hadoop-2.6.0-cdh5.16.2
Configure the JDK for Hadoop
[hadoop@JD app]$ cd hadoop/etc/hadoop/
[hadoop@JD hadoop]$ vi hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_45
Configuration files
Configure core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://JD:9000</value>
  </property>
</configuration>
Configure hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Configure Hadoop in the hadoop user's own environment variables
[hadoop@JD ~]$ pwd
/home/hadoop
Edit the .bashrc file and add the Hadoop environment variables:
[hadoop@JD ~]$ vi .bashrc
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH
Apply them:
[hadoop@JD ~]$ source .bashrc
Verify:
[hadoop@JD ~]$ which hadoop
~/app/hadoop/bin/hadoop
Format the NameNode
Run the format (it succeeded when you see "successfully"):
[hadoop@JD ~]$ hdfs namenode -format
INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
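Checking for that marker can be scripted when the format runs unattended. This is a sketch (the helper name `format_succeeded` is made up here); it scans the captured output of `hdfs namenode -format` for the success line shown above.

```shell
# Sketch: scan format output for the success marker (helper name is made up).
format_succeeded() {
  case "$1" in
    *"has been successfully formatted"*) echo "format ok" ;;
    *) echo "format FAILED - check the log"; return 1 ;;
  esac
}
# Example: format_succeeded "$(hdfs namenode -format 2>&1)"
```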
First startup
On the first start we still have to answer yes, and the daemons do not start under the configured hostname JD: the DataNode comes up as localhost and the SecondaryNameNode as 0.0.0.0. So we need to fix the DataNode and SecondaryNameNode configuration.
Start:
[hadoop@JD ~]$ start-dfs.sh
19/11/28 17:59:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [JD]
JD: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-JD.out
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-JD.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-JD.out
19/11/28 18:00:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ jps
27185 NameNode
27467 SecondaryNameNode
27309 DataNode
27583 Jps
Configure the DN and SNN to start as JD too
NN: JD, controlled by fs.defaultFS in core-site.xml
DN: configured in the slaves file
SNN: configured in hdfs-site.xml
[hadoop@JD hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
Delete localhost and replace it with JD:
[hadoop@JD hadoop]$ vi slaves
JD
Add these properties:
[hadoop@JD hadoop]$ vi hdfs-site.xml
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>JD:50090</value>
</property>
<property>
<name>dfs.namenode.secondary.https-address</name>
<value>JD:50091</value>
</property>
Restart. Everything now starts as JD, no yes prompts are needed, and jps shows all the processes up. Success:
[hadoop@JD ~]$ start-dfs.sh
19/11/28 18:16:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [JD]
JD: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-JD.out
JD: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-JD.out
Starting secondary namenodes [JD]
JD: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-JD.out
19/11/28 18:17:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ jps
30343 Jps
30027 DataNode
29886 NameNode
30190 SecondaryNameNode
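The jps check can also be scripted. Below is a sketch (the helper name `check_hdfs_daemons` is made up here): given jps-style output, it reports whether the three HDFS daemons are present.

```shell
# Sketch: look for the three HDFS daemons in jps-style output.
check_hdfs_daemons() {
  out="$1"
  for d in NameNode DataNode SecondaryNameNode; do
    # -w avoids "NameNode" matching inside "SecondaryNameNode"
    if printf '%s\n' "$out" | grep -qw "$d"; then
      echo "$d: up"
    else
      echo "$d: DOWN"
    fi
  done
}
# Example on the live machine: check_hdfs_daemons "$(jps)"
```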
Where to find the official default-parameter files
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
The three daemons and the web UI address
namenode: the name node, the boss (master); read and write requests go through it first
datanode: the data node, the worker (slave); stores and retrieves the data
secondarynamenode: the second name node, the second-in-command; its checkpoint lags about an hour behind (h+1)
Big-data components are almost all master/slave architectures, HDFS included
HBase is the exception in one respect: read/write requests do not go through the boss (the master process)
HDFS web UI: http://JD:50070
Common hadoop fs commands
Create a local file:
[hadoop@JD ~]$ echo "111111" > aaa.txt
[hadoop@JD ~]$ ll
total 4
-rw-rw-r-- 1 hadoop hadoop 7 Nov 28 22:06 aaa.txt
drwxrwxr-x 3 hadoop hadoop 48 Nov 28 17:32 app
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 data
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 lib
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 log
drwxrwxr-x 2 hadoop hadoop 42 Nov 28 17:22 software
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 sourcecode
drwxrwxr-x 2 hadoop hadoop 6 Nov 28 17:20 tmp
Upload the file:
[hadoop@JD ~]$ hadoop fs -put aaa.txt /
List files and directories:
[hadoop@JD ~]$ hadoop fs -ls /
-rw-r--r-- 1 hadoop supergroup 7 2019-11-28 22:07 /aaa.txt
Create a directory:
[hadoop@JD ~]$ hadoop fs -mkdir /bigdata
[hadoop@JD ~]$ hadoop fs -ls /
-rw-r--r-- 1 hadoop supergroup 7 2019-11-28 22:07 /aaa.txt
drwxr-xr-x - hadoop supergroup 0 2019-11-28 22:11 /bigdata
Download the file:
[hadoop@JD ~]$ hadoop fs -get /aaa.txt
Delete the file:
[hadoop@JD ~]$ hadoop fs -rm /aaa.txt
[hadoop@JD ~]$ hadoop fs -ls /
drwxr-xr-x - hadoop supergroup 0 2019-11-28 22:11 /bigdata
How to guarantee data quality
This section only covers fixing mismatched row counts; fixing mismatched content is not covered yet.
Storage is the bedrock of big data: even if HDFS never goes down, is the data it holds really accurate?
A Sqoop extraction that reports no errors still has a small probability of dropping data.
And if a storage disk fails, is the data still accurate?
So how do we verify accuracy, and how do we enforce it? A mechanism for this is a must.
Data quality validation: first check that the row counts match, then check that the contents match.
Row-count validation: count(*) on both sides.
If the counts differ, a re-flush mechanism (backfill or delete, run with e.g. Spark) brings them back in line; count mismatches cover roughly 95% of cases. The remaining 5%, where the counts match but the content differs, is checked by sampling.
Query all the columns from the upstream table; from the downstream table only the primary key is needed:
mysql: all columns plus the primary key, table a
phoenix: the primary key only, table b
a FULL OUTER JOIN b
One side can have extra rows and the other missing rows. For example, upstream might hold ids 1,3,5,6 while downstream holds 1,3,5: a missing row has not been synchronized downstream yet, an extra row has not been deleted yet, or some other bug caused the difference.
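The count gate described above can be sketched as a small helper (the name `compare_counts` is made up here; feed it the real numbers from your mysql and phoenix count(*) queries).

```shell
# Sketch: the first gate of the quality check - compare two row counts.
compare_counts() {
  up="$1"; down="$2"
  if [ "$up" -eq "$down" ]; then
    echo "counts match ($up rows)"
  else
    echo "count mismatch: upstream=$up downstream=$down -> run the re-flush job"
  fi
}
# Example: compare_counts 1356 1355
```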
Upstream table a:
ID NAME AGE
1 xxx1 11
2 xxx2 12
3 xxx3 13
7 xxx7 17
Downstream table b:
ID NAME AGE
1 xxx1 11
3 xxx3 13
5 xxx5 15
6 xxx6 16
Data re-flush mechanism: the count check has shown the upstream and downstream data disagree.
Introduce the re-flush mechanism: FULL OUTER JOIN the two tables and compare which side's columns come back null.
Perform the full outer join:
select a.id, a.name, a.age, b.id from a full outer join b on a.id = b.id
The result is:
ID NAME AGE ID
1 xxx1 11 1
2 xxx2 12 null
3 xxx3 13 3
7 xxx7 17 null
null null null 5
null null null 6
Then select the rows whose a.id is null: a null a.id means the downstream has rows the upstream does not, so those downstream rows must be deleted.
Select the join-result rows where a.id is null; the extra downstream rows are deleted by their b.id:
delete from b where id = 5
delete from b where id = 6
The final SQL is:
delete from b where id in (
  select b.id from a
  right join b
  on a.id = b.id
  where a.id is null
)
Select the rows whose b.id is null: a null b.id means the downstream is missing rows, so they must be backfilled from the upstream columns.
Select the join-result rows where b.id is null; the missing downstream rows are rebuilt from all the upstream columns:
insert into b values (2, 'xxx2', 12)
insert into b values (7, 'xxx7', 17)
The final SQL is:
insert into b
select a.id, a.name, a.age from a
left join b
on a.id = b.id
where b.id is null
A FULL OUTER JOIN is effectively the combination of a LEFT JOIN and a RIGHT JOIN; the rows that come back with nulls are exactly the extra or missing ones.
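For small key extracts, the same full-outer-join comparison can be eyeballed on the command line: GNU coreutils `join` with `-a1 -a2` keeps unpaired rows from both sides, and `-e NULL` fills the gaps, mirroring the SQL result above. A sketch with the sample ids from the tables:

```shell
# Sketch: emulate the FULL OUTER JOIN on two sorted key files with coreutils join.
tmp=$(mktemp -d)
printf '1 xxx1\n2 xxx2\n3 xxx3\n7 xxx7\n' > "$tmp/a.txt"   # upstream ids
printf '1 xxx1\n3 xxx3\n5 xxx5\n6 xxx6\n' > "$tmp/b.txt"   # downstream ids
# -a1 -a2: keep unpaired rows from both files; -e NULL -o: fill missing fields
join -a1 -a2 -e NULL -o 0,1.2,2.2 "$tmp/a.txt" "$tmp/b.txt"
```

Rows like `2 xxx2 NULL` are the ones to backfill downstream, and rows like `5 NULL xxx5` are the ones to delete, matching the two SQL statements above.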