hadoop+zookeeper+hbase+hive

 

Hadoop Installation and Configuration

1. Cluster Deployment Overview

1.1 Hadoop Overview

Hadoop is an open-source distributed computing platform under the Apache Software Foundation. Built around the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce), Hadoop gives users a distributed infrastructure whose low-level details are transparent.

A Hadoop cluster has two kinds of roles: Master and Slave. An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode acts as the master server, managing the file system namespace and client access to the file system, while the DataNodes manage the data stored on their nodes. The MapReduce framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node. The master schedules all the tasks that make up a job across the slave nodes, monitors their execution, and re-runs any tasks that fail; the slaves only execute the tasks assigned by the master. When a job is submitted, the JobTracker receives the job and its configuration, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers as they run.

As the description above shows, HDFS and MapReduce together form the core of the Hadoop distributed architecture. HDFS provides the distributed file system across the cluster, and MapReduce provides distributed computation and task processing on top of it. HDFS supplies file storage and I/O during MapReduce processing, while MapReduce handles task distribution, tracking, and execution on top of HDFS and collects the results; together they carry out the main work of a Hadoop cluster.

The Hadoop ecosystem:

1.1.1 HDFS (Hadoop Distributed File System)

HDFS originates from Google's GFS paper, published in October 2003; HDFS is a clone of GFS.

HDFS is the foundation of data storage and management in the Hadoop stack. It is a highly fault-tolerant system that can detect and respond to hardware failures, designed to run on low-cost commodity hardware.

HDFS simplifies the file consistency model and, through streaming data access, provides high-throughput access to application data, making it suitable for applications with large data sets.

It provides a write-once, read-many access model; data is stored as blocks that are distributed across different physical machines in the cluster.

1.1.2 MapReduce (distributed computing framework)

MapReduce originates from Google's MapReduce paper, published in December 2004; Hadoop MapReduce is a clone of Google MapReduce.

MapReduce is a distributed computing model for processing large volumes of data. It hides the details of the distributed framework and abstracts computation into two phases, map and reduce.

Map applies a given operation to independent elements of the data set and produces intermediate results as key-value pairs. Reduce then aggregates all the values that share the same key in the intermediate results to produce the final output.

MapReduce is well suited to data processing in a distributed, parallel environment made up of a large number of machines.
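As a concrete illustration, the sketch below runs the streaming example from the Hadoop documentation once the cluster built later in this guide is up. The input path /movie/tt.txt and output path /tmp/streaming-out are placeholders, and the streaming jar version must match your installation; the identity mapper emits each input record as a key-value pair, the framework sorts and groups by key, and the reducer aggregates the grouped stream.

-------------------------------------------------------------------
# identity mapper (/bin/cat) + aggregating reducer (/usr/bin/wc)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar \
  -input  /movie/tt.txt \
  -output /tmp/streaming-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
hadoop fs -cat /tmp/streaming-out/part-*   # inspect the reduce output
-------------------------------------------------------------------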

1.1.3 HBase (distributed column-oriented database)

HBase originates from Google's Bigtable paper, published in November 2006; HBase is a clone of Google Bigtable.

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data, built on top of HDFS.

HBase adopts Bigtable's data model: an enhanced sparse, sorted key/value map in which the key is composed of a row key, a column key, and a timestamp.

HBase provides random, real-time read/write access to large-scale data, and the data stored in HBase can be processed with MapReduce, combining data storage and parallel computation.

1.1.4 ZooKeeper (distributed coordination service)

ZooKeeper originates from Google's Chubby paper, published in November 2006; ZooKeeper is a clone of Chubby.

It solves data management problems in distributed environments: unified naming, state synchronization, cluster management, configuration synchronization, and so on.

Many Hadoop components depend on ZooKeeper; it runs on the cluster and is used to coordinate Hadoop operations.

1.1.5 Hive (data warehouse)

Hive was open-sourced by Facebook and was originally built to analyze massive amounts of structured log data.

Hive defines a SQL-like query language (HQL) and translates SQL into MapReduce jobs that run on Hadoop. It is typically used for offline analysis.

HQL is used to run queries over data stored in Hadoop; Hive lets developers who are not familiar with MapReduce write data queries, which are then translated into MapReduce jobs on Hadoop.

1.1.6 Pig (ad-hoc scripting)

Pig was open-sourced by Yahoo!; its design goal is to provide an ad-hoc (the computation happens at query time) data analysis tool based on MapReduce.

Pig defines a data-flow language, Pig Latin, which is an abstraction over the complexity of MapReduce programming. The Pig platform consists of a runtime environment and the scripting language (Pig Latin) for analyzing Hadoop data sets.

Its compiler translates Pig Latin into a sequence of MapReduce programs, so a script is executed as MapReduce jobs on Hadoop. It is typically used for offline analysis.

1.1.7 Sqoop (data ETL/sync tool)

Sqoop is short for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. The imports and exports are essentially MapReduce programs, taking full advantage of MapReduce's parallelism and fault tolerance.

Sqoop uses database metadata to describe the data schema and transfers data between relational databases, data warehouses, and Hadoop.

1.1.8 Flume (log collection tool)

Flume is a log collection system open-sourced by Cloudera; it is distributed, highly reliable, highly fault-tolerant, and easy to customize and extend.

It abstracts the path of data from production through transport and processing to the final destination as a data flow. Within a data flow, the data source can be customized, so data arriving over various protocols can be collected.

Flume data flows also support simple processing of the log data, such as filtering and format conversion, and Flume can write logs to a variety of (customizable) targets.

Overall, Flume is a scalable log collection system for large volumes of data that suits complex environments. It can, of course, also be used to collect other types of data.

1.1.9 Mahout (data mining algorithm library)

Mahout started in 2008 as a subproject of Apache Lucene; it grew rapidly and is now a top-level Apache project.

Mahout's main goal is to provide scalable implementations of classic machine learning algorithms, helping developers build intelligent applications more easily and quickly.

Mahout already includes widely used data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining.

In addition to the algorithms, Mahout provides supporting infrastructure such as data input/output tools and integration with other storage systems (databases, MongoDB, Cassandra, and so on).

1.1.10 Oozie (workflow scheduler)

Oozie is an extensible workflow system integrated into the Hadoop stack that coordinates the execution of multiple MapReduce jobs. It can manage complex workflows triggered by external events, including scheduled times and the arrival of data.

An Oozie workflow is a set of actions (for example Hadoop Map/Reduce jobs or Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) that specifies the order in which the actions run.

Oozie describes this graph in hPDL, an XML process definition language.

1.1.11 YARN (distributed resource manager)

YARN is the next generation of MapReduce, i.e. MRv2. It evolved from the first-generation MapReduce mainly to address the poor scalability of the original Hadoop and its lack of support for multiple computing frameworks. YARN is the next-generation Hadoop computing platform: a general-purpose runtime framework in which users can run computing frameworks of their own. A user-written framework is packaged as a client-side library and bundled when the application is submitted. YARN provides the following:

  • Resource management: both application management and machine resource management

  • Two-level resource scheduling

  • Fault tolerance: built into each component

  • Scalability: up to tens of thousands of nodes

Mesos (distributed resource manager)

Mesos began as a research project at UC Berkeley and is now an Apache project; several companies, such as Twitter, use Mesos to manage cluster resources.

Like YARN, Mesos is a platform for unified resource management and scheduling, and it likewise supports multiple computing frameworks such as MapReduce and streaming.

1.1.12 Tachyon (distributed in-memory file system)

Tachyon (/'tæki:ˌɒn/, a hypothetical faster-than-light particle) is a memory-centric distributed file system with high performance and fault tolerance.

It provides reliable, memory-speed file sharing for cluster frameworks such as Spark and MapReduce.

Tachyon was created at UC Berkeley's AMPLab.

1.1.13 Tez (DAG computation model)

Tez is an Apache open-source computing framework that supports DAG jobs. It derives directly from the MapReduce framework; its core idea is to split the Map and Reduce operations further:

Map is decomposed into Input, Processor, Sort, Merge, and Output, while Reduce is decomposed into Input, Shuffle, Sort, Merge, Processor, and Output.

These decomposed primitive operations can then be combined flexibly into new operations, which, assembled by a controller, can form a large DAG job.

Hive currently supports both the MR and Tez execution engines; Tez is compatible with existing MapReduce programs and improves their performance.

1.1.14 Spark (in-memory DAG computation model)

Spark is an Apache project billed as "lightning-fast cluster computing". It has a thriving open-source community and is currently one of the most active Apache projects.

Spark was originally open-sourced by UC Berkeley's AMPLab as a general-purpose parallel computing framework in the spirit of Hadoop MapReduce.

Spark provides a faster, more general data processing platform. Compared with Hadoop, Spark can run programs up to 100x faster in memory, or 10x faster on disk.

1.1.15 Giraph (graph computation model)

Apache Giraph is a scalable, distributed, iterative graph processing system built on the Hadoop platform, inspired by BSP (bulk synchronous parallel) and Google's Pregel.

It originated at Yahoo, which built Giraph on the principles of the 2010 Google paper "Pregel: A System for Large-Scale Graph Processing" and later donated it to the Apache Software Foundation.

Giraph is now an open-source Apache project, freely available to everyone, with backing from Facebook and improvements from many contributors.

1.1.16 GraphX (graph computation model)

Spark GraphX began as a distributed graph computation framework at Berkeley's AMPLab; it is now integrated into the Spark runtime and provides BSP-style, large-scale parallel graph computation.

1.1.17 MLlib (machine learning library)

Spark MLlib is a machine learning library that provides a wide range of algorithms for classification, regression, clustering, collaborative filtering, and more, all running on a cluster.

1.1.18 Spark Streaming (stream computation model)

Spark Streaming supports real-time processing of streaming data, computing over live data in micro-batches.

1.1.19 Kafka (distributed message queue)

Kafka is a messaging system open-sourced by LinkedIn in December 2010, used mainly for processing active streaming data.

Active streaming data is very common in web applications: page views, what content users visited, what they searched for, and so on.

Such data is usually recorded as logs and then processed statistically at regular intervals.

1.1.20 Phoenix (SQL interface for HBase)

Apache Phoenix is a SQL driver for HBase. It lets HBase be accessed through JDBC and translates your SQL queries into HBase scans and the corresponding operations.

1.1.21 Ranger (security management tool)

Apache Ranger is a permission framework for Hadoop clusters. It provides operation, monitoring, and management of complex data permissions, with a centralized administration mechanism covering data access across the YARN-based Hadoop ecosystem.

1.1.22 Knox (Hadoop security gateway)

Apache Knox is a REST API gateway for accessing Hadoop clusters. It provides a single access point for all REST calls and handles the three A's (Authentication, Authorization, Auditing) as well as SSO (single sign-on).

1.1.23 Falcon (data lifecycle management tool)

Apache Falcon is a data processing and management platform for Hadoop, designed for data movement, data pipeline coordination, lifecycle management, and data discovery. It lets end users quickly "onboard" their data and the associated processing and management tasks onto a Hadoop cluster.

1.1.24 Ambari (deployment and configuration management tool)

Apache Ambari creates, manages, and monitors Hadoop clusters; it is a web tool that makes Hadoop and the related big data software easier to use.

1.2 Environment

My environment is set up in virtual machines. The Hadoop cluster contains 3 nodes: 1 master and 2 slaves, connected over a LAN and able to ping each other. The node IP addresses are:

OS version   Hostname   IP               Role
centos7.0    nodea      192.168.47.136   master
centos7.0    nodeb      192.168.47.137   slave
centos7.0    nodec      192.168.47.138   slave

Big data platform versions

Name        Version   Download URL                                        Description
hadoop      2.8.0     https://archive.apache.org/dist/hadoop/core/        big data platform
zookeeper   3.4.10    http://mirror.bit.edu.cn/apache/zookeeper/          distributed coordination
hbase       1.2.6     http://mirror.bit.edu.cn/apache/hbase/              database on HDFS
hive        2.3.3     http://mirrors.hust.edu.cn/apache/hive/stable-2/    data warehouse

1.2.1 Downloads

Required packages: hadoop, hbase, zookeeper, and the JDK.

1. JDK download: http://pan.baidu.com/s/1i5NpImx

2. hadoop download: https://archive.apache.org/dist/hadoop/core/

3. hbase download: http://mirror.bit.edu.cn/apache/hbase/

4. zookeeper download: http://mirror.bit.edu.cn/apache/zookeeper/

5. hive download: http://mirrors.hust.edu.cn/apache/hive/stable-2/

1.3 Environment Configuration

1.3.1 Set the hostname

[root@nodea hadoop]# vi /etc/hostname
-------------------------------------------------------------------
Change the file content to:
nodea
-------------------------------------------------------------------

[root@nodea hadoop]# vi /etc/sysconfig/network
-------------------------------------------------------------------
HOSTNAME=nodea
-------------------------------------------------------------------

[root@nodea hadoop]# vi /etc/hosts
-------------------------------------------------------------------
Add:
192.168.47.129 	nodeb
192.168.47.131 	nodec
192.168.47.130 	nodea

Remove:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
-------------------------------------------------------------------

If 127.0.0.1 is not removed, uploading files to HDFS after Hadoop starts will fail.
The error looks like this:

2018-06-11 10:43:33,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: nodea/192.168.47.130:8020

2018-06-11 10:43:55,009 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: nodea/192.168.47.130:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

2018-06-11 10:43:56,012 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: nodea/192.168.47.130:8020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

Repeat the steps above on nodeb and nodec, changing the hostname to nodeb and nodec respectively.

1.3.2 Install the JDK

1. Check for an existing JDK:

rpm -qa | grep jdk      # list all installed JDKs
-------------------------------------------------------------------
java-1.6.0-openjdk-1.6.0.0-1.66.1.13.0.el6.i686
-------------------------------------------------------------------
# remove the bundled OpenJDK
rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.66.1.13.0.el6.i686

Upload the JDK archive, extract it to /opt/jdk8, and configure the environment variables in /etc/profile:

-------------------------------------------------------------------
export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
-------------------------------------------------------------------
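A quick way to confirm the JDK setup (a minimal sketch; paths assume the /opt/jdk8 layout used above):

-------------------------------------------------------------------
source /etc/profile     # reload the environment variables
echo $JAVA_HOME         # should print /opt/jdk8
java -version           # should report the uploaded JDK, not the removed OpenJDK
which java              # should resolve to /opt/jdk8/bin/java
-------------------------------------------------------------------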

2. Passwordless SSH Configuration

2.1 Generate key pairs on nodea, nodeb, and nodec

[root@nodea hadoop]# ssh-keygen -t rsa         # just press Enter at every prompt
Generating public/private rsa key pair.
Enter file in which to save the key (/home/test/.ssh/id_rsa): 
Created directory '/home/test/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/test/.ssh/id_rsa.
Your public key has been saved in /home/test/.ssh/id_rsa.pub.
The key fingerprint is:
3a:da:74:04:44:ad:2d:7d:df:b1:f6:5f:96:1b:f6:b2 [email protected]
The key's randomart image is:
+--[ RSA 2048]----+
|     .o.         |
|     .  .        |
|      .+         |
|      o.o .   .  |
|       .S. . . o |
|       o    . + .|
|      + .    . =o|
|     + o      o.*|
|    . .       E++|
+-----------------+

--------------------------------------------------

[root@nodea hadoop]# ls -la .ssh  
total 16  
drwx------  2 usera usera 4096 Aug 24 09:22 .  
drwxrwx--- 12 usera usera 4096 Aug 24 09:22 ..  
-rw-------  1 usera usera 1675 Aug 24 09:22 id_rsa  
-rw-r--r--  1 usera usera  399 Aug 24 09:22 id_rsa.pub  

2.2 Passwordless login configuration

2.2.1 Passwordless login from nodea to nodeb

  • Run on nodea:
[root@nodea hadoop]# ssh-copy-id root@nodeb	# user@hostname (or IP)

The authenticity of host '10.124.84.20 (10.124.84.20)' can't be established.  
RSA key fingerprint is f0:1c:05:40:d3:71:31:61:b6:ad:7c:c2:f0:85:3c:cf.  
Are you sure you want to continue connecting (yes/no)? yes  ## when asked whether to continue connecting, type yes
Warning: Permanently added '10.124.84.20' (RSA) to the list of known hosts.  
[email protected]'s password:      ## enter the remote node's root password
Now try logging into the machine, with "ssh '[email protected]'", and check in:  
  
  .ssh/authorized_keys  
  
to make sure we haven't added extra keys that you weren't expecting.

2.2.2 Passwordless login from nodeb to nodea

  • Run on nodeb:
[root@nodeb hadoop]# ssh-copy-id root@nodea	# user@hostname (or IP)

The authenticity of host '10.124.84.20 (10.124.84.20)' can't be established.  
RSA key fingerprint is f0:1c:05:40:d3:71:31:61:b6:ad:7c:c2:f0:85:3c:cf.  
Are you sure you want to continue connecting (yes/no)? yes  ## when asked whether to continue connecting, type yes
Warning: Permanently added '10.124.84.20' (RSA) to the list of known hosts.  
[email protected]'s password:      ## enter the remote node's root password
Now try logging into the machine, with "ssh '[email protected]'", and check in:  
  
  .ssh/authorized_keys  
  
to make sure we haven't added extra keys that you weren't expecting.

At this point the public key of root on nodeb has been appended to the .ssh/authorized_keys file of root on nodea.

Once this is done, passwordless login works.

Configure mutual passwordless login among all three machines in this way, e.g. nodea -> nodea, nodea -> nodeb, nodea -> nodec, as in the sketch below.
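For example, assuming the /etc/hosts entries from 1.3.1, the whole mesh can be set up by running this small loop once on each of the three nodes:

-------------------------------------------------------------------
# run once on nodea, once on nodeb, and once on nodec;
# it pushes the local root public key to all three hosts
for host in nodea nodeb nodec; do
  ssh-copy-id root@$host      # enter the root password of $host when prompted
done
-------------------------------------------------------------------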

2.3 Test passwordless login

Run ssh nodeb. If you can connect without being asked for nodeb's password, the configuration succeeded; otherwise delete the .ssh directory and configure it again.

2.4 Disable the firewall

  • If the Linux system is older than CentOS 7.0, run:
# temporarily stop the firewall (lost after reboot)
service iptables stop
# disable the firewall at boot (takes effect after reboot)
chkconfig iptables off
# check firewall status
service iptables status
  • If the Linux system is CentOS 7.0 or later, run:
# temporarily stop
systemctl stop firewalld
# disable at boot
systemctl disable firewalld
# check firewall status
systemctl status firewalld

3. Building the Hadoop Cluster

3.1 Create directories

/opt/apps /root/hadoop/tmp /root/hadoop/var /root/hadoop/dfs/name /root/hadoop/dfs/data

These directories must be created on nodea, nodeb, and nodec, for example with the loop below.
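Once passwordless SSH is in place, a small sketch like this, run from nodea, creates the same layout on all three nodes (hostnames assume the /etc/hosts entries above):

-------------------------------------------------------------------
for host in nodea nodeb nodec; do
  ssh root@$host "mkdir -p /opt/apps /root/hadoop/tmp /root/hadoop/var /root/hadoop/dfs/name /root/hadoop/dfs/data"
done
-------------------------------------------------------------------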

3.2 Download and upload the Hadoop distribution

I chose hadoop-2.8.0 and uploaded it to the /opt/resource directory.

[root@nodea resource]# tar -zxvf hadoop-2.8.0.tar.gz   # extract into the current directory

[root@nodea resource]# mv /opt/resource/hadoop-2.8.0 /opt/apps/hadoop-2.8.0

3.3 Configure environment variables

[root@nodea hadoop]# vi /etc/profile		# edit the environment variables
-------------------------------------------------------------------
export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH



export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
-------------------------------------------------------------------

[root@nodea hadoop]# source /etc/profile  # take effect immediately

The steps above only need to be performed on nodea for now; nodeb and nodec are set up in 3.5 and 3.6 below.

3.4 Hadoop configuration files

All of the configuration files below live in /opt/apps/hadoop-2.8.0/etc/hadoop.

  • Edit hadoop-env.sh
-------------------------------------------------------------------
Change export JAVA_HOME=${JAVA_HOME} to:
export JAVA_HOME=/opt/jdk8
-------------------------------------------------------------------
  • Edit yarn-env.sh
-------------------------------------------------------------------
Set JAVA_HOME to:
export JAVA_HOME=/opt/jdk8
-------------------------------------------------------------------
  • core-site.xml
-------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <!-- URI of the NameNode's HDFS filesystem; it can be a host:port pair, or the logical name of an HA NameNode service backed by multiple NameNodes -->
   <property>

        <name>fs.default.name</name>

        <value>hdfs://nodea:9000</value>

   </property>
    <!-- Base directory for Hadoop's other temporary directories -->
 <property>

        <name>hadoop.tmp.dir</name>

        <value>/root/hadoop/tmp</value>

        <description>Abase for other temporary directories.</description>

   </property>

 <!-- Size of the buffer used for file I/O -->
    <property>
   
        <name>io.file.buffer.size</name>

        <value>4096</value>

    </property>
</configuration>

-------------------------------------------------------------------
  • hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<!-- Where the NameNode stores its data, i.e. the HDFS metadata (namespace and transaction logs) -->
   <name>dfs.name.dir</name>

   <value>/root/hadoop/dfs/name</value>

   <description>Path on the local filesystem where theNameNode stores the namespace and transactions logs persistently.</description>

</property>

<property>
<!-- Where the DataNode stores its data, i.e. the directory that holds the blocks -->
   <name>dfs.data.dir</name>

   <value>/root/hadoop/dfs/data</value>

   <description>Comma separated list of paths on the localfilesystem of a DataNode where it should store its blocks.</description>

</property>

 <!-- HDFS replication factor: how many copies of each block are kept after a file is split into blocks -->
<property>

   <name>dfs.replication</name>

   <value>2</value>

</property>

<property>
<!-- When set to false, HDFS does not check permissions when creating files. To guard against accidental deletion, set it to true or remove this property (the default is true). -->
      <name>dfs.permissions</name>

      <value>false</value>

      <description>need not permissions</description>

</property>
 <property>
          <!-- HDFS block size, set here to 128 MB -->
        <name>dfs.block.size</name>
        <value>134217728</value>
    </property>
<property>
   <!-- Address of the HDFS web UI; the default port is 50070 -->
        <name>dfs.http.address</name>
        <value>nodea:50070</value>
    </property>
    <property>
        <name>dfs.secondary.http.address</name>
        <value>nodea:50090</value>
    </property>
    <property>
    <!-- Enable the WebHDFS REST interface -->
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    
</configuration>

-------------------------------------------------------------------
  • mapred-site.xml
mv mapred-site.xml.template mapred-site.xml   
vi mapred-site.xml

-------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>

   <name>mapred.job.tracker</name>

   <value>nodea:49001</value>

</property>

<property>

      <name>mapred.local.dir</name>

       <value>/root/hadoop/var</value>

</property>

<property>
 <!-- Run MapReduce on YARN: in Hadoop 2.x, MapReduce jobs run on the YARN resource management system -->
       <name>mapreduce.framework.name</name>

       <value>yarn</value>

</property>

 <property>
  <!-- Address of the MapReduce JobHistory web UI -->
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>nodea:19888</value>
    </property>
</configuration>



-------------------------------------------------------------------

  • yarn-site.xml
-------------------------------------------------------------------
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
<property>

        <name>yarn.resourcemanager.hostname</name>

        <value>nodea</value>

   </property>

   <property>

        <description>The address of the applications manager interface in the RM.</description>
        <!-- IPC address of the ResourceManager -->
        <name>yarn.resourcemanager.address</name>

        <value>${yarn.resourcemanager.hostname}:8032</value>

   </property>

   <property>

        <description>The address of the scheduler interface.</description>
        <!-- IPC address of the ResourceManager scheduler -->
        <name>yarn.resourcemanager.scheduler.address</name>

        <value>${yarn.resourcemanager.hostname}:8030</value>

   </property>

   <property>

        <description>The http address of the RM web application.</description>
        <!-- HTTP address of the ResourceManager web UI -->
        <name>yarn.resourcemanager.webapp.address</name>

        <value>${yarn.resourcemanager.hostname}:8088</value>

   </property>

   <property>

        <description>The https adddress of the RM web application.</description>
        <!-- HTTPS address of the ResourceManager web UI -->
        <name>yarn.resourcemanager.webapp.https.address</name>

        <value>${yarn.resourcemanager.hostname}:8090</value>

   </property>

   <property>
        <!-- Resource-tracker address NodeManagers use to report to the ResourceManager -->
        <name>yarn.resourcemanager.resource-tracker.address</name>

        <value>${yarn.resourcemanager.hostname}:8031</value>

   </property>

   <property>

        <description>The address of the RM admin interface.</description>
        <!-- IPC address of the ResourceManager admin interface -->
        <name>yarn.resourcemanager.admin.address</name>

        <value>${yarn.resourcemanager.hostname}:8033</value>

   </property>

   <property>

        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

   </property>

   <property>
        <!-- Maximum amount of memory a single container may request -->
        <name>yarn.scheduler.maximum-allocation-mb</name>

        <value>2048</value>

        <description>The default is 8192 MB</description>

   </property>

   <property>
        <!-- Maximum ratio by which a task's virtual memory usage may exceed its physical memory allocation -->
        <name>yarn.nodemanager.vmem-pmem-ratio</name>

        <value>2.1</value>

   </property>

   <property>
        <!-- Physical memory available to containers on this node -->
        <name>yarn.nodemanager.resource.memory-mb</name>

        <value>2048</value>

</property>

   <property>
        <!-- Skip the virtual-memory check. This is useful when running in virtual machines and avoids problems later; on physical machines with plenty of memory it can be removed. -->
        <name>yarn.nodemanager.vmem-check-enabled</name>

        <value>false</value>

</property>

</configuration>

-------------------------------------------------------------------
  • slaves (etc/hadoop/slaves)
-------------------------------------------------------------------
nodeb
nodec
-------------------------------------------------------------------

3.5 Copy the installation to the slaves

[root@nodea hadoop]# scp -r /opt/apps/hadoop-2.8.0 root@nodeb:/opt/apps/
[root@nodea hadoop]# scp -r /opt/apps/hadoop-2.8.0 root@nodec:/opt/apps/

3.6 Configure environment variables on nodeb and nodec

[root@nodeb hadoop]# vi /etc/profile			# configure the environment variables
-------------------------------------------------------------------
export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH


export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

-------------------------------------------------------------------

[root@nodea tmp]# source /etc/profile

[root@nodea tmp]# hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /opt/hadoop/apps/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar

4. Running Hadoop

4.1 Format the NameNode

On nodea, run hadoop namenode -format to format HDFS:

[root@nodea hadoop]# cd /opt/apps/hadoop-2.8.0/bin
[root@nodea hadoop]# ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

18/06/11 19:41:36 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = nodeb/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.1
STARTUP_MSG:   classpath = /opt/hadoop/apps/hadoop-2.7.1/etc/hadoop:/opt/hadoop/apps/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar..........................
 compiled by 'jenkins' on 2015-06-29T06:04Z
STARTUP_MSG:   java = 1.7.0_67
************************************************************/
18/06/11 19:41:36 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/06/11 19:41:36 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-93f976dc-7a83-43c6-bb55-38c6f900b76e
18/06/11 19:41:38 INFO namenode.FSNamesystem: No KeyProvider found.
18/06/11 19:41:38 INFO namenode.FSNamesystem: fsLock is fair:true
18/06/11 19:41:38 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
18/06/11 19:41:38 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
18/06/11 19:41:38 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
18/06/11 19:41:38 INFO blockmanagement.BlockManager: The block deletion will start around 2018 Jun 11 19:41:38
18/06/11 19:41:38 INFO util.GSet: Computing capacity for map BlocksMap
18/06/11 19:41:38 INFO util.GSet: VM type       = 64-bit
18/06/11 19:41:38 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
18/06/11 19:41:38 INFO util.GSet: capacity      = 2^21 = 2097152 entries
18/06/11 19:41:38 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
18/06/11 19:41:38 INFO blockmanagement.BlockManager: defaultReplication         = 2
18/06/11 19:41:38 INFO blockmanagement.BlockManager: maxReplication             = 512
18/06/11 19:41:38 INFO blockmanagement.BlockManager: minReplication             = 1
18/06/11 19:41:38 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
18/06/11 19:41:38 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
18/06/11 19:41:38 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
18/06/11 19:41:38 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
18/06/11 19:41:38 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
18/06/11 19:41:38 INFO namenode.FSNamesystem: fsOwner             = root (auth:SIMPLE)
18/06/11 19:41:38 INFO namenode.FSNamesystem: supergroup          = supergroup
18/06/11 19:41:38 INFO namenode.FSNamesystem: isPermissionEnabled = false
18/06/11 19:41:38 INFO namenode.FSNamesystem: HA Enabled: false
18/06/11 19:41:38 INFO namenode.FSNamesystem: Append Enabled: true
18/06/11 19:41:38 INFO util.GSet: Computing capacity for map INodeMap
18/06/11 19:41:38 INFO util.GSet: VM type       = 64-bit
18/06/11 19:41:38 INFO util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
18/06/11 19:41:38 INFO util.GSet: capacity      = 2^20 = 1048576 entries
18/06/11 19:41:38 INFO namenode.FSDirectory: ACLs enabled? false
18/06/11 19:41:38 INFO namenode.FSDirectory: XAttrs enabled? true
18/06/11 19:41:38 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
18/06/11 19:41:38 INFO namenode.NameNode: Caching file names occuring more than 10 times
18/06/11 19:41:38 INFO util.GSet: Computing capacity for map cachedBlocks
18/06/11 19:41:38 INFO util.GSet: VM type       = 64-bit
18/06/11 19:41:38 INFO util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
18/06/11 19:41:38 INFO util.GSet: capacity      = 2^18 = 262144 entries
18/06/11 19:41:38 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
18/06/11 19:41:38 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
18/06/11 19:41:38 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
18/06/11 19:41:38 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
18/06/11 19:41:38 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
18/06/11 19:41:38 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
18/06/11 19:41:38 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/06/11 19:41:38 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/06/11 19:41:38 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/06/11 19:41:38 INFO util.GSet: VM type       = 64-bit
18/06/11 19:41:38 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
18/06/11 19:41:38 INFO util.GSet: capacity      = 2^15 = 32768 entries
18/06/11 19:41:38 INFO namenode.FSImage: Allocated new BlockPoolId: BP-2047832718-127.0.0.1-1528771298722
18/06/11 19:41:38 INFO common.Storage: Storage directory /opt/hadoop/apps/hadoopdata/hdfs/name has been successfully formatted.
18/06/11 19:41:39 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/06/11 19:41:39 INFO util.ExitUtil: Exiting with status 0
18/06/11 19:41:39 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nodea/127.0.0.1
************************************************************/

4.2 Start the services

There are two ways to start the cluster.

4.2.1 Start everything at once

[root@nodea tmp]# cd /opt/apps/hadoop-2.8.0/sbin/
[root@nodea tmp]# ./start-all.sh    # start all services

  • On nodea:
[root@nodea sbin]# jps    # master
11952 ResourceManager
11532 NameNode
11775 SecondaryNameNode
13233 Jps
  • On nodeb and nodec:
[root@nodeb sbin]# jps
10626 DataNode
10928 Jps
10795 NodeManager

4.2.2 Start the services individually

  • Start the NameNode on nodea and the DataNodes on nodeb and nodec:
nodea:/opt/apps/hadoop-2.8.0/sbin/hadoop-daemon.sh start namenode

nodeb:/opt/apps/hadoop-2.8.0/sbin/hadoop-daemon.sh start datanode

nodec:/opt/apps/hadoop-2.8.0/sbin/hadoop-daemon.sh start datanode

  • Start YARN:
nodea:/opt/apps/hadoop-2.8.0/sbin/start-yarn.sh 
  • Start HDFS:
nodea:/opt/apps/hadoop-2.8.0/sbin/start-dfs.sh

5. Testing Hadoop

5.1 Test HDFS

HDFS web UI: http://nodea:50070 

Upload a file with the put command:

# create a folder on HDFS (the HBase integration below also uses it)
hadoop fs -mkdir -p /movie
# upload a file
hadoop fs -put /opt/tt.txt /movie
# list the files under /movie
hadoop fs -ls /movie
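A few extra commands (a sketch using the same file names as above) confirm that the upload round-trips correctly:

-------------------------------------------------------------------
hadoop fs -cat /movie/tt.txt                  # print the file content straight from HDFS
hadoop fs -du -h /movie                       # sizes of the files under /movie
hadoop fs -get /movie/tt.txt /tmp/tt.txt && diff /opt/tt.txt /tmp/tt.txt   # round-trip check
-------------------------------------------------------------------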

 

5.2 Check cluster status

[root@nodea hadoop]# /opt/apps/hadoop-2.8.0/bin/hdfs dfsadmin -report


5.3 Test YARN

Verify YARN by opening its web UI at http://nodea:8088.

5.4 Test MapReduce

Rather than writing MapReduce code, we can use the ready-made examples shipped with Hadoop under share/hadoop/mapreduce. Run one of them:

[root@nodea hadoop]# /opt/apps/hadoop-2.8.0/bin/hadoop jar /opt/apps/hadoop-2.8.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar pi 5 10

(Screenshot of the job result omitted.)
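Another built-in example worth trying is word count over the file uploaded in 5.1 (a sketch; the output directory must not exist yet):

-------------------------------------------------------------------
/opt/apps/hadoop-2.8.0/bin/hadoop jar \
  /opt/apps/hadoop-2.8.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar \
  wordcount /movie/tt.txt /movie/wordcount-out
hadoop fs -cat /movie/wordcount-out/part-r-00000    # one "word<TAB>count" line per distinct word
-------------------------------------------------------------------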

5.5 Browse HDFS

Web UI: http://nodea:50070 

6. Problems Encountered While Configuring and Running Hadoop

6.1 JAVA_HOME is not set

On startup you may see:
localhost:Error:JAVA_HOME is not set or could not be found

 

In that case edit /opt/apps/hadoop-2.8.0/etc/hadoop/hadoop-env.sh and add the JAVA_HOME path.

6.2 Incompatible clusterIDs

Configuring a Hadoop cluster is rarely a one-shot affair; it usually goes through cycles of configure -> run -> ... -> configure -> run. When a DataNode then refuses to start, the log often shows this problem: each time the cluster is re-formatted a new cluster ID is generated, so the data directory on the failing node (the dfs.data.dir configured above, here /root/hadoop/dfs/data) must be cleared before it can rejoin.
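A typical cleanup sequence for the affected DataNode might look like this (a sketch; paths follow the dfs.data.dir and hadoop.tmp.dir values configured earlier, and the re-format step erases all HDFS metadata, so only use it on a cluster with no data worth keeping):

-------------------------------------------------------------------
/opt/apps/hadoop-2.8.0/sbin/stop-all.sh        # stop the whole cluster first
rm -rf /root/hadoop/dfs/data/*                 # on the failing DataNode: remove the stale block-pool data
rm -rf /root/hadoop/tmp/*                      # optionally clear the temp directory too
hdfs namenode -format                          # only if re-formatting the NameNode is acceptable
/opt/apps/hadoop-2.8.0/sbin/start-all.sh
-------------------------------------------------------------------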

6.3 After starting Hadoop, the DataNode and NameNode both start, but the DataNode cannot connect to the NameNode

The DataNode log shows errors like:

2018-06-11 10:43:33,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: nodea/192.168.47.130:8020

2018-06-11 10:43:55,009 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: nodea/192.168.47.130:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

2018-06-11 10:43:56,012 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: nodea/192.168.47.130:8020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 

The DataNode and the NameNode can ping each other.

The root cause is simply that the DataNode cannot reach the NameNode's RPC endpoint (the configured IP and port).

Solution:

  1. Fix the hosts file
vi /etc/hosts       # comment out (or remove) the 127.0.0.1 entries, as shown below
-------------------------------------------------------------------
#127.0.0.1   localhost localhost.localdomain localhost4
#127.0.0.1       localhost

#127.0.0.1      nodea


192.168.47.130 nodea

192.168.47.129 nodeb

192.168.47.131 nodec

-------------------------------------------------------------------
  2. Disable the firewall
  • If the Linux system is older than CentOS 7.0, run:
# temporarily stop the firewall (lost after reboot)
service iptables stop
# disable the firewall at boot (takes effect after reboot)
chkconfig iptables off
  • If the Linux system is CentOS 7.0 or later, run:
# temporarily stop
systemctl stop firewalld
# disable at boot
systemctl disable firewalld

6.4 hdfs put fails when uploading a file, with the following error:

[root@nodea init.d]# hadoop fs -put /opt/movie/test.txt /movie
18/06/19 01:21:41 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1508)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1284)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
18/06/19 01:21:41 INFO hdfs.DFSClient: Abandoning BP-1117584636-127.0.0.1-1529392562608:blk_1073741825_1001
18/06/19 01:21:41 INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[192.168.47.130:50010,DS-8dacdd17-e096-4dd0-a5b7-9b7ed189cb4a,DISK]
18/06/19 01:21:41 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1508)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1284)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
18/06/19 01:21:41 INFO hdfs.DFSClient: Abandoning BP-1117584636-127.0.0.1-1529392562608:blk_1073741826_1002
18/06/19 01:21:41 INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[192.168.47.131:50010,DS-570f444b-a0ef-45d7-bc4c-66e6f7c84871,DISK]
18/06/19 01:21:41 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /movie/test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3110)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3034)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:723)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
put: File /movie/test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

This happens because the firewall is still enabled; disabling it fixes the problem.

Disable the firewall

  • If the Linux system is older than CentOS 7.0, run:
# temporarily stop the firewall (lost after reboot)
service iptables stop
# disable the firewall at boot (takes effect after reboot)
chkconfig iptables off
  • If the Linux system is CentOS 7.0 or later, run:
# temporarily stop
systemctl stop firewalld
# disable at boot
systemctl disable firewalld

7. Notes

See the official documentation for the full details; only some common issues are covered here:

  • Version selection: Cloudera's CDH releases are a common choice. Pay attention to compatibility between components, otherwise you will hit inexplicable problems that are hard to diagnose.

  • When configuring passwordless SSH, watch the permissions of the .ssh directory; like the software packages, they must be consistent on every node, otherwise Hadoop will still prompt for passwords at startup.

  • When editing the files under conf, note that some property values must be directories the hadoop process can write to, for example hadoop.tmp.dir.

  • JAVA_HOME must be set in hadoop-env.sh regardless of whether it is configured in profile or .bash_profile.

  • Hive only needs to point at the correct Hadoop NameNode; the metastore can use the default Derby database or be switched to MySQL by changing the configuration.

  • It is best not to use the HBase master as a RegionServer.

  • Increase the ZooKeeper connection limit (the default is 30), and try to keep ZooKeeper separate from the Hadoop nodes: when Hadoop is temporarily overloaded it can seriously disturb ZooKeeper and HBase, for example preventing ZooKeeper from electing a leader for a long time, after which the HBase nodes go down one by one.

  • Installing Oozie requires the ext package, because the console depends on that framework. The console shows times in GMT by default, which is awkward; I have not found out how to change it to GMT+8 Beijing time — who can tell me?

  • After extracting Sqoop, set SQOOP_HOME, and drop the JDBC driver for whichever RDBMS HDFS needs to exchange data with into its lib directory.

  • Hadoop and HBase need to be installed on every node of their respective clusters. ZooKeeper is installed as needed, usually an odd number of nodes: more nodes means a heavier election burden, while fewer nodes reduces stability, so choose according to your situation. Hive, Oozie, and Sqoop only need to be installed on the machines that run the client programs, as long as they can reach Hadoop.


ZooKeeper Cluster Installation and Configuration

For a single-node installation (including setting it to start at boot), see: http://blog.csdn.net/pucao_cug/article/details/71240246

1. Configuring the ZooKeeper cluster

1.1 Upload the package

Upload the ZooKeeper archive to /opt/resources.

1.2 Create directories

/opt/apps/zookeeper /opt/apps/zookeeper/data /opt/apps/zookeeper/dataLog

Create a myid file in /opt/apps/zookeeper/data: vi /opt/apps/zookeeper/data/myid


On nodea the myid content is: 1
    On nodeb the myid content is: 2
    On nodec the myid content is: 3

The myid value must match the zoo.cfg configuration below, which is explained in detail there.

Save with :wq. The same can be done in one pass with the loop below.
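A sketch that writes the matching myid on each node from nodea (the IDs must line up with the server.N entries in zoo.cfg):

-------------------------------------------------------------------
id=1
for host in nodea nodeb nodec; do
  ssh root@$host "mkdir -p /opt/apps/zookeeper/data && echo $id > /opt/apps/zookeeper/data/myid"
  id=$((id+1))
done
-------------------------------------------------------------------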

1.3 Extract

tar -zxvf zookeeper-3.4.10.tar.gz

mv zookeeper-3.4.10 /opt/apps/zookeeper/zookeeper-3.4.10

1.4 Configure zoo.cfg

cd /opt/apps/zookeeper/zookeeper-3.4.10/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
-------------------------------------------------------------------

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.

# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

dataDir=/opt/apps/zookeeper/data
dataLogDir=/opt/apps/zookeeper/dataLog

server.1=nodea:2888:3888
server.2=nodeb:2888:3888
server.3=nodec:2888:3888
-------------------------------------------------------------------

Note: dataDir and dataLogDir must be created yourself; the paths can be anything as long as they correspond. The 1 in server.1 must match the value in the myid file under the dataDir on nodea, the 2 in server.2 must match the myid on nodeb, and the 3 in server.3 must match the myid on nodec. The numbers themselves are arbitrary as long as they correspond. The 2888 and 3888 ports can also be changed; since the servers run on different machines, using the same ports everywhere is fine.

1.5 Configure environment variables

vi /etc/profile

-------------------------------------------------------------------
export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH


export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export ZOOKEEPER_HOME=/opt/apps/zookeeper/zookeeper-3.4.10
export PATH=$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf:$PATH
-------------------------------------------------------------------

2. Start and test the cluster

2.1 Start the cluster

Run on each of the three machines:

zkServer.sh start

-------------------------------------------------------------------

ZooKeeper JMX enabled by default
Using config: /opt/apps/zookeeper/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
-------------------------------------------------------------------

2.2 Test

zkServer.sh status

-------------------------------------------------------------------
ZooKeeper JMX enabled by default
Using config:/opt/zookeeper/zookeeper-3.4.10/bin/../conf/zoo.cfg
Mode: follower
-------------------------------------------------------------------
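An optional end-to-end check uses ZooKeeper's four-letter commands (a sketch; it requires nc/netcat on the client machine):

-------------------------------------------------------------------
for host in nodea nodeb nodec; do
  echo -n "$host: "
  echo ruok | nc $host 2181       # a healthy server answers "imok"
  echo
done
echo stat | nc nodea 2181 | grep Mode   # shows whether nodea is the leader or a follower
-------------------------------------------------------------------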

HBase Cluster Installation and Configuration

1. Upload and extract

1.1 Create directories

/root/hbase/tmp
/root/hbase/pids
/opt/apps/hbase

Create these directories on nodea, nodeb, and nodec.

1.2 Upload and extract the package

Upload the archive to the /opt/resource directory:

tar -zxvf hbase-1.2.6.1.tar.gz

mv hbase-1.2.6.1 /opt/apps/hbase/hbase-1.2.6.1

2. HBase installation and configuration

2.1 Configure environment variables

vi /etc/profile

-------------------------------------------------------------------
# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

pathmunge () {
    case ":${PATH}:" in
        *:"$1":*)
            ;;
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}


if [ -x /usr/bin/id ]; then
    if [ -z "$EUID" ]; then
        # ksh workaround
        EUID=`id -u`
        UID=`id -ru`
    fi
    USER="`id -un`"
    LOGNAME=$USER
    MAIL="/var/spool/mail/$USER"
fi

# Path manipulation
if [ "$EUID" = "0" ]; then
    pathmunge /usr/sbin
    pathmunge /usr/local/sbin
else
    pathmunge /usr/local/sbin after
    pathmunge /usr/sbin after
fi

HOSTNAME=`/usr/bin/hostname 2>/dev/null`
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
else
    export HISTCONTROL=ignoredups
fi

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
    umask 002
else
    umask 022
fi

for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then 
            . "$i"
        else
            . "$i" >/dev/null
        fi
    fi
done

unset i
unset -f pathmunge

export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH


export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export ZOOKEEPER_HOME=/opt/apps/zookeeper/zookeeper-3.4.10
export PATH=$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf:$PATH

export HBASE_HOME=/opt/apps/hbase/hbase-1.2.6.1
export PATH=$HBASE_HOME/bin:$PATH

-------------------------------------------------------------------

source /etc/profile

2.2 HBase configuration files

Configure the files under the /opt/apps/hbase/hbase-1.2.6.1/conf directory. Change into it first:

cd /opt/apps/hbase/hbase-1.2.6.1/conf

2.2.1 Edit hbase-env.sh

vi /opt/apps/hbase/hbase-1.2.6.1/conf/hbase-env.sh

-------------------------------------------------------------------
export JAVA_HOME=/opt/jdk8
export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HBASE_HOME=/opt/apps/hbase/hbase-1.2.6.1
export HBASE_CLASSPATH=/opt/apps/hadoop-2.8.0/etc/hadoop
export HBASE_PID_DIR=/root/hbase/pids
# true means use HBase's bundled ZooKeeper; false means use an external ZooKeeper
export HBASE_MANAGES_ZK=false
-------------------------------------------------------------------

2.2.2 Edit hbase-site.xml

vi /opt/apps/hbase/hbase-1.2.6.1/conf/hbase-site.xml

-------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
<property>
<!-- Must match fs.default.name in Hadoop's core-site.xml, with an extra /hbase directory appended -->
 <name>hbase.rootdir</name>
 <value>hdfs://nodea:9000/hbase</value>
 <description>The directory shared byregion servers.</description>
</property>
<property>
<!-- Note: this port must match the clientPort configured in ZooKeeper's zoo.cfg -->
 <name>hbase.zookeeper.property.clientPort</name>
 <value>2181</value>
 <description>Property from ZooKeeper'sconfig zoo.cfg. The port at which the clients will connect.
 </description>
</property>
<property>
 <name>zookeeper.session.timeout</name>
 <value>120000</value>
</property>
<property>
<!-- Hostnames of the ZooKeeper quorum members -->
 <name>hbase.zookeeper.quorum</name>
 <value>nodea,nodeb,nodec</value>
</property>
<property>
 <name>hbase.tmp.dir</name>
 <value>/root/hbase/tmp</value>
</property>
<property>
 <name>hbase.cluster.distributed</name>
 <value>true</value>
</property>
</configuration>

-------------------------------------------------------------------

2.2.3 Edit the regionservers file

Change the file content to:
-------------------------------------------------------------------
nodea
nodeb
nodec
-------------------------------------------------------------------

2.3 Install HBase on nodeb and nodec

scp -r /opt/apps/hbase/hbase-1.2.6.1 root@nodeb:/opt/apps/hbase/hbase-1.2.6.1

scp -r /opt/apps/hbase/hbase-1.2.6.1 root@nodec:/opt/apps/hbase/hbase-1.2.6.1

3. Start and test

3.1 Start

HBase sits on top of the distributed file system provided by Hadoop, so make sure Hadoop is running properly before starting HBase. HBase also depends on ZooKeeper; we could use the ZooKeeper bundled with HBase, but the configuration above uses our own ZooKeeper cluster, so make sure ZooKeeper is running as well before starting HBase. A typical startup order is sketched below.
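A sketch of that order, using the paths configured earlier in this guide:

-------------------------------------------------------------------
/opt/apps/hadoop-2.8.0/sbin/start-dfs.sh                 # 1) HDFS first: hbase.rootdir lives on it
for host in nodea nodeb nodec; do                        # 2) then the external ZooKeeper ensemble
  ssh root@$host "/opt/apps/zookeeper/zookeeper-3.4.10/bin/zkServer.sh start"
done
/opt/apps/hbase/hbase-1.2.6.1/bin/start-hbase.sh         # 3) finally HBase, started on nodea only
-------------------------------------------------------------------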

HBase can be installed on just one of the Hadoop NameNode hosts or on every Hadoop node, but it only needs to be started from a single node. In this example HBase is installed on nodea, nodeb, and nodec, and started from nodea only.

cd /opt/apps/hbase/hbase-1.2.6.1/bin
./start-hbase.sh
# if the environment variables in /etc/profile are already set, you can simply run start-hbase.sh from any directory

-------------------------------------------------------------------
starting master, logging to /opt/hbase/hbase-1.2.5/logs/hbase-root-master-hserver1.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
hserver3: starting regionserver, logging to /opt/hbase/hbase-1.2.5/logs/hbase-root-regionserver-hserver3.out
hserver2: starting regionserver, logging to /opt/hbase/hbase-1.2.5/logs/hbase-root-regionserver-hserver2.out
hserver2: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
hserver2: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
hserver3: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
hserver3: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
hserver1: starting regionserver, logging to /opt/hbase/hbase-1.2.5/logs/hbase-root-regionserver-hserver1.out

-------------------------------------------------------------------

3.2 Test

3.2.1 View HBase status in a browser

Open http://nodea:16030/ (the RegionServer UI; the HMaster web UI is at http://nodea:16010/).

 

3.2.2 Start the HBase shell

cd  /opt/apps/hbase/hbase-1.2.6.1/bin

Start the HBase shell with:

./hbase  shell
-------------------------------------------------------------------
2017-05-15 17:52:55,411 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hbase/hbase-1.2.5/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.5, rd7b05f79dee10e0ada614765bb354b93d615a157, Wed Mar  1 00:34:48 CST 2017

# if list runs without errors, the installation is complete
hbase(main):001:0> list

-------------------------------------------------------------------



The SLF4J warning appears because HBase and Hadoop each bundle a different version of the same slf4j-log4j12 jar; deleting either copy resolves it.

Here I removed the HBase copy:

rm -rf /opt/apps/hbase/hbase-1.2.6.1/lib/slf4j-log4j12-1.7.5.jar
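A small smoke test can also be driven from bash through a heredoc (a sketch; the table name smoke_test is just an example):

-------------------------------------------------------------------
hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:msg', 'hello hbase'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
list
EOF
-------------------------------------------------------------------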

Hive Installation and Configuration

1. Environment preparation

1.1 Download Hive

Download page: http://hive.apache.org/downloads.html 

Click "Download a release now!" on that page.

1.2 Create a directory

mkdir -p /opt/apps/hive

1.3 Upload and extract

Upload the archive to the /opt/resource directory:

tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /opt/apps/hive/apache-hive-2.3.3-bin

1.4 Configure environment variables

 vi /etc/profile
 -------------------------------------------------------------------
 Add:
export HIVE_HOME=/opt/apps/hive/apache-hive-2.3.3-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export CLASSPATH=$HIVE_HOME/lib:$CLASSPATH
export PATH=$HIVE_HOME/bin:$PATH
-------------------------------------------------------------------
 source /etc/profile
 
 cat /etc/profile
 
-------------------------------------------------------------------
 export JAVA_HOME=/opt/jdk8
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH


export HADOOP_HOME=/opt/apps/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export ZOOKEEPER_HOME=/opt/apps/zookeeper/zookeeper-3.4.10
export PATH=$ZOOKEEPER_HOME/bin:$ZOOKEEPER_HOME/conf:$PATH

export HBASE_HOME=/opt/apps/hbase/hbase-1.2.6.1
export PATH=$HBASE_HOME/bin:$PATH


export HIVE_HOME=/opt/apps/hive/apache-hive-2.3.3-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export CLASSPATH=$HIVE_HOME/lib:$CLASSPATH
export PATH=$HIVE_HOME/bin:$PATH
-------------------------------------------------------------------

2. Configure Hive

2.1 Create the HDFS directories

hive-site.xml contains the following configuration:

-------------------------------------------------------------------
  <name>hive.metastore.warehouse.dir</name>

  <value>/user/hive/warehouse</value>

   <name>hive.exec.scratchdir</name>

  <value>/tmp/hive</value>
-------------------------------------------------------------------

So we need Hadoop to create the /user/hive/warehouse directory:

$HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse

Grant read/write permissions on the new directory:

$HADOOP_HOME/bin/hadoop fs -chmod 777 /user/hive/warehouse 

Create the /tmp/hive/ directory:

$HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp/hive/

Grant read/write permissions on it:

$HADOOP_HOME/bin/hadoop fs -chmod 777 /tmp/hive

Check that the HDFS directories were created:

$HADOOP_HOME/bin/hadoop fs -ls /user/hive/
$HADOOP_HOME/bin/hadoop fs -ls /tmp/

 

2.2 hive-site.xml configuration

cd /opt/apps/hive/apache-hive-2.3.3-bin/conf

Copy hive-default.xml.template and rename the copy to hive-site.xml:

cp hive-default.xml.template hive-site.xml

Replace every occurrence of ${system:java.io.tmpdir} in hive-site.xml with a local Hive temp directory, for example /opt/hive/tmp; if that directory does not exist, create it manually and grant read/write permissions. In vi you can search with /system:java.io.tmpdir and press n to jump to the next match.

Also replace every occurrence of ${system:user.name} with root.

Note: make sure you replace every occurrence in the file, not just the first few. The same replacements can be done non-interactively as shown below.
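A sketch of the same replacements with GNU sed (back up the file first; /opt/hive/tmp is the example directory chosen above):

-------------------------------------------------------------------
cd /opt/apps/hive/apache-hive-2.3.3-bin/conf
cp hive-site.xml hive-site.xml.bak
mkdir -p /opt/hive/tmp && chmod 777 /opt/hive/tmp
sed -i 's#${system:java.io.tmpdir}#/opt/hive/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#root#g' hive-site.xml
grep -c 'system:java.io.tmpdir' hive-site.xml   # should print 0 once every occurrence is replaced
-------------------------------------------------------------------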

Next, configure the database-related settings in hive-site.xml.
Search for javax.jdo.option.ConnectionURL and change its value to the MySQL address of a machine that has MySQL installed; for example, mine becomes:

-------------------------------------------------------------------
     <name>javax.jdo.option.ConnectionURL</name>  

     <value>jdbc:mysql://192.168.1.8:3306/hive?createDatabaseIfNotExist=true</value>
-------------------------------------------------------------------

Search for javax.jdo.option.ConnectionDriverName and change its value to the MySQL driver class; for example, mine becomes:

-------------------------------------------------------------------
 <property> 

        <name>javax.jdo.option.ConnectionDriverName</name> 

        <value>com.mysql.jdbc.Driver</value> 

 </property>      
-------------------------------------------------------------------

Search for javax.jdo.option.ConnectionUserName and change its value to the MySQL login user:

-------------------------------------------------------------------
 <name>javax.jdo.option.ConnectionUserName</name>

    <value>root</value>
-------------------------------------------------------------------

Search for javax.jdo.option.ConnectionPassword and change its value to the MySQL login password:

-------------------------------------------------------------------
 <name>javax.jdo.option.ConnectionPassword</name>

     <value>123456</value>
 
--------------------------------------------------------------------

Search for hive.metastore.schema.verification and change its value to false:

-------------------------------------------------------------------
  <name>hive.metastore.schema.verification</name>

     <value>false</value>
-------------------------------------------------------------------

2.3 Copy the MySQL driver jar into the lib directory

Upload the MySQL driver jar into Hive's lib directory, in my case /opt/apps/hive/apache-hive-2.3.3-bin/lib.

2.4 Create and edit hive-env.sh

cd /opt/apps/hive/apache-hive-2.3.3-bin/conf

cp    hive-env.sh.template    hive-env.sh

vi  hive-env.sh
-------------------------------------------------------------------
Add:
export HADOOP_HOME=/opt/apps/hadoop-2.8.0

export HIVE_CONF_DIR=/opt/apps/hive/apache-hive-2.3.3-bin/conf

export HIVE_AUX_JARS_PATH=/opt/apps/hive/apache-hive-2.3.3-bin/lib

-------------------------------------------------------------------

3. Start

3.1 Initialize the MySQL metastore

 cd /opt/apps/hive/apache-hive-2.3.3-bin/bin

Initialize the schema with:
[root@nodea bin]# schematool -initSchema -dbType mysql
-------------------------------------------------------------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/hive/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://192.168.1.51:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
-------------------------------------------------------------------

After this succeeds, the hive database in MySQL contains the metastore tables.

3.2 Start Hive

[root@nodea bin]# cd /opt/apps/hive/apache-hive-2.3.3-bin/bin
[root@nodea bin]# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/hive/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/apps/hive/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show functions
    > ;
OK
!
!=
$sum0
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
hive> desc function sum;
OK
sum(x) - Returns the sum of a set of numbers
Time taken: 0.463 seconds, Fetched: 1 row(s)
hive> create database db_hive_edu;
OK
Time taken: 0.417 seconds
hive> use db_hive_edu;
OK
Time taken: 0.125 seconds

4. Test

4.1 Run some simple commands

# show details of the sum function:
hive> desc function sum;
OK
sum(x) - Returns the sum of a set of numbers
Time taken: 0.463 seconds, Fetched: 1 row(s)
## create a database
hive> create database db_hive_edu;
OK
Time taken: 0.417 seconds
hive> use db_hive_edu;
OK
Time taken: 0.125 seconds
# create a table
hive> create table student(id int,name string) row format delimited fields terminated by '\t';
OK
Time taken: 1.485 seconds

4.2 Load file data into the table

Run the following Linux commands (preferably in a new terminal):

   touch    /opt/apps/hive/student.txt
   vi /opt/apps/hive/student.txt
-------------------------------------------------------------------
    Add the following content to the file; the columns are separated by a TAB:

001     zhangsan
002     lisi
003     wangwu
004     zhaoliu
005     chenqi
-------------------------------------------------------------------

Note: the separator between id and name must be a TAB character, not spaces, because the CREATE TABLE statement above used terminated by '\t'; the fields in this file must therefore be separated by real TABs (if copy-paste mangles them, type the TAB key by hand). There must also be no empty lines between rows, otherwise the load below will insert NULL rows. The file must use Unix line endings; if it was edited in a Windows text editor and then uploaded, convert it from Windows to Unix format first, for example with Notepad++. A shell-based way to produce the file is sketched below.
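Generating the file from the shell guarantees real TAB characters and Unix line endings (a sketch using the sample rows above):

-------------------------------------------------------------------
printf '%s\t%s\n' 001 zhangsan 002 lisi 003 wangwu 004 zhaoliu 005 chenqi > /opt/apps/hive/student.txt
cat -A /opt/apps/hive/student.txt   # each line should show ^I (TAB) between the columns and end with $
-------------------------------------------------------------------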

hive> load data local inpath '/opt/apps/hive/student.txt' into table db_hive_edu.student;
Loading data to table db_hive_edu.student
OK
Time taken: 2.45 seconds
hive> select * from student;
OK
1       liyan
2       hld
3       ylq
4       ps
5       ss
6       hello
Time taken: 2.01 seconds, Fetched: 6 row(s)

4.3 View the data just written to HDFS in the web UI

http://nodea:50070/explorer.html#/user/hive/warehouse

 

4.4 Check the hive database in MySQL

Run a SELECT in the MySQL database to see the tables Hive has registered:

SELECT  * FROM  hive.TBLS

5. Errors and solutions

5.1 Warning: Unable to load native-hadoop library for your platform

This warning can safely be ignored.

5.2 There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

Error message:

-------------------------------------------------------------------
hive> load data local inpath '/opt/hive/student.txt' into table db_hive_edu.student;

Loading data to table db_hive_edu.student

Failed with exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/db_hive_edu.db/student/student_copy_2.txt could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1559)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3245)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:663)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:975)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hive/warehouse/db_hive_edu.db/student/student_copy_2.txt could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1559)
        ... (identical stack trace as above)
-------------------------------------------------------------------

Cause and solution:

The DataNodes in your Hadoop cluster have a problem and cannot accept writes. Check that Hadoop is running properly and that you can reach http://<NameNode IP>:50070.

If it is not, go back and verify that your Hadoop installation and configuration are correct, and then that the Hive installation and configuration are correct.
