HBase 官方文檔中文版

原文出自:http://abloz.com/hbase/book.html

Apache HBase™ 參考指南

HBase 官方文檔中文版

Revision History
Revision 0.95-SNAPSHOT 2012-12-03T13:38
中文版翻譯整理 周海漢


譯者:HBase新版 0.95 文檔和0.90版相比,變化較大,文檔補充更新了很多內容,章節調整較大。本翻譯文檔的部分工作基於顏開工作。英文原文地址在此處。舊版0.90版由顏開翻譯文檔在此處。0.95版翻譯最後更新請到此處http://abloz.com/hbase/book.html )瀏覽。反饋和參與請到此處 (https://code.google.com/p/hbasedoc-cn/)或訪問我的blog(http://abloz.com),或給我發email。

最終版生成pdf供下載。

貢獻者:

周海漢郵箱:[email protected], 網址:http://abloz.com/
顏開郵箱: [email protected], 網址:http://www.yankay.com/


摘要

這是 Apache HBase (TM)的官方文檔。 HBase是一個分佈式,版本化,面向列的數據庫,構建在 Apache Hadoop和 Apache ZooKeeper之上。

 


目錄

1. 入門
1.1. 介紹
1.2. 快速開始
2. Apache HBase (TM)配置
2.1. 基礎條件
2.2. HBase 運行模式: 獨立和分佈式
2.3. 配置文件
2.4. 配置示例
2.5. 重要配置
 
3. 升級
3.1. 從 0.94.x 升級到 0.96.x
3.2. 從 0.92.x 升級到 0.94.x
3.3. 從 0.90.x 升級到 0.92.x
3.4. 從0.20x或0.89x升級到0.90.x
 
4. HBase Shell
4.1. 使用腳本
4.2. Shell 技巧
5. 數據模型
5.1. 概念視圖
5.2. 物理視圖
5.3. 表
5.4. 行
5.5. 列族
5.6. Cells
5.7. Data Model Operations
5.8. 版本
5.9. 排序
5.10. 列元數據
5.11. Joins
6. HBase 和 Schema 設計
6.1. Schema 創建
6.2. column families的數量
6.3. Rowkey 設計
6.4. 版本數量
6.5. 支持的數據類型
6.6. Joins
6.7. 生存時間 (TTL)
6.8. 保留刪除的單元
6.9. 第二索引和替代查詢路徑
6.10. 模式設計對決
6.11. 操作和性能配置選項
6.12. 限制
7. HBase 和 MapReduce
7.1. Map-Task 分割
7.2. HBase MapReduce 示例
7.3. 在MapReduce工作中訪問其他 HBase 表
7.4. 推測執行
8. HBase安全
8.1. 安全客戶端訪問 HBase
8.2. 訪問控制
9. 架構
9.1. 概述
9.2. 目錄表
9.3. 客戶端
9.4. 客戶請求過濾器
9.5. Master
9.6. RegionServer
9.7. 分區(Regions)
9.8. 批量加載
9.9. HDFS
10. 外部 APIs
10.1. 非Java語言和 JVM交互
10.2. REST
10.3. Thrift
10.4. C/C++ Apache HBase Client
11. 性能調優
11.1. 操作系統
11.2. 網絡
11.3. Java
11.4. HBase 配置
11.5. ZooKeeper
11.6. Schema 設計
11.7. 寫到 HBase
11.8. 從 HBase讀取
11.9. 從 HBase刪除
11.10. HDFS
11.11. Amazon EC2
11.12. 案例
12. 故障排除和調試 HBase
12.1. 通用指引
12.2. Logs
12.3. 資源
12.4. 工具
12.5. 客戶端
12.6. MapReduce
12.7. NameNode
12.8. 網絡
12.9. RegionServer
12.10. Master
12.11. ZooKeeper
12.12. Amazon EC2
12.13. HBase 和 Hadoop 版本相關
12.14. 案例
13. 案例研究
13.1. 概要
13.2. Schema 設計
13.3. 性能/故障排除
14. HBase 運維管理
14.1. HBase 工具和實用程序
14.2. 分區管理
14.3. 節點管理
14.4. HBase 度量(Metrics)
14.5. HBase 監控
14.6. Cluster 複製
14.7. HBase 備份
14.8. 容量規劃
15. 創建和開發 HBase
15.1. HBase 倉庫
15.2. IDEs
15.3. 創建 HBase
15.4. 添加 Apache HBase 發行版到Apache的 Maven Repository
15.5. 更新 hbase.apache.org
15.6. 測試
15.7. Maven 創建命令
15.8. 加入
15.9. 開發
15.10. 提交補丁
16. ZooKeeper
16.1. Using existing ZooKeeper ensemble
16.2. SASL Authentication with ZooKeeper
17. Community
17.1. Decisions
17.2. Community Roles
A. FAQ
B. 深入hbck
B.1. 運行 hbck 以查找不一致
B.2. 不一致(Inconsistencies)
B.3. 局部修補
B.4. 分區重疊修補
C. HBase中的壓縮
C.1. CompressionTest 工具
C.2. hbase.regionserver.codecs
C.3. LZO
C.4. GZIP
C.5. SNAPPY
C.6. 修改壓縮 Schemes
D. YCSB: Yahoo! 雲服務評估和 HBase
 
E. HFile 格式版本 2
E.1. Motivation
E.2. HFile 格式版本 1 概覽
E.3. HBase 文件格式帶 inline blocks (version 2)
F. HBase的其他信息
F.1. HBase 視頻
F.2. HBase 展示 (Slides)
F.3. HBase 論文
F.4. HBase 網站
F.5. HBase 書籍
F.6. Hadoop 書籍
G. HBase 歷史
 
H. HBase 和 Apache 軟件基金會(ASF)
H.1. ASF開發進程
H.2. ASF 報告板
I. Enabling Dapper-like Tracing in HBase
I.1. SpanReceivers
I.2. Client Modifications
詞彙表
 

這本書是 HBase 的官方指南。 版本爲 0.95-SNAPSHOT 。可以在HBase官網上找到它。也可以在 javadocJIRA 和 wiki 找到更多的資料。

此書正在編輯中。 可以向 HBase 官方提供補丁JIRA.

這個版本系譯者水平限制,沒有理解清楚或不需要翻譯的地方保留英文原文。

最前面的話

若這是你第一次踏入分佈式計算的精彩世界,你會感到這是一個有趣的年代。分佈式計算是很難的,做一個分佈式系統需要很多軟硬件和網絡的技能。你的集羣可能會因爲各式各樣的錯誤發生故障,比如HBase本身的Bug,錯誤的配置(包括操作系統),硬件的故障(網卡和磁盤甚至內存)。如果你一直在寫單機程序的話,你需要重新開始學習。這裏就是一個好的起點: 分佈式計算的謬論.

Chapter 1. 入門

1.1. 介紹

Section 1.2, “快速開始”會介紹如何運行一個單機版的HBase,它運行在本地磁盤上。 Section 2, “配置” 會介紹如何運行一個分佈式的HBase,它運行在HDFS上。

1.2. 快速開始

本指南介紹了在單機安裝HBase的方法。會引導你通過shell創建一個表,插入一行,然後刪除它,最後停止HBase。只要10分鐘就可以完成以下的操作。

1.2.1. 下載解壓最新版本

選擇一個 Apache 下載鏡像,下載 HBase Releases. 點擊 stable目錄,然後下載後綴爲 .tar.gz 的文件; 例如 hbase-0.95-SNAPSHOT.tar.gz.

解壓縮,然後進入到解壓後的目錄.

$ tar xfz hbase-0.95-SNAPSHOT.tar.gz
$ cd hbase-0.95-SNAPSHOT

現在你已經可以啓動HBase了。但是你可能需要先編輯 conf/hbase-site.xml 去配置hbase.rootdir,來選擇HBase將數據寫到哪個目錄 .

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.rootdir</name> <value>file:///DIRECTORY/hbase</value> </property> </configuration>

將 DIRECTORY 替換成你期望寫文件的目錄. 默認 hbase.rootdir 是指向 /tmp/hbase-${user.name} ,也就說你會在重啓後丟失數據(重啓的時候操作系統會清理/tmp目錄)

1.2.2. 啓動 HBase

現在啓動HBase:

$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

現在你運行的是單機模式的HBase。所有的服務都運行在一個JVM上,包括HBase和ZooKeeper。HBase的日誌放在logs目錄,當你啓動出問題的時候,可以檢查這個日誌。

是否安裝了 java ?

你需要確認安裝了Oracle的1.6 版本的java.如果你在命令行鍵入java有反應說明你安裝了Java。如果沒有裝,你需要先安裝,然後編輯conf/hbase-env.sh,將其中的JAVA_HOME指向到你Java的安裝目錄。

1.2.3. Shell 練習

shell連接你的HBase

$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0>

輸入 help 然後 <RETURN> 可以看到一列shell命令。這裏的幫助很詳細,要注意的是表名,行和列需要加引號。

創建一個名爲 test 的表,這個表只有一個 列族 爲 cf。可以列出所有的表來檢查創建情況,然後插入些值。

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'table'
test
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

以上我們分別插入了3行。第一個行key爲row1, 列爲 cf:a, 值是 value1。HBase中的列是由 列族前綴和列的名字組成的,以冒號間隔。例如這一行的列名就是a.

檢查插入情況.

Scan這個表,操作如下

hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Get一行,操作如下

hbase(main):008:0> get 'test', 'row1'
COLUMN     CELL
cf:a       timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

disable 再 drop 這張表,可以清除你剛剛的操作

hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds

關閉shell

hbase(main):014:0> exit

1.2.4. 停止 HBase

運行停止腳本來停止HBase.

$ ./bin/stop-hbase.sh
stopping hbase...............

1.2.5. 下一步該做什麼

以上步驟僅僅適用於實驗和測試。接下來你可以看 Section 2., “配置” ,我們會介紹不同的HBase運行模式,運行分佈式HBase中需要的軟件 和如何配置。

2. 配置

 

相對於前面的快速開始,本章是更爲詳細的配置指導。

HBase有如下需要,請仔細閱讀本章節以確保所有的需要都被滿足。如果需求沒有能滿足,就有可能遇到莫名其妙的錯誤甚至丟失數據。

HBase使用和Hadoop一樣的配置系統。要配置部署,先編輯conf/hbase-env.sh文件中的環境變量——該配置文件主要供啓動腳本在啓動集羣時讀取——然後在XML文件中增加配置,以覆蓋HBase缺省配置,告訴HBase用什麼文件系統、全部ZooKeeper位置 [1] 。

在分佈模式下運行時,在編輯HBase配置文件之後,確認將conf目錄複製到集羣中的每個節點。HBase不會自動同步。使用rsync.

[1] 小心編輯XML。確認關閉所有元素。採用 xmllint 或類似工具確認文檔編輯後是良好格式化的。

2.1. 基礎條件

This section lists required services and some required system configuration.

2.1.1. Java

和Hadoop一樣,HBase需要Oracle版本的Java6.除了那個有問題的u18版本其他的都可以用,最好用最新的。

2.1.2. 操作系統

2.1.2.1. ssh

必須安裝ssh , sshd 也必須運行,這樣Hadoop的腳本纔可以遠程操控其他的Hadoop和HBase進程。ssh之間必須都打通,不用密碼都可以登錄,詳細方法可以Google一下 ("ssh passwordless login").

2.1.2.2. DNS

HBase使用本地 hostname 來獲得IP地址. 正反向的DNS都是可以的.

如果你的機器有多個接口,HBase會使用hostname指向的主接口.

如果還不夠,你可以設置 hbase.regionserver.dns.interface 來指定主接口。當然你的整個集羣的配置文件都必須一致,每個主機都使用相同的網絡接口

還有一種方法是設置 hbase.regionserver.dns.nameserver來指定nameserver,不使用系統帶的.

2.1.2.3. Loopback IP

HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions, for example, will default to 127.0.1.1 and this will cause problems for you.

/etc/hosts should look something like this:

127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu

2.1.2.4. NTP

集羣的時鐘要保證基本的一致。稍有不一致是可以容忍的,但是很大的不一致會造成奇怪的行爲。 運行 NTP 或者其他什麼東西來同步你的時間.

如果你查詢的時候或者是遇到奇怪的故障,可以檢查一下系統時間是否正確!

2.1.2.5.  ulimit 和 nproc

HBase是數據庫,會在同一時間使用很多的文件句柄。大多數linux系統使用的默認值1024是不能滿足的,會導致FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs?異常。還可能會發生這樣的異常

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

所以你需要修改你的最大文件句柄限制。可以設置到10k. 你還需要修改 hbase 用戶的 nproc,如果過低會造成 OutOfMemoryError異常。 [2] [3].

需要澄清的是,這兩個設置是針對操作系統的,不是HBase本身的。有一個常見的錯誤是HBase運行的用戶,和設置最大值的用戶不是一個用戶。在HBase啓動的時候,第一行日誌會顯示ulimit信息,所以你最好檢查一下。 [4]

2.1.2.5.1. 在Ubuntu上設置ulimit

如果你使用的是Ubuntu,你可以這樣設置:

在文件 /etc/security/limits.conf 添加一行,如:

hadoop - nofile 32768

可以把 hadoop 替換成你運行HBase和Hadoop的用戶。如果你用兩個用戶,你就需要配兩個。還有配nproc hard 和 soft limits. 如:

hadoop soft/hard nproc 32000


在 /etc/pam.d/common-session 加上這一行:

session required pam_limits.so

否則在 /etc/security/limits.conf上的配置不會生效.

還要註銷再登錄,這些配置才能生效!

2.1.2.6. Windows

HBase沒有怎麼在Windows下測試過。所以不推薦在Windows下運行.

如果你實在是想運行,需要安裝Cygwin 並虛擬一個unix環境.詳情請看 Windows 安裝指導 . 或者 搜索郵件列表找找最近的關於windows的注意點

2.1.3. hadoop

請完整閱讀本節:

請閱讀本節到末尾. 下面我們要先費些篇幅討論 Hadoop 的各個版本,然後說明 HBase 必須如何配合特定的 Hadoop 版本才能正常工作。

除非運行在實現了持久化同步(sync)的HDFS上,HBase 將丟失所有數據。 Hadoop 0.20.2, Hadoop 0.20.203.0,及 Hadoop 0.20.204.0 不具有上述特性。當前Hadoop僅在Hadoop 0.20.205.x 或更高版本--包含hadoop 1.0.0 --具有持久化sync. Sync 必須顯式開啓。即 dfs.support.append 同時在客戶端和服務器端設爲真,客戶端: hbase-site.xml ,服務器端: hdfs-site.xml ( HBase需要的同步措施是一個附加代碼路徑的子集)

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

修改後必須重啓集羣。忽略部分註釋,可以在 hdfs-default.xml 找到 dfs.support.append 的配置描述;描述說沒有啓用是因爲 “... bugs in the 'append code' and is not supported in any production cluster.”. 該註釋已經過時。我確信有bug,但 sync/append 代碼已經被運行於大容量產品部署,並且很多商業產品的Hadoop缺省是打開的。 [7] [8][9].

你還可以用 Cloudera的 CDH3 或  MapR 。 Cloudera 的CDH3 是Apache hadoop 0.20.x的補丁增強,包含所有  branch-0.20-append  附加的持久化Sync. 使用最新的產品化版的 CDH3.

MapR 包含一個商業的, 重新實現的 HDFS. 具有持久化 sync及一些有趣特性,是現在 Apache Hadoop不具有的.  M3 產品免費並且無限制。

因爲HBase建立在Hadoop之上,所以他用到了hadoop.jar,這個Jar在 lib 裏面。這個jar是hbase自己打了branch-0.20-append 補丁的hadoop.jar. Hadoop使用的hadoop.jar和HBase使用的 必須 一致。所以你需要將 HBase lib 目錄下的hadoop.jar替換成Hadoop裏面的那個,防止版本衝突。比方說CDH的版本沒有HDFS-724而branch-0.20-append裏面有,這個HDFS-724補丁修改了RPC協議。如果不替換,就會有版本衝突,繼而造成嚴重的出錯,Hadoop會看起來掛了。

Packaging and Apache BigTop

Apache Bigtop is an umbrella for packaging and tests of the Apache Hadoop ecosystem, including Apache HBase. Bigtop performs testing at various levels (packaging, platform, runtime, upgrade, etc...), developed by a community, with a focus on the system as a whole, rather than individual projects. We recommend installing Apache HBase packages as provided by a Bigtop release rather than rolling your own piecemeal integration of various component releases.

 

2.1.3.1. Hadoop 安全性

HBase運行在Hadoop 0.20.x上,就可以使用其中的安全特性 -- 只要你用這兩個版本0.20S 和CDH3B3,然後把hadoop.jar替換掉就可以了.

2.1.3.2. dfs.datanode.max.xcievers

一個 Hadoop HDFS Datanode 有一個同時處理文件的上限. 這個參數叫 xcievers (Hadoop的作者把這個單詞拼錯了). 在你加載之前,先確認下你有沒有配置這個文件conf/hdfs-site.xml裏面的xceivers參數,至少要有4096:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

對於HDFS修改配置要記得重啓.

如果沒有這一項配置,你可能會遇到奇怪的失敗。你會在Datanode的日誌中看到xcievers exceeded,但是運行起來會報 missing blocks錯誤。例如:

10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry... [5]

2.2. HBase運行模式:單機和分佈式

HBase有兩個運行模式: Section 2.4.1, “單機模式” 和 Section 2.4.2, “分佈式模式”. 默認是單機模式,如果要分佈式模式你需要編輯 conf 文件夾中的配置文件.

不管是什麼模式,你都需要編輯 conf/hbase-env.sh來告知HBase java的安裝路徑.在這個文件裏你還可以設置HBase的運行環境,諸如 heapsize和其他 JVM有關的選項, 還有Log文件地址,等等. 設置 JAVA_HOME指向 java安裝的路徑.

2.2.1. 單機模式

這是默認的模式,在 Section 1.2, “快速開始” 一章中介紹的就是這個模式. 在單機模式中,HBase使用本地文件系統,而不是HDFS ,所有的服務和本地的ZooKeeper都運行在同一個JVM中。ZooKeeper監聽一個端口,這樣客戶端就可以連接HBase了。

2.2.2. 分佈式模式

分佈式模式分兩種。僞分佈式模式是把所有進程運行在一臺機器上,但各自運行在獨立的JVM裏;而完全分佈式模式是把整個服務分佈在集羣的各個節點上 [6].

分佈式模式需要使用 Hadoop Distributed File System (HDFS).可以參見 HDFS需求和指導來獲得關於安裝HDFS的指導。在操作HBase之前,你要確認HDFS可以正常運作。

不論是僞分佈式模式還是完全分佈式模式,安裝之後你都需要確認配置是否正確。這兩個模式可以使用同一個驗證腳本Section 2.2.3, “運行和確認你的安裝”。

2.2.2.1. 僞分佈式模式

僞分佈式模式是一個相對簡單的分佈式模式。這個模式是用來測試的。不能把這個模式用於生產環境,也不能用於測試性能。

你確認HDFS安裝成功之後,就可以先編輯 conf/hbase-site.xml。在這個文件你可以加入自己的配置,這個配置會覆蓋 Section 2.6.1.1, “HBase 默認配置” 和 Section 2.2.2.2.3, “HDFS客戶端配置”. 運行HBase需要設置hbase.rootdir 屬性.該屬性是指HBase在HDFS中使用的目錄的位置。例如,要使用 /hbase 目錄,讓namenode 監聽localhost的9000端口,只保留一份數據拷貝(HDFS默認是3份拷貝)。可以在 hbase-site.xml 寫上如下內容

<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The replication count for HLog & HFile storage. Should not be greater than HDFS datanode count.
    </description>
  </property>
  ...
</configuration>

Note

讓HBase自己創建 hbase.rootdir

Note

上面我們綁定到 localhost. 也就是說除了本機,其他機器連不上HBase。所以你需要設置成別的,才能使用它。

現在可以跳到 Section 2.2.3, “運行和確認你的安裝” 來運行和確認你的僞分佈式模式安裝了。 [7]

2.2.2.1.1. 僞分佈模式配置文件

下面是僞分佈模式設置的配置文件示例。

hdfs-site.xml
<configuration>
  ...
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/local/user.name/hdfs-data-name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/local/user.name/hdfs-data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  ...
</configuration>
hbase-site.xml
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  ...
</configuration>
2.2.2.1.2. 僞分佈模式附加
2.2.2.1.2.1. 啓動

啓動初始 HBase 集羣...

% bin/start-hbase.sh

在同一服務器啓動額外備份主服務器

% bin/local-master-backup.sh start 1

... '1' 表示使用端口 60001 和 60011,該備份主服務器的log文件放在 logs/hbase-${USER}-1-master-${HOSTNAME}.log.

啓動多個備份主服務器...

% bin/local-master-backup.sh start 2 3

可以啓動到 9 個備份服務器 (總數10 個).

啓動更多 regionservers...

% bin/local-regionservers.sh start 1

'1' 表示使用端口 60201 & 60301 ,log文件在 logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log.

在剛運行的regionserver上增加 4 個額外 regionservers ...

% bin/local-regionservers.sh start 2 3 4 5

支持到 99 個額外regionservers (總100個).

2.2.2.1.2.2. 停止

假設想停止備份主服務器 # 1, 運行...

% cat /tmp/hbase-${USER}-1-master.pid |xargs kill -9

注意,執行 bin/local-master-backup.sh stop 1 會連同主服務器一起嘗試停止整個集羣。

停止單獨 regionserver, 運行...

% bin/local-regionservers.sh stop 1

2.2.2.2. 完全分佈式模式

要想運行完全分佈式模式,你要進行如下配置,先在 hbase-site.xml, 加一個屬性 hbase.cluster.distributed 設置爲 true 然後把 hbase.rootdir 設置爲HDFS的NameNode的位置。 例如,你的namenode運行在namenode.example.org,端口是9000 你期望的目錄是 /hbase,使用如下的配置

<configuration> ... <property> <name>hbase.rootdir</name> <value>hdfs://namenode.example.org:9000/hbase</value> <description>The directory shared by RegionServers. </description> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> <description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh) </description> </property> ... </configuration>
2.2.2.2.1. regionservers

完全分佈式模式還需要修改conf/regionservers. Section 2.7.1.2, “regionservers” 列出了你希望運行的全部 HRegionServer,一行寫一個host (就像Hadoop裏面的 slaves 一樣). 列在這裏的server會隨着集羣的啓動而啓動,集羣的停止而停止.

2.2.2.2.2. ZooKeeper 和 HBase

 

2.2.2.2.3. HDFS客戶端配置

如果你在Hadoop集羣上做了HDFS 客戶端配置,例如你的HDFS客戶端的配置和服務端的不一樣,按照如下的方法之一配置,HBase就能看到你的配置信息:

  • hbase-env.sh裏將HBASE_CLASSPATH環境變量加上HADOOP_CONF_DIR 。

  • ${HBASE_HOME}/conf下面加一個 hdfs-site.xml (或者 hadoop-site.xml) ,最好是軟連接

  • 如果你的HDFS客戶端的配置不多的話,你可以把這些加到 hbase-site.xml上面.

例如HDFS的配置 dfs.replication.你希望複製5份,而不是默認的3份。如果你不照上面的做的話,HBase只會複製3份。

2.2.3. 運行和確認你的安裝

首先確認你的HDFS是運行着的。你可以運行HADOOP_HOME中的 bin/start-hdfs.sh 來啓動HDFS.你可以通過put命令來測試放一個文件,然後用get命令來讀這個文件。通常情況下HBase是不會用到mapreduce的,所以不需要檢查這些。

如果你自己管理ZooKeeper集羣,你需要確認它是運行着的。如果是HBase託管,ZooKeeper會隨HBase啓動。

用如下命令啓動HBase:

bin/start-hbase.sh
這個腳本在HBASE_HOME目錄裏面。

你現在已經啓動HBase了。HBase把log記在 logs 子目錄裏面. 當HBase啓動出問題的時候,可以看看Log.

HBase也有一個界面,上面會列出重要的屬性。默認是在Master的60010端口上 (HBase RegionServers 會默認綁定 60020端口,在端口60030上有一個展示信息的界面)。如果Master運行在 master.example.org,端口是默認的話,你可以用瀏覽器在 http://master.example.org:60010 看到主界面。

一旦HBase啓動,參見Section 1.2.3, “Shell 練習”可以看到如何建表,插入數據,scan你的表,還有disable這個表,最後把它刪掉。

可以在HBase Shell停止HBase

$ ./bin/stop-hbase.sh stopping hbase...............

停止操作需要一些時間,你的集羣越大,停的時間可能會越長。如果你正在運行一個分佈式的操作,要確認在HBase徹底停止之前,Hadoop不能停.



 

2.3. 配置文件

 

HBase的配置系統和Hadoop一樣。在conf/hbase-env.sh配置系統的部署信息和環境變量。 -- 這個配置會被啓動shell使用 -- 然後在XML文件裏配置信息,覆蓋默認的配置。告知HBase使用什麼目錄地址,ZooKeeper的位置等等信息。 [10] .

當你使用分佈式模式的時候,每當編輯完一個文件之後,記得要把這個文件複製到整個集羣的conf 目錄下。HBase不會幫你做這些,你得用 rsync.

2.3.1. hbase-site.xml 和 hbase-default.xml

正如Hadoop放置HDFS的配置文件hdfs-site.xml,HBase的配置文件是 conf/hbase-site.xml. 你可以在 Section 2.3.1.1, “HBase 默認配置”找到配置的屬性列表。你也可以查看代碼裏面的hbase-default.xml文件,它在src/main/resources目錄下。

不是所有的配置項都會出現在 hbase-default.xml 裏。一些很少有人會改動的配置只存在於代碼中,要了解這些配置,唯一的辦法是閱讀源代碼本身。

要注意的是,要重啓集羣才能使配置生效。

2.3.1.1. HBase 默認配置

HBase 默認配置

該文檔是用hbase默認配置文件生成的,文件源是 hbase-default.xml

hbase.rootdir

這個目錄是region server的共享目錄,用來持久化HBase。URL需要是'完全正確'的,還要包含文件系統的scheme。例如,要表示hdfs中的'/hbase'目錄,namenode 運行在namenode.example.org的9000端口,則需要設置爲hdfs://namenode.example.org:9000/hbase。默認情況下HBase是寫到/tmp的。不改這個配置,數據會在重啓的時候丟失。

默認: file:///tmp/hbase-${user.name}/hbase

hbase.master.port

HBase的Master的端口.

默認: 60000

hbase.cluster.distributed

HBase的運行模式。false是單機模式,true是分佈式模式。若爲false,HBase和Zookeeper會運行在同一個JVM裏面。

默認: false

hbase.tmp.dir

本地文件系統的臨時文件夾。可以修改到一個更爲持久的目錄上。(/tmp會在重啓時清楚)

默認: /tmp/hbase-${user.name}

hbase.master.info.port

HBase Master web 界面端口. 設置爲-1 意味着你不想讓他運行。

默認: 60010

hbase.master.info.bindAddress

HBase Master web 界面綁定的地址

默認: 0.0.0.0

hbase.client.write.buffer

HTable客戶端的寫緩衝的默認大小。這個值越大,需要消耗的內存越大。因爲緩衝在客戶端和服務端都有實例,所以需要消耗客戶端和服務端兩個地方的內存。得到的好處是,可以減少RPC的次數。可以這樣估算服務器端被佔用的內存: hbase.client.write.buffer * hbase.regionserver.handler.count

默認: 2097152
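
按缺省值粗略估算一下這個公式:2,097,152 字節的寫緩衝 × 10 個 handler ≈ 20 MB 的服務端內存(這只是一個示意性的估算,實際佔用還取決於併發客戶端的數量)。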

hbase.regionserver.port

HBase RegionServer綁定的端口

默認: 60020

hbase.regionserver.info.port

HBase RegionServer web 界面綁定的端口。設置爲 -1 意味着你不想運行 RegionServer 界面.

默認: 60030

hbase.regionserver.info.port.auto

Master或RegionServer是否要動態搜一個可以用的端口來綁定界面。當hbase.regionserver.info.port已經被佔用的時候,可以搜一個空閒的端口綁定。這個功能在測試的時候很有用。默認關閉。

默認: false

hbase.regionserver.info.bindAddress

HBase RegionServer web 界面的IP地址

默認: 0.0.0.0

hbase.regionserver.class

RegionServer 使用的接口。客戶端打開代理來連接region server的時候會使用到。

默認: org.apache.hadoop.hbase.ipc.HRegionInterface

hbase.client.pause

通常的客戶端暫停時間。最多的用法是客戶端在重試前的等待時間。比如失敗的get操作和region查詢操作等都很可能用到。

默認: 1000

hbase.client.retries.number

最大重試次數。例如 region查詢,Get操作,Update操作等等都可能發生錯誤,需要重試。這是最大重試錯誤的值。

默認: 10

hbase.client.scanner.caching

當調用Scanner的next方法,而值又不在緩存裏的時候,從服務端一次獲取的行數。越大的值意味着Scanner會快一些,但是會佔用更多的內存。當緩衝被佔滿的時候,next方法調用會越來越慢。慢到一定程度,可能會導致超時。例如超過了hbase.regionserver.lease.period。

默認: 1
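
下面是一個簡單的客戶端示例草稿,演示如何只針對某一次掃描調高該值(其中表名 'myTable' 和列族 'cf' 只是假設的名字,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCachingExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "myTable");   // 假設的表名
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));           // 假設的列族
    scan.setCaching(100);                          // 每次 next() 從服務端預取 100 行,覆蓋缺省值 1
    ResultScanner rs = htable.getScanner(scan);
    try {
      for (Result r = rs.next(); r != null; r = rs.next()) {
        // 處理每一行 ...
      }
    } finally {
      rs.close();                                  // 一定要關閉 ResultScanner
      htable.close();
    }
  }
}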

hbase.client.keyvalue.maxsize

一個KeyValue實例的最大size.這個是用來設置存儲文件中的單個entry的大小上界。因爲一個KeyValue是不能分割的,所以可以避免因爲數據過大導致region不可分割。明智的做法是把它設爲可以被最大region size整除的數。如果設置爲0或者更小,就會禁用這個檢查。默認10MB。

默認: 10485760

hbase.regionserver.lease.period

客戶端租用HRegion server 期限,即超時閾值。單位是毫秒。默認情況下,客戶端必須在這個時間內發一條信息,否則視爲死掉。

默認: 60000

hbase.regionserver.handler.count

RegionServers受理的RPC Server實例數量。對於Master來說,這個屬性是Master受理的handler數量

默認: 10

hbase.regionserver.msginterval

RegionServer 發消息給 Master 時間間隔,單位是毫秒

默認: 3000

hbase.regionserver.optionallogflushinterval

將Hlog同步到HDFS的間隔。如果Hlog沒有積累到一定的數量,到了時間,也會觸發同步。默認是1秒,單位毫秒。

默認: 1000

hbase.regionserver.regionSplitLimit

region的數量到了這個值後就不會再分裂了。這不是一個region數量的硬性限制,但是起到了一定指導性的作用,到了這個值就該停止分裂了。默認是MAX_INT,就是說不阻止分裂。

默認: 2147483647

hbase.regionserver.logroll.period

滾動(roll)commit log的週期,不管其中有沒有寫入足夠多的數據。

默認: 3600000

hbase.regionserver.hlog.reader.impl

HLog file reader 的實現.

默認: org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader

hbase.regionserver.hlog.writer.impl

HLog file writer 的實現.

默認: org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter

hbase.regionserver.thread.splitcompactcheckfrequency

region server 多久執行一次split/compaction 檢查.

默認: 20000

hbase.regionserver.nbreservationblocks

儲備的內存block的數量(譯者注:就像石油儲備一樣)。當發生out of memory 異常的時候,我們可以用這些內存在RegionServer停止之前做清理操作。

默認: 4

hbase.zookeeper.dns.interface

當使用DNS的時候,Zookeeper用來上報的IP地址的網絡接口名字。

默認: default

hbase.zookeeper.dns.nameserver

當使用DNS的時候,ZooKeeper使用的DNS的域名或者IP 地址,ZooKeeper用它來確定和master用來進行通訊的域名.

默認: default

hbase.regionserver.dns.interface

當使用DNS的時候,RegionServer用來上報的IP地址的網絡接口名字。

默認: default

hbase.regionserver.dns.nameserver

當使用DNS的時候,RegionServer使用的DNS的域名或者IP 地址,RegionServer用它來確定和master用來進行通訊的域名.

默認: default

hbase.master.dns.interface

當使用DNS的時候,Master用來上報的IP地址的網絡接口名字。

默認: default

hbase.master.dns.nameserver

當使用DNS的時候,RegionServer使用的DNS的域名或者IP 地址,Master用它來確定用來進行通訊的域名.

默認: default

hbase.balancer.period

Master執行region balancer的間隔。

默認: 300000

hbase.regions.slop

當任一regionserver有average + (average * slop)個region時會執行Rebalance

默認: 0

hbase.master.logcleaner.ttl

Hlog存在於.oldlogdir 文件夾的最長時間, 超過了就會被 Master 的線程清理掉.

默認: 600000

hbase.master.logcleaner.plugins

LogsCleaner服務會執行的一組LogCleanerDelegat。值用逗號間隔的文本表示。這些WAL/HLog cleaners會按順序調用。可以把先調用的放在前面。你可以實現自己的LogCleanerDelegat,加到Classpath下,然後在這裏寫下類的全稱。一般都是加在默認值的前面。

默認: org.apache.hadoop.hbase.master.TimeToLiveLogCleaner

hbase.regionserver.global.memstore.upperLimit

單個region server的全部memstores的最大值。超過這個值,一個新的update操作會被掛起,強制執行flush操作。

默認: 0.4

hbase.regionserver.global.memstore.lowerLimit

當強制執行flush操作的時候,當低於這個值的時候,flush會停止。默認是堆大小的 35% . 如果這個值和 hbase.regionserver.global.memstore.upperLimit 相同就意味着當update操作因爲內存限制被掛起時,會盡量少的執行flush(譯者注:一旦執行flush,值就會比下限要低,不再執行)

默認: 0.35

hbase.server.thread.wakefrequency

service工作的sleep間隔,單位毫秒。 可以作爲service線程的sleep間隔,比如log roller.

默認: 10000

hbase.hregion.memstore.flush.size

當memstore的大小超過這個值的時候,會flush到磁盤。這個值被一個線程每隔hbase.server.thread.wakefrequency檢查一下。

默認: 67108864

hbase.hregion.preclose.flush.size

當一個region中的memstore的大小大於這個值的時候,我們又觸發了close.會先運行“pre-flush”操作,清理這個需要關閉的memstore,然後將這個region下線。當一個region下線了,我們無法再進行任何寫操作。如果一個memstore很大的時候,flush操作會消耗很多時間。"pre-flush"操作意味着在region下線之前,會先把memstore清空。這樣在最終執行close操作的時候,flush操作會很快。

默認: 5242880

hbase.hregion.memstore.block.multiplier

如果memstore有hbase.hregion.memstore.block.multiplier倍數的hbase.hregion.flush.size的大小,就會阻塞update操作。這是爲了預防在update高峯期會導致的失控。如果不設上界,flush的時候會花很長的時間來合併或者分割,最壞的情況就是引發out of memory異常。(譯者注:內存操作的速度和磁盤不匹配,需要等一等。原文似乎有誤)

默認: 2

hbase.hregion.memstore.mslab.enabled

體驗特性:啓用memStore分配本地緩衝區。這個特性是爲了防止在大量寫負載的時候堆的碎片過多。這可以減少GC操作的頻率。(GC有可能會Stop the world)(譯者注:實現的原理相當於預分配內存,而不是每一個值都要從堆裏分配)

默認: false

hbase.hregion.max.filesize

最大HStoreFile大小。若某個Column families的HStoreFile增長達到這個值,這個HRegion會被切割成兩個。 Default: 256M.

默認: 268435456

hbase.hstore.compactionThreshold

當一個HStore含有多於這個值的HStoreFiles(每一個memstore flush產生一個HStoreFile)的時候,會執行一個合併操作,把這HStoreFiles寫成一個。這個值越大,需要合併的時間就越長。

默認: 3

hbase.hstore.blockingStoreFiles

當一個HStore含有多於這個值的HStoreFiles(每一個memstore flush產生一個HStoreFile)的時候,會執行一個合併操作,update會阻塞直到合併完成,直到超過了hbase.hstore.blockingWaitTime的值

默認: 7

hbase.hstore.blockingWaitTime

hbase.hstore.blockingStoreFiles所限制的StoreFile數量會導致update阻塞,這個時間是來限制阻塞時間的。當超過了這個時間,HRegion會停止阻塞update操作,不過合併還有沒有完成。默認爲90s.

默認: 90000

hbase.hstore.compaction.max

每個“小”合併的HStoreFiles最大數量。

默認: 10

hbase.hregion.majorcompaction

一個Region中的所有HStoreFile的major compactions的時間間隔。默認是1天。 設置爲0就是禁用這個功能。

默認: 86400000

hbase.mapreduce.hfileoutputformat.blocksize

MapReduce中HFileOutputFormat可以寫 storefiles/hfiles. 這個值是hfile的blocksize的最小值。通常在HBase寫Hfile的時候,blocksize是由table schema(HColumnDescriptor)決定的,但是在mapreduce寫的時候,我們無法獲取schema中blocksize。這個值越小,你的索引就越大,你隨機訪問需要獲取的數據就越小。如果你的cell都很小,而且你需要更快的隨機訪問,可以把這個值調低。

默認: 65536

hfile.block.cache.size

分配給HFile/StoreFile的block cache佔最大堆(-Xmx setting)的比例。默認是20%,設置爲0就是不分配。

默認: 0.2

hbase.hash.type

哈希函數使用的哈希算法。可以選擇兩個值:: murmur (MurmurHash) 和 jenkins (JenkinsHash). 這個哈希是給 bloom filters用的.

默認: murmur

hbase.master.keytab.file

HMaster server驗證登錄使用的kerberos keytab 文件路徑。(譯者注:HBase使用Kerberos實現安全)

默認:

hbase.master.kerberos.principal

例如. "hbase/[email protected]". HMaster運行需要使用 kerberos principal name. principal name 可以在: user/hostname@DOMAIN 中獲取. 如果 "_HOST" 被用做hostname portion,需要使用實際運行的hostname來替代它。

默認:

hbase.regionserver.keytab.file

HRegionServer驗證登錄使用的kerberos keytab 文件路徑。

默認:

hbase.regionserver.kerberos.principal

例如. "hbase/[email protected]". HRegionServer運行需要使用 kerberos principal name. principal name 可以在: user/hostname@DOMAIN 中獲取. 如果 "_HOST" 被用做hostname portion,需要使用實際運行的hostname來替代它。在這個文件中必須要有一個entry來描述 hbase.regionserver.keytab.file

默認:

zookeeper.session.timeout

ZooKeeper 會話超時。HBase把這個值傳遞給zk集羣,向它推薦一個會話的最大超時時間。詳見http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions "The client sends a requested timeout, the server responds with the timeout that it can give the client. "。 單位是毫秒

默認: 180000

zookeeper.znode.parent

ZooKeeper中的HBase的根ZNode。所有的HBase的ZooKeeper會用這個目錄配置相對路徑。默認情況下,所有的HBase的ZooKeeper文件路徑是用相對路徑,所以他們會都去這個目錄下面。

默認: /hbase

zookeeper.znode.rootserver

ZNode 保存的 根region的路徑. 這個值是由Master來寫,client和regionserver 來讀的。如果設爲一個相對地址,父目錄就是 ${zookeeper.znode.parent}.默認情形下,意味着根region的路徑存儲在/hbase/root-region-server.

默認: root-region-server

hbase.zookeeper.quorum

Zookeeper集羣的地址列表,用逗號分割。例如:"host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".默認是localhost,是給僞分佈式用的。要修改才能在完全分佈式的情況下使用。如果在hbase-env.sh設置了HBASE_MANAGES_ZK,這些ZooKeeper節點就會和HBase一起啓動。

默認: localhost

hbase.zookeeper.peerport

ZooKeeper節點使用的端口。詳細參見:http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper

默認: 2888

hbase.zookeeper.leaderport

ZooKeeper用來選擇Leader的端口,詳細參見:http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper

默認: 3888

hbase.zookeeper.property.initLimit

ZooKeeper的zoo.conf中的配置。 初始化synchronization階段的ticks數量限制

默認: 10

hbase.zookeeper.property.syncLimit

ZooKeeper的zoo.conf中的配置。 發送一個請求到獲得承認之間的ticks的數量限制

默認: 5

hbase.zookeeper.property.dataDir

ZooKeeper的zoo.conf中的配置。 快照的存儲位置

默認: ${hbase.tmp.dir}/zookeeper

hbase.zookeeper.property.clientPort

ZooKeeper的zoo.conf中的配置。 客戶端連接的端口

默認: 2181

hbase.zookeeper.property.maxClientCnxns

ZooKeeper的zoo.conf中的配置。 ZooKeeper集羣中的單個節點接受的單個Client(以IP區分)的請求的併發數。這個值可以調高一點,防止在單機和僞分佈式模式中出問題。

默認: 2000

hbase.rest.port

HBase REST server的端口

默認: 8080

hbase.rest.readonly

定義REST server的運行模式。可以設置成如下的值: false: 所有的HTTP請求都是被允許的 - GET/PUT/POST/DELETE. true:只有GET請求是被允許的

默認: false

2.3.2. hbase-env.sh

在這個文件裏面設置HBase環境變量。比如可以配置JVM啓動的堆大小或者GC的參數。你還可在這裏配置HBase的參數,如Log位置,niceness(譯者注:優先級),ssh參數還有pid文件的位置等等。打開文件conf/hbase-env.sh細讀其中的內容。每個選項都是有詳盡的註釋的。你可以在此添加自己的環境變量。

這個文件的改動需要重啓HBase才能生效。

2.3.3. log4j.properties

編輯這個文件可以改變HBase的日誌的級別,輪滾策略等等。

這個文件的改動需要重啓HBase才能生效。 日誌級別的更改會影響到HBase UI

2.3.4. 連接HBase集羣的客戶端配置和依賴

因爲HBase的Master有可能轉移,所以客戶端需要訪問ZooKeeper來獲得Master現在的位置。ZooKeeper會保存這些值。因此客戶端必須知道Zookeeper集羣的地址,否則做不了任何事情。通常這個地址存在 hbase-site.xml 裏面,客戶端可以從CLASSPATH取出這個文件.

如果你是使用一個IDE來運行HBase客戶端,你需要將conf/放入你的 classpath,這樣 hbase-site.xml就可以找到了,(或者把hbase-site.xml放到 src/test/resources,這樣測試的時候可以使用).

HBase客戶端最小化的依賴是 hbase, hadoop, log4j, commons-logging, commons-lang, 和 ZooKeeper ,這些jars 需要能在 CLASSPATH 中找到。

下面是一個基本的客戶端 hbase-site.xml 例子:

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.zookeeper.quorum</name> <value>example1,example2,example3</value> <description>The directory shared by region servers. </description> </property> </configuration>

2.3.4.1. Java客戶端配置

Java是如何讀到hbase-site.xml 的內容的

Java客戶端使用的配置信息是被映射在一個HBaseConfiguration 實例中. HBaseConfiguration有一個工廠方法, HBaseConfiguration.create();,運行這個方法的時候,它會去CLASSPATH下找hbase-site.xml,讀取它發現的第一個配置文件的內容。 (這個方法還會去找hbase-default.xml ; hbase.X.X.X.jar裏面也會有一個hbase-default.xml). 不使用任何hbase-site.xml文件直接通過Java代碼注入配置信息也是可以的。例如,你可以用編程的方式設置ZooKeeper信息,只要這樣做:

Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost");  // Here we are running zookeeper locally

如果有多ZooKeeper實例,你可以使用逗號列表。(就像在hbase-site.xml 文件中做得一樣). 這個 Configuration 實例會被傳遞到 HTable, 之類的實例裏面去.
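
下面是一個簡單的示例草稿,演示用逗號列表設置多個ZooKeeper節點,並把 Configuration 傳給 HTable(其中主機名 example1,example2,example3 和表名 'myTable' 均爲假設,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientConfigExample {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();
    // 逗號分隔的 ZooKeeper 節點列表(假設的主機名)
    config.set("hbase.zookeeper.quorum", "example1,example2,example3");
    // 該 Configuration 實例會被傳遞到 HTable 之類的實例裏面去
    HTable htable = new HTable(config, "myTable");   // 假設的表名
    try {
      // 在這裏使用 htable 進行 Get/Put/Scan 等操作 ...
    } finally {
      htable.close();
    }
  }
}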



2.4. 配置示例

2.4.1. 簡單的分佈式HBase安裝

這裏是一個10節點的HBase的簡單示例,這裏的配置都是基本的,節點名爲 example0、example1... 一直到 example9 . HBase Master 和 HDFS namenode 運行在同一個節點 example0上. RegionServers 運行在節點 example1-example9. 一個 3-節點 ZooKeeper 集羣運行在example1、example2 和 example3,端口保持默認. ZooKeeper 的數據保存在目錄 /export/zookeeper. 下面我們展示主要的配置文件-- hbase-site.xml、regionservers 和 hbase-env.sh -- 這些文件可以在 conf目錄找到.

2.4.1.1. hbase-site.xml

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hbase.zookeeper.quorum</name> <value>example1,example2,example3</value> <description>The directory shared by RegionServers. </description> </property> <property> <name>hbase.zookeeper.property.dataDir</name> <value>/export/zookeeper</value> <description>Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored. </description> </property> <property> <name>hbase.rootdir</name> <value>hdfs://example0:9000/hbase</value> <description>The directory shared by RegionServers. </description> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> <description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh) </description> </property> </configuration>

2.4.1.2. regionservers

這個文件把RegionServer的節點列了下來。在這個例子裏面我們讓所有的節點都運行RegionServer,除了第一個節點 example0,它要運行 HBase Master 和 HDFS namenode。

example1
example2
example3
example4
example5
example6
example7
example8
example9

2.4.1.3. hbase-env.sh

下面我們用diff 命令來展示 hbase-env.sh 文件相比默認變化的部分. 我們把HBase的堆內存設置爲4G而不是默認的1G.

$ git diff hbase-env.sh
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
index e70ebc6..96f8c27 100644
--- a/conf/hbase-env.sh
+++ b/conf/hbase-env.sh
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib//jvm/java-6-sun/
 # export HBASE_CLASSPATH=
 # The maximum amount of heap to use, in MB. Default is 1000.
-# export HBASE_HEAPSIZE=1000
+export HBASE_HEAPSIZE=4096
 # Extra Java runtime options.
 # Below are what we set by default. May only work with SUN JVM.

你可以使用 rsync 來同步 conf 文件夾到你的整個集羣.



2.5. 重要的配置

下面我們會列舉重要的配置. 這個章節講述必須的配置和那些值得一看的配置。(譯者注:淘寶的博客也有本章節的內容,HBase性能調優,很詳盡)。

2.5.1. 必須的配置

參考 Section 2.2, “操作系統” 和 Section 2.3, “Hadoop” 節.

2.5.2. 推薦配置

2.5.2.1. zookeeper.session.timeout

這個默認值是3分鐘。這意味着一旦一個server宕掉了,Master至少需要3分鐘才能察覺到宕機,開始恢復。你可能希望將這個超時調短,這樣Master就能更快的察覺到了。在你調這個值之前,你需要確認你的JVM的GC參數,否則一個長時間的GC操作就可能導致超時。(當一個RegionServer在運行一個長時間的GC的時候,你可能想要重啓並恢復它).

要想改變這個配置,可以編輯 hbase-site.xml, 將配置部署到全部集羣,然後重啓。

我們之所以把這個值調的很高,是因爲我們不想一天到晚在論壇裏回答新手的問題。“爲什麼我在執行一個大規模數據導入的時候Region Server死掉啦”,通常這樣的問題是因爲長時間的GC操作引起的,他們的JVM沒有調優。我們是這樣想的,如果一個人對HBase不很熟悉,不能期望他知道所有,打擊他的自信心。等到他逐漸熟悉了,他就可以自己調這個參數了。

2.5.2.2.  ZooKeeper 實例個數

參考 Section 2.5, “ZooKeeper”.

2.5.2.3. hbase.regionserver.handler.count

這個設置決定了處理用戶請求的線程數量。默認是10,這個值設的比較小,主要是爲了預防用戶用一個比較大的寫緩衝,然後還有很多客戶端併發,這樣region servers會垮掉。有經驗的做法是,當請求內容很大(上MB,如大puts, 使用緩存的scans)的時候,把這個值放低。請求內容較小的時候(gets, 小puts, ICVs, deletes),把這個值放大。

當客戶端的請求內容很小的時候,把這個值設置的和最大客戶端數量一樣是很安全的。一個典型的例子就是一個給網站服務的集羣,put操作一般不會緩衝,絕大多數的操作是get操作。

把這個值放大的危險之處在於,把所有的Put操作緩衝意味着對內存有很大的壓力,甚至會導致OutOfMemory.一個運行在內存不足的機器的RegionServer會頻繁的觸發GC操作,漸漸就能感受到停頓。(因爲所有請求內容所佔用的內存不管GC執行幾遍也是不能回收的)。一段時間後,集羣也會受到影響,因爲所有的指向這個region的請求都會變慢。這樣就會拖累集羣,加劇了這個問題。

要判斷handler數量是太少還是太多,可以參考 Section 12.2.2.1, “啓用 RPC級 日誌”,在單個RegionServer上啓用RPC級日誌,然後觀察日誌末尾(請求隊列也會消耗內存)。

2.5.2.4. 大內存機器的配置

HBase有一個合理的保守的配置,這樣可以運作在所有的機器上。如果你有臺大內存的集羣-HBase有8G或者更大的heap,接下來的配置可能會幫助你. TODO.

2.5.2.5. 壓縮

應該考慮啓用ColumnFamily 壓縮。有好幾個選項,通過降低存儲文件大小以降低IO,降低消耗且大多情況下提高性能。

參考 Appendix C,  HBase壓縮  獲取更多信息.

2.5.2.6. 較大 Regions

更大的Region可以使你集羣上的Region的總數量較少。 一般來言,更少的Region可以使你的集羣運行更加流暢。(你可以自己隨時手工將大Region切割,這樣單個熱點Region就會被分佈在集羣的更多節點上)。

較少的Region較好。一般每個RegionServer在20到小几百之間。 調整Region大小以適合該數字。

 

0.90.x 版本中,默認情況下單個Region是256MB,Region 大小的上界大約是 4Gb。0.92.x 版本由於使用了 HFile v2,單個Region可以支持得大很多 (如 20Gb)。

可能需要實驗,基於硬件和應用需要進行配置。

可以調整hbase-site.xml中的 hbase.hregion.max.filesize屬性. RegionSize 也可以基於每個表設置:  HTableDescriptor.
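
下面是一個在建表時針對單個表設置 region 上限大小的簡單示例草稿(表名 'myTable'、列族 'cf' 以及 10GB 的取值只是假設,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RegionSizeExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("myTable");   // 假設的表名
    desc.addFamily(new HColumnDescriptor("cf"));               // 假設的列族
    // 僅對這個表把 region 上限設爲 10GB,效果相當於表級的 hbase.hregion.max.filesize
    desc.setMaxFileSize(10L * 1024 * 1024 * 1024);
    admin.createTable(desc);
  }
}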

2.5.2.7. 管理 Splitting

除了讓HBase自動切割你的Region,你也可以手動切割。 [12] 隨着數據量的增大,split會被持續執行。如果你需要知道你現在有幾個region,比如長時間的debug或者做調優,你需要手動切割。通過跟蹤日誌來了解region級的問題是很難的,因爲它在不停的切割和重命名。data offlining bug和未知數量的region會讓你沒有辦法。如果一個 HLog 或者 StoreFile由於一個奇怪的bug,HBase沒有執行它,等到一天之後你才發現這個問題,你可以確保現在的regions和那個時候的一樣,這樣你就可以restore或者replay這些數據。你還可以調優你的合併算法。如果數據是均勻的,隨着數據增長,很容易導致split / compaction瘋狂的運行,因爲所有的region都是差不多大的。用手動切割,你就可以交錯執行定時的合併和切割操作,降低IO負載。

爲什麼我關閉自動split呢?因爲自動的split是由配置文件中的 hbase.hregion.max.filesize決定的. 把它設置成Long.MAX_VALUE是不推薦的做法,要是你忘記了手工切割怎麼辦.推薦的做法是設置成100GB,一旦到達這樣的值,至少需要一個小時執行 major compactions。

那麼pre-split 時最佳的region數量是多少呢?這取決於你的應用程序。你可以先從低的開始,比如每個server 10個pre-split regions,然後花時間觀察數據增長。region太少至少比region劃分出錯好,而且你可以之後再rolling split。一個更復雜的答案是,這個值取決於你的region中的最大的storefile。隨着數據的增大,這個也會跟着增大。 你可以當這個文件足夠大的時候,用一個定時的操作使用Store的合併選擇算法(compact selection algorithm)來僅合併這一個HStore。如果你不這樣做,這個算法會啓動一個 major compaction,很多region會受到影響,你的集羣會瘋狂的運行。需要注意的是,這樣的瘋狂合併操作是數據增長造成的,而不是手動分割操作決定的。

如果你 pre-split 導致 regions 很小,你可以通過配置HConstants.MAJOR_COMPACTION_PERIOD把你的major compaction參數調大

如果你的數據變得太大,可以使用org.apache.hadoop.hbase.util.RegionSplitter 腳本來執行針對全部集羣的一個網絡IO安全的rolling split操作。

 

2.5.2.8. 管理 Compactions

通常的管理技術是手動管理主壓縮(major compactions), 而不是讓HBase 來做。 缺省HConstants.MAJOR_COMPACTION_PERIOD 是一天。主壓縮可能會在你並不太希望發生的時候強行進行——特別是在一個繁忙的系統上。要關閉自動主壓縮,可以把該值設爲0。

重點強調,主壓縮對存儲文件(StoreFile)清理是絕對必要的。唯一變量是發生的時間。可以通過HBase shell進行管理,或通過 HBaseAdmin.

更多信息關於壓縮和壓縮文件選擇過程,參考 Section 9.7.5.5, “壓縮”

2.5.2.9.  預測執行 (Speculative Execution)

MapReduce任務的預測執行缺省是打開的,HBase集羣一般建議在系統級關閉預測執行,除非在某種特殊情況下需要打開,此時可以每任務配置。設置mapred.map.tasks.speculative.execution 和 mapred.reduce.tasks.speculative.execution 爲 false.
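
如果只想對單個 MapReduce 任務關閉預測執行,可以在提交任務前設置這兩個屬性。下面是一個簡單的示例草稿(假設採用本文所述的老式 mapred.* 屬性名,之後如何創建和提交 Job 由你的應用決定):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SpeculativeExecutionConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // 在提交 MapReduce 任務前,按任務關閉 map 和 reduce 的預測執行
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    // 之後用這個 conf 創建 Job 並提交 ...
  }
}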

2.5.3. 其他配置

2.5.3.1. 負載均衡

負載均衡器(LoadBalancer)是在主服務器上運行的定期操作,以重新分佈集羣區域。通過hbase.balancer.period 設置,缺省值300000 (5 分鐘).

參考 Section 9.5.4.1, “負載均衡” 獲取關於負載均衡器( LoadBalancer )的更多信息。

2.5.3.2. 禁止塊緩存(Blockcache)

不要關閉塊緩存 (即不要把hbase.block.cache.size 設爲 0)。當前如果關閉塊緩存會很不好,因爲區域服務器會花很多時間不停加載hfile索引。如果你的工作集決定了塊緩存對你沒有好處,那至少也應把塊緩存設置到足以容納hfile索引的大小(可以通過查看區域服務器的UI大致估算需要的大小,網頁的上方會顯示塊索引的統計值)。

2.5.3.3. Nagle's or the small package problem

If a big 40ms or so occasional delay is seen in operations against HBase, try the Nagles' setting. For example, see the user mailing list thread, Inconsistent scan performance with caching set to 1 and the issue cited therein where setting notcpdelay improved scan speeds. You might also see the graphs on the tail of HBASE-7008 Set scanner caching to a better default where our Lars Hofhansl tries various data sizes w/ Nagle's on and off measuring the effect.




 

[1] Be careful editing XML. Make sure you close all elements. Run your file through xmllint or similar to ensure well-formedness of your document after an edit session.

[2] The hadoop-dns-checker tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage.

[3] 參考 Jack Levin's major hdfs issues note up on the user list.

[4] The requirement that a database requires upping of system limits is not peculiar to HBase. 參考 for example the section Setting Shell Limits for the Oracle User inShort Guide to install Oracle 10 on Linux.

[5] A useful read setting config on you hadoop cluster is Aaron Kimballs' Configuration Parameters: What can you just ignore?

[6] <title>On Hadoop Versions</title>

[6] The Cloudera blog post An update on Apache Hadoop 1.0 by Charles Zedlweski has a nice exposition on how all the Hadoop versions relate. Its worth checking out if you are having trouble making sense of the Hadoop version morass.

[7] Until recently only the branch-0.20-append branch had a working sync but no official release was ever made from this branch. You had to build it yourself. Michael Noll wrote a detailed blog, Building an Hadoop 0.20.x version for HBase 0.90.2, on how to build an Hadoop from branch-0.20-append. Recommended.

[8] Praveen Kumar has written a complimentary article, Building Hadoop and HBase for HBase Maven application development.

[9] dfs.support.append

[10] 參考 Hadoop HDFS: Deceived by Xciever for an informative rant on xceivering.

[11] The pseudo-distributed vs fully-distributed nomenclature comes from Hadoop.

[12] 參考 Section 2.4.2.1.2, “Pseudo-distributed Extras” for notes on how to start extra Masters and RegionServers when running pseudo-distributed.

[13] 對 ZooKeeper 全部配置,參考ZooKeeper 的zoo.cfg. HBase 沒有包含 zoo.cfg ,所以需要瀏覽合適的獨立ZooKeeper下載版本的 conf 目錄找到。

[14] What follows is taken from the javadoc at the head of the org.apache.hadoop.hbase.util.RegionSplitter tool added to HBase post-0.90.0 release.

 

Chapter 3. 升級

You cannot skip major verisons upgrading. If you are upgrading from version 0.20.x to 0.92.x, you must first go from 0.20.x to 0.90.x and then go from 0.90.x to 0.92.x.

參見 Section 2, “配置”, 需要特別注意有關Hadoop 版本的信息.

3.1. 從 0.94.x 升級到 0.96.x

The Singularity

You will have to stop your old 0.94 cluster completely to upgrade. If you are replicating between clusters, both clusters will have to go down to upgrade. Make sure it is a clean shutdown so there are no WAL files laying around (TODO: Can 0.96 read 0.94 WAL files?). Make sure zookeeper is cleared of state. All clients must be upgraded to 0.96 too.

The API has changed in a few areas; in particular how you use coprocessors (TODO: MapReduce too?)

3.2. 從 0.92.x 升級到 0.94.x

0.92 和 0.94 接口兼容,可平滑升級。

 

3.3. 從 0.90.x 到 0.92.x 升級

升級指引

You will find that 0.92.0 runs a little differently to 0.90.x releases. Here are a few things to watch out for upgrading from 0.90.x to 0.92.0.

 

If you've not patience, here are the important things to know upgrading.

  1. Once you upgrade, you can’t go back.
  2. MSLAB is on by default. Watch that heap usage if you have a lot of regions.
  3. Distributed splitting is on by default. It should make region server failover faster.
  4. There’s a separate tarball for security.
  5. If -XX:MaxDirectMemorySize is set in your hbase-env.sh, it’s going to enable the experimental off-heap cache (You may not want this).

3.3.1. 不可回退!

To move to 0.92.0, all you need to do is shutdown your cluster, replace your hbase 0.90.x with hbase 0.92.0 binaries (be sure you clear out all 0.90.x instances) and restart (You cannot do a rolling restart from 0.90.x to 0.92.x -- you must restart). On startup, the .META. table content is rewritten removing the table schema from the info:regioninfo column. Also, any flushes done post first startup will write out data in the new 0.92.0 file format, HFile V2. This means you cannot go back to 0.90.x once you’ve started HBase 0.92.0 over your HBase data directory.

3.3.2. MSLAB 缺省啓用

In 0.92.0, the hbase.hregion.memstore.mslab.enabled flag is set to true (參考 Section 11.3.1.1, “Long GC pauses”). In 0.90.x it was false. When it is enabled, memstores will step allocate memory in MSLAB 2MB chunks even if the memstore has zero or just a few small elements. This is fine usually but if you had lots of regions per regionserver in a 0.90.x cluster (and MSLAB was off), you may find yourself OOME'ing on upgrade because the thousands of regions * number of column families * 2MB MSLAB (at a minimum) puts your heap over the top. Set hbase.hregion.memstore.mslab.enabled to false or set the MSLAB size down from 2MB by setting hbase.hregion.memstore.mslab.chunksize to something less.

3.3.3. 分佈式分割缺省啓用

Previous, WAL logs on crash were split by the Master alone. In 0.92.0, log splitting is done by the cluster (參考 “HBASE-1364 [performance] Distributed splitting of regionserver commit logs”). This should cut down significantly on the amount of time it takes splitting logs and getting regions back online again.

3.3.4. 內存計算改變

In 0.92.0, Appendix E, HFile format version 2 indices and bloom filters take up residence in the same LRU used caching blocks that come from the filesystem. In 0.90.x, the HFile v1 indices lived outside of the LRU so they took up space even if the index was on a ‘cold’ file, one that wasn’t being actively used. With the indices now in the LRU, you may find you have less space for block caching. Adjust your block cache accordingly. 參考 the Section 9.6.4, “Block Cache” for more detail. The block size default size has been changed in 0.92.0 from 0.2 (20 percent of heap) to 0.25.

3.3.5.  可用 Hadoop 版本

Run 0.92.0 on Hadoop 1.0.x (or CDH3u3 when it ships). The performance benefits are worth making the move. Otherwise, our Hadoop prescription is as it has been; you need an Hadoop that supports a working sync. 參考 Section 2.3, “Hadoop”.

If running on Hadoop 1.0.x (or CDH3u3), enable local read. 參考 Practical Caching presentation for ruminations on the performance benefits ‘going local’ (and for how to enable local reads).

3.3.6. HBase 0.92.0 帶 ZooKeeper 3.4.2

If you can, upgrade your zookeeper. If you can’t, 3.4.2 clients should work against 3.3.X ensembles (HBase makes use of 3.4.2 API).

3.3.7. 在線切換缺省關閉

In 0.92.0, we’ve added an experimental online schema alter facility (參考 hbase.online.schema.update.enable). Its off by default. Enable it at your own risk. Online alter and splitting tables do not play well together so be sure your cluster quiescent using this feature (for now).

3.3.8. WebUI

The webui has had a few additions made in 0.92.0. It now shows a list of the regions currently transitioning, recent compactions/flushes, and a process list of running processes (usually empty if all is well and requests are being handled promptly). Other additions including requests by region, a debugging servlet dump, etc.

3.3.9. 安全 tarball

我們發佈兩個tarball: 安全和非安全 HBase. 如何設置安全HBase的文檔正在制定中。

3.3.10. 試驗離堆(off-heap)緩存

(譯者注:on-heap和off-heap是Terracotta 公司提出的概念。on-heap指java對象在GC內存儲管理,效率較高,但GC只能管理2G內存,有時成爲性能瓶頸。off-heap又叫BigMemory ,是JVM的GC機制的替代,在GC外存儲,100倍速於DiskStore,cache量目前(2012年底)達到350GB)

A new cache was contributed to 0.92.0 to act as a solution between using the “on-heap” cache which is the current LRU cache the region servers have and the operating system cache which is out of our control. To enable, set “-XX:MaxDirectMemorySize” in hbase-env.sh to the value for maximum direct memory size and specify hbase.offheapcache.percentage in hbase-site.xml with the percentage that you want to dedicate to off-heap cache. This should only be set for servers and not for clients. Use at your own risk. See this blog post for additional information on this new experimental feature: http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/

3.3.11. HBase 複製的變動

0.92.0 adds two new features: multi-slave and multi-master replication. The way to enable this is the same as adding a new peer, so in order to have multi-master you would just run add_peer for each cluster that acts as a master to the other slave clusters. Collisions are handled at the timestamp level which may or may not be what you want, this needs to be evaluated on a per use case basis. Replication is still experimental in 0.92 and is disabled by default, run it at your own risk.

3.3.12. 對OOME ,RegionServer 現在退出

If an OOME, we now have the JVM kill -9 the regionserver process so it goes down fast. Previous, a RegionServer might stick around after incurring an OOME limping along in some wounded state. To disable this facility, and recommend you leave it in place, you’d need to edit the bin/hbase file. Look for the addition of the -XX:OnOutOfMemoryError="kill -9 %p" arguments (參考 [HBASE-4769] - ‘Abort RegionServer Immediately on OOME’)

3.3.13. HFile V2 和 “更大, 更少” 趨勢

0.92.0 stores data in a new format, Appendix E, HFile format version 2. As HBase runs, it will move all your data from HFile v1 to HFile v2 format. This auto-migration will run in the background as flushes and compactions run. HFile V2 allows HBase run with larger regions/files. In fact, we encourage that all HBasers going forward tend toward Facebook axiom #1, run with larger, fewer regions. If you have lots of regions now -- more than 100s per host -- you should look into setting your region size up after you move to 0.92.0 (In 0.92.0, default size is not 1G, up from 256M), and then running online merge tool (參考 “HBASE-1621 merge tool should work on online cluster, but disabled table”).

 

3.4. 從HBase 0.20.x or 0.89.x 升級到 HBase 0.90.x

0.90.x 版本的HBase可以在 HBase 0.20.x 或者 HBase 0.89.x 寫出的數據上啓動,不需要轉換數據文件。但是 HBase 0.89.x 和 0.90.x 的region目錄名和舊版是不一樣的 -- 新版本用region名的md5 hash 而不是jenkins hash 來命名region目錄 -- 這就意味着,一旦啓動,再也不能回退到 HBase 0.20.x。

在升級的時候,一定要將hbase-default.xml 從你的 conf目錄刪掉。 0.20.x 版本的配置對於 0.90.x HBase不是最佳的. hbase-default.xml 現在已經被打包在 HBase jar 裏面了. 如果你想看看這個文件內容,你可以在src目錄下的 src/main/resources/hbase-default.xml 或者在 Section 2.3.1.1, “HBase 默認配置”看到.

最後,如果從0.20.x升級,需要在shell裏檢查 .META. schema . 過去,我們推薦用戶使用16KB的 MEMSTORE_FLUSHSIZE. 在shell中運行 hbase> scan '-ROOT-'. 會顯示當前的.META. schema. 檢查 MEMSTORE_FLUSHSIZE 的大小. 看看是不是 16KB (16384)? 如果是的話,你需要修改它(默認的值是 64MB (67108864)) 運行腳本bin/set_meta_memstore_size.rb. 這個腳本會修改 .META. schema. 如果不運行的話,集羣會比較慢[15] .



 

Chapter 4.  HBase Shell

HBase Shell 是在(J)Ruby的IRB的基礎上加上了HBase的命令。任何你可以在IRB裏做的事情都可以在HBase Shell中做。

你可以這樣來運行HBase Shell:

$ ./bin/hbase shell

輸入 help 就會返回Shell的命令列表和選項。可以看看在Help文檔尾部的關於如何輸入變量和選項。尤其要注意的是表名,行,列名必須要加引號。

參見 Section 1.2.3, “Shell 練習”可以看到Shell的基本使用例子。

4.1. 使用腳本

如果要使用腳本,可以看HBase的bin 目錄.在裏面找到後綴爲 *.rb的腳本.要想運行這個腳本,要這樣

$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT

就可以了

4.2. Shell 技巧

4.2.1. irbrc

可以在你自己的Home目錄下創建一個.irbrc文件. 在這個文件里加入自定義的命令。有一個有用的命令就是記錄命令歷史,這樣你就可以把你的命令保存起來。

$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"

可以參見 ruby 關於 .irbrc 的文檔來學習更多的關於IRB的配置方法。

4.2.2. LOG 時間轉換

可以將日期'08/08/16 20:56:29'從hbase log 轉換成一個 timestamp, 操作如下:

hbase(main):021:0> import java.text.SimpleDateFormat
hbase(main):022:0> import java.text.ParsePosition
hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:56:29", ParsePosition.new(0)).getTime()
=> 1218920189000

也可以逆過來操作。

hbase(main):021:0> import java.util.Date
hbase(main):022:0> Date.new(1218920189000).toString()
=> "Sat Aug 16 20:56:29 UTC 2008"

要想把日期格式和HBase log格式完全相同,可以參見文檔 SimpleDateFormat.

4.2.3. 調試

4.2.3.1. Shell 切換成debug 模式

你可以將shell切換成debug模式。這樣可以看到更多的信息。 -- 例如可以看到命令異常的stack trace:

hbase> debug <RETURN>

4.2.3.2. DEBUG log level

想要在shell中看到 DEBUG 級別的 logging ,可以在啓動的時候加上 -d 參數.

$ ./bin/hbase shell -d

Chapter 5. 數據模型

簡單來說,應用程序是以表的方式在HBase存儲數據的。表是由行和列構成的,所有的列是從屬於某一個列族的。行和列的交叉點稱之爲cell,cell是版本化的。cell的內容是不可分割的字節數組。

表的行鍵也是一段字節數組,所以任何東西都可以保存進去,不論是字符串或者數字。HBase的表是按key排序的,排序方式只針對字節。所有的表都必須要有主鍵-key.

5.1. 概念視圖

下面是根據BigTable 論文稍加修改的例子。 有一個名爲webtable的表,包含兩個列族:contents和anchor.在這個例子裏面,anchor有兩個列 (anchor:cnnsi.com,anchor:my.look.ca),contents僅有一列(contents:html)

列名

一個列名是由它的列族前綴和修飾符(qualifier)連接而成。例如列contents:html是列族 contents加冒號(:)加 修飾符 html組成的。

Table 5.1. 表 webtable

Row Key Time Stamp ColumnFamily contents ColumnFamily anchor
"com.cnn.www" t9   anchor:cnnsi.com = "CNN"
"com.cnn.www" t8   anchor:my.look.ca = "CNN.com"
"com.cnn.www" t6 contents:html = "<html>..."  
"com.cnn.www" t5 contents:html = "<html>..."  
"com.cnn.www" t3 contents:html = "<html>..."  


5.2. 物理視圖

儘管在概念視圖裏,表可以被看成是一個稀疏的行的集合。但在物理上,它的是區分列族 存儲的。新的columns可以不經過聲明直接加入一個列族.

Table 5.2. ColumnFamily anchor

Row Key Time Stamp Column Family anchor
"com.cnn.www" t9 anchor:cnnsi.com = "CNN"
"com.cnn.www" t8 anchor:my.look.ca = "CNN.com"


Table 5.3. ColumnFamily contents

Row Key Time Stamp ColumnFamily "contents:"
"com.cnn.www" t6 contents:html = "<html>..."
"com.cnn.www" t5 contents:html = "<html>..."
"com.cnn.www" t3 contents:html = "<html>..."


值得注意的是在上面的概念視圖中空白cell在物理上是不存儲的,因爲根本沒有必要存儲。因此若一個請求爲要獲取t8時間的contents:html,它的結果就是空。相似的,若請求爲獲取t9時間的anchor:my.look.ca,結果也是空。但是,如果不指明時間,將會返回每列最新時間的值。例如,如果請求爲獲取行鍵爲"com.cnn.www",沒有指明時間戳的話,返回的結果是t6下的contents:html,t9下的anchor:cnnsi.com和t8下的anchor:my.look.ca。

For more information about the internals of how HBase stores data, see Section 9.7, “Regions”.

5.3. 表

表是在schema聲明的時候定義的。

5.4. 行

行鍵是不可分割的字節數組。行是按字典排序由低到高存儲在表中的。一個空的數組是用來標識表空間的起始或者結尾。

5.5. 列族

在HBase中,列族是一些列的集合。一個列族的所有列成員有着相同的前綴。比如,列courses:history 和 courses:math都是列族 courses的成員.冒號(:)是列族的分隔符,用來區分前綴和列名。column 前綴必須是可打印的字符,剩下的部分(稱爲qualifier),可以由任意字節數組組成。列族必須在表建立的時候聲明,column則不需要,隨時可以新建。

在物理上,同一個列族的成員在文件系統上都是存儲在一起的。因爲存儲優化都是針對列族級別的,這就意味着,一個column family的所有成員是用相同的方式訪問的。

5.6. Cells

{row, column, version} 元組就是一個HBase中的一個 cell。Cell的內容是不可分割的字節數組。

5.7. 數據模型操作

四個主要的數據模型操作是 Get, Put, Scan, 和 Delete. 通過 HTable 實例進行操作.

5.7.1. Get

Get 返回特定行的屬性。 Gets 通過 HTable.get 執行。

5.7.2. Put

Put 要麼向表增加新行 (如果key是新的) 或更新行 (如果key已經存在)。 Puts 通過 HTable.put (writeBuffer) 或 HTable.batch (non-writeBuffer)執行。

5.7.3. Scans

Scan 允許在多行上對指定屬性進行迭代。

下面是一個在 HTable 表實例上的示例。 假設表有幾行鍵值爲 "row1", "row2", "row3", 還有一些行有鍵值 "abc1", "abc2", 和 "abc3". 下面的示例展示startRow 和 stopRow 可以應用到一個Scan 實例,以返回"row"打頭的行。

HTable htable = ...      // instantiate HTable
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
scan.setStartRow(Bytes.toBytes("row"));             // start key is inclusive
scan.setStopRow(Bytes.toBytes("row" + (char)0));    // stop key is exclusive
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}

5.7.4. Delete

Delete 從表中刪除一行. 刪除通過HTable.delete 執行。
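
下面是一個簡單的示例草稿(表名 'myTable'、行鍵 'row1'、列族 'cf' 和列 'attr' 均爲假設,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "myTable");           // 假設的表名
    Delete delete = new Delete(Bytes.toBytes("row1"));     // 刪除整行
    // 也可以只刪除某一列的所有版本:
    // delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
    htable.delete(delete);
    htable.close();
  }
}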

HBase 不會就地修改數據,所以刪除是通過創建名爲墓碑(tombstone)的新標記來處理的。這些墓碑,連同被它們屏蔽的死去的值,會在主壓縮時清除。

參考 Section 5.8.1.5, “Delete” 獲取刪除列版本的更多信息。參考Section 9.7.5.5, “Compaction” 獲取更多有關壓縮的信息。

5.8. 版本

一個 {row, column, version} 元組是HBase中的一個單元(cell).但是有可能會有很多的單元的行和列是相同的,可以使用版本來區分不同的單元.

rows和column key是用字節數組表示的,version則是用一個長整型表示。這個long的值使用 java.util.Date.getTime() 或者 System.currentTimeMillis()產生的。這就意味着他的含義是“當前時間和1970-01-01 UTC的時間差,單位毫秒。”

在HBase中,版本是按倒序排列的,因此當讀取這個文件的時候,最先找到的是最近的版本。

有些人不是很理解HBase單元(cell)的意思。一個常見的問題是:

  • 如果有多個帶有版本的寫操作同時發起,HBase會保存全部還是隻保留最新的一個?[16]

  • 可以發起包含版本的寫操作,但是他們的版本順序和操作順序相反嗎?[17]

下面我們介紹下在HBase中版本是如何工作的。[18].

5.8.1. HBase的操作(包含版本操作)

在這一章我們來仔細看看在HBase的各個主要操作中版本起到了什麼作用。

5.8.1.1. Get/Scan

Get是在Scan的基礎上實現的。下面關於Get的討論同樣適用於Scan。

默認情況下,如果你沒有指定版本,當你使用Get操作的時候,會返回最近版本的Cell(該Cell可能是最新寫入的,但不能保證)。默認的操作可以這樣修改:

  • 如果想要返回兩個以上的版本,參見Get.setMaxVersions()

  • 如果想要返回的版本不只是最近的,參見 Get.setTimeRange()

    要想讓查詢返回的最新版本小於或等於給定的值(也就是說,給定的'最近'時間點可以是過去的某一時刻),可以把時間範圍設置爲0到你想要的時間上限,同時把max versions設置爲1(參見下面的示例)。
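
下面是一個按上述方式查詢的簡單示例草稿(表名 'myTable'、列族 'cf'、列 'attr' 和時間上限均爲假設,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetAsOfTimeExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "myTable");     // 假設的表名
    long asOf = 1288380727188L;                      // 假設的時間上限(毫秒)
    Get get = new Get(Bytes.toBytes("row1"));
    get.setTimeRange(0, asOf + 1);                   // 只看 [0, asOf] 範圍內的版本(上界是排他的)
    get.setMaxVersions(1);                           // 只要該範圍內最新的一個版本
    Result r = htable.get(get);
    byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
    htable.close();
  }
}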

5.8.1.2. 默認 Get 例子

下面的Get操作會只獲得最新的一個版本。

Get get = new Get(Bytes.toBytes("row1")); Result r = htable.get(get); byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value

5.8.1.3. 含有的版本的Get例子

下面的Get操作會獲得最近的3個版本。

Get get = new Get(Bytes.toBytes("row1")); get.setMaxVersions(3); // will return last 3 versions of row Result r = htable.get(get); byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value List<KeyValue> kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns all versions of this column

5.8.1.4. Put

一個Put操作會給一個cell,創建一個版本,默認使用當前時間戳,當然你也可以自己設置時間戳。這就意味着你可以把時間設置在過去或者未來,或者隨意使用一個Long值。

要想覆蓋一個現有的值,就意味着你的row,column和版本必須完全相等。

5.8.1.4.1. 不指明版本的例子

下面的Put操作不指明版本,所以HBase會用當前時間作爲版本。

Put put = new Put(Bytes.toBytes(row));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(data));
htable.put(put);

5.8.1.4.2. 指明版本的例子

下面的Put操作,指明瞭版本。

Put put = new Put(Bytes.toBytes(row));
long explicitTimeInMs = 555;  // just an example
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
htable.put(put);

5.8.1.5. Delete

有三種不同類型的內部刪除標記  [19]:

  • Delete: 刪除列的指定版本.

  • Delete column: 刪除列的所有版本.

  • Delete family: 刪除特定列族所有列

當刪除一行,HBase將內部對每個列族創建墓碑(非每個單獨列)。

刪除操作的實現是創建一個墓碑標記。刪除時可以指定一個版本,否則默認使用 currentTimeMillis,其含義是"刪除所有版本小於或等於這個版本的單元"。HBase不會就地修改數據,數據也不會立即從文件中刪除,而是使用刪除標記來屏蔽掉這些值。[20]若你指定的版本比這一行中所有數據的版本都晚,就意味着這一行中的所有數據都會被刪除。

參考 Section 9.7.5.4, “KeyValue” 獲取內部 KeyValue 格式更多信息。

5.8.2. 現有的限制

關於版本還有一些bug(或者稱之爲未實現的功能),計劃在下個版本實現。

5.8.2.1. 刪除標記誤標新Put 的數據

刪除標記操作可能會標記其後put的數據。[21]記住,當寫下一個墓碑標記後,只有下一個主壓縮操作發起之後,墓碑纔會清除。假設你刪除所有<= 時間T的數據。但之後,你又執行了一個Put操作,時間戳<= T。就算這個Put發生在刪除操作之後,他的數據也打上了墓碑標記。這個Put並不會失敗,但你做Get操作時,會注意到Put沒有產生影響。只有一個主壓縮執行後,一切纔會恢復正常。如果你的Put操作一直使用升序的版本,這個問題不會有影響。但是即使你不關心時間,也可能出現該情況。只需刪除和插入迅速相互跟隨,就有機會在同一毫秒中遇到。

5.8.2.2. 主壓縮改變查詢的結果

設想一下,你的一個cell有三個版本t1,t2和t3。你的maximum-version設置是2。當你請求獲取全部版本的時候,只會返回兩個,t2和t3。如果你將t2和t3刪除,就會返回t1。但是如果在刪除之前,發生了major compaction操作,那麼什麼值都不會返回了。[22]


5.9. 排序

所有數據模型操作 HBase 返回排序的數據。先是行,再是列族,然後是列修飾(column qualifier), 最後是時間戳(反向排序,所以最新的在前).

5.10. 列的元數據

對列族,沒有內部的KeyValue之外的元數據保存。這樣,HBase不僅在一行中支持很多列,而且支持行之間不同的列。 由你自己負責跟蹤列名。

唯一獲取列族的完整列名的方法是處理所有行。HBase內部保存數據更多信息,請參考 Section 9.7.5.4, “KeyValue”.

5.11. 聯合查詢(Join)

HBase是否支持聯合是一個網上常問問題。簡單來說 : 不支持。至少不像傳統RDBMS那樣支持(如 SQL中帶 equi-joins 或 outer-joins). 正如本章描述的,讀數據模型是 Get 和 Scan.

但並不表示等價聯合不能在應用程序中支持,只是必須自己做。 兩種主要方法:要麼在寫入HBase時對數據做反規範化(denormalize),要麼建立查找表,並在應用或MapReduce代碼中在HBase表之間做聯合(如 RDBMS所展示的,依賴於表的大小,有幾種實現策略,如 nested loops vs. hash-joins)。哪個更好?依賴於你準備做什麼,所以沒有一個單一的回答適合所有方面。


[16] 目前,只有最新的那個是可以獲取到的。

[17] 可以。

[18] 參考 HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.

[19] 參考 Lars Hofhansl's blog for discussion of his attempt adding another, Scanning in HBase: Prefix Delete Marker

[20] 當HBase執行一次major compaction,標記刪除的數據會被實際的刪除,刪除標記也會被刪除。

[22] 參考垃圾收集: Bending time in HBase

 

Chapter 6. HBase 和 Schema 設計

6.1. Schema 創建

可以使用HBaseAdmin或者Chapter 4, HBase Shell 來創建和編輯HBase的schemas

表必須禁用以修改列族,如:

Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
String table = "myTable";

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);
參考 Section 2.6.4, “Client configuration and dependencies connecting to an HBase cluster” ,獲取更多配置客戶端連接的信息。

注意: 0.92.x 支持在線修改模式, 但 0.90.x 需要禁用表。

6.1.1. 模式更新

當表或列族改變時(如 region size, block size), 當下次存在主壓縮及存儲文件重寫時起作用。

參考 Section 9.7.5, “Store” 獲取存儲文件的更多信息。


6.2.  列族的數量

現在HBase並不能很好的處理兩個或者三個以上的列族,所以儘量讓你的列族數量少一些。目前,flush和compaction操作是針對一個Region的。所以當一個列族操作大量數據的時候會引發一個flush,那些相鄰的列族即使數據量很小,也會被一起flush。Compaction操作現在是根據一個列族下的全部文件的數量觸發的,而不是根據文件大小觸發的。當很多的列族在flush和compaction時,會造成很多沒用的I/O負載(要想解決這個問題,需要將flush和compaction操作改爲只針對一個列族) 。 更多壓縮信息, 參考Section 9.7.5.5, “Compaction”.

儘量在你的應用中使用一個列族。只有你的所有查詢操作只訪問一個列族的時候,可以引入第二個和第三個列族.例如,你有兩個列族,但你查詢的時候總是訪問其中的一個,從來不會兩個一起訪問。

6.2.1. 列族的基數

一個表存在多列族,注意基數(如, 行數). 如果列族A有100萬行,列族B有10億行,列族A可能被分散到很多很多區(及區服務器)。這導致掃描列族A低效。

 

6.3.  行鍵(RowKey)設計

6.3.1. 單調遞增行鍵/時序數據

在Tom White的Hadoop: The Definitive Guide一書中,有一個章節描述了一個值得注意的問題:在一個集羣中,一個導入數據的進程一動不動,所有的client都在等待一個region(就是一個節點),過了一會後,變成了下一個region...如果使用了單調遞增或者時序的key就會造成這樣的問題。詳情可以參見IKai畫的漫畫monotonically increasing values are bad。使用了順序的key會將本沒有順序的數據變得有順序,把負載壓在一臺機器上。所以要儘量避免時間戳或者(e.g. 1, 2, 3)這樣的key。

如果你需要導入時間順序的文件(如log)到HBase中,可以學習OpenTSDB的做法。他有一個頁面來描述他的schema.OpenTSDB的Key的格式是[metric_type][event_timestamp],乍一看,似乎違背了不將timestamp做key的建議,但是他並沒有將timestamp作爲key的一個關鍵位置,有成百上千的metric_type就足夠將壓力分散到各個region了。

6.3.2. 儘量最小化行和列的大小(爲何我的存儲文件指示很大?)

在HBase中,值是作爲一個單元(Cell)保存在系統中的,要定位一個單元,需要行,列名和時間戳。通常情況下,如果你的行和列的名字太大(甚至比value的大小還要大)的話,你可能會遇到一些有趣的情況。例如Marc Limotte 在 HBASE-3551(推薦!)尾部提到的現象。在HBase的存儲文件Section 9.7.5.2, “StoreFile (HFile)”中,有一個索引用來方便值的隨機訪問,但是單元座標太大的話,這個索引會佔用HBase分配的大量內存。所以要想解決,可以設置一個更大的塊大小,當然也可以使用更小的列名。壓縮也會使索引相對變大。參考話題 a question storefileIndexSize 用戶郵件列表.

大部分時候,小的低效不會影響很大。不幸的是,這裏會是個問題。無論是列族,屬性和行鍵都會在數據中重複上億次。參考 Section 9.7.5.4, “KeyValue” 獲取更多信息,關於HBase 內部保存數據,瞭解爲什麼這很重要。

6.3.2.1. 列族

儘量使列族名小,最好一個字符。(如 "d" 表示 data/default).

參考 Section 9.7.5.4, “KeyValue” 獲取更多信息,關於HBase 內部保存數據,瞭解爲什麼這很重要。

6.3.2.2. 屬性

詳細屬性名 (如, "myVeryImportantAttribute") 易讀,最好還是用短屬性名 (e.g., "via") 保存到HBase.

參考 Section 9.7.5.4, “KeyValue” 獲取更多信息,關於HBase 內部保存數據,瞭解爲什麼這很重要。t.

6.3.2.3. 行鍵長度

讓行鍵短到仍然可讀即可,這樣對獲取數據有用(e.g., Get vs. Scan)。 一個對數據訪問沒有用處的短鍵,並不比一個具有更好get/scan特性的長鍵更好。設計行鍵需要權衡。

6.3.2.4. 字節模式

long 類型有 8 字節. 8字節內可以保存無符號數字到18,446,744,073,709,551,615. 如果用字符串保存--假設一個字節一個字符--,需要將近3倍的字節數。

不信? 下面是示例代碼,可以自己運行一下。

// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = "" + l;
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26(譯者注:實測值爲22)

6.3.3. 倒序時間戳

一個數據庫處理的通常問題是找到最近版本的值。採用倒序時間戳作爲鍵的一部分可以對此特定情況有很大幫助。也在Tom White的Hadoop書籍的HBase 章節能找到: The Definitive Guide (O'Reilly), 該技術包含追加(Long.MAX_VALUE - timestamp) 到key的後面,如 [key][reverse_timestamp].

表內[key]的最近的值可以用[key]進行 Scan 找到並獲取第一個記錄。由於 HBase 行鍵是排序的,該鍵排在任何比它老的行鍵的前面,所以必然是第一個。

該技術可以用於代替Section 6.4, “ 版本的數量 ” ,其目的是保存所有版本到“永遠”(或一段很長時間) 。同時,採用同樣的Scan技術,可以很快獲取其他版本。
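
下面是一個構造這種鍵並讀取最新一條記錄的簡單示例草稿(表名 'myTable'、鍵前綴 'user123'、列族 'cf' 和列名均爲假設,僅作說明):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "myTable");                     // 假設的表名
    byte[] keyPrefix = Bytes.toBytes("user123");                     // 假設的 [key] 部分
    long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();    // 倒序時間戳
    byte[] rowKey = Bytes.add(keyPrefix, Bytes.toBytes(reverseTs));  // [key][reverse_timestamp]
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));  // 假設的列族/列
    htable.put(put);

    // 取 [key] 的最新值:從 keyPrefix 開始 Scan,第一條就是最新寫入的記錄
    Scan scan = new Scan(keyPrefix);
    ResultScanner rs = htable.getScanner(scan);
    try {
      Result latest = rs.next();
      // 處理 latest ...
    } finally {
      rs.close();
      htable.close();
    }
  }
}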

6.3.4. 行鍵和列族

行鍵在列族範圍內。所以同樣的行鍵可以在同一個表的每個列族中存在而不會衝突。

6.3.5. 行鍵永遠不變

行鍵不能改變。唯一可以“改變”的方式是刪除然後再插入。這是一個網上常問問題,所以要注意開始就要讓行鍵正確(且/或在插入很多數據之前)。

6.4.  版本數量

6.4.1. 最大版本數

行的版本數量是在HColumnDescriptor中設置的,每個列族可以單獨設置,默認是3。這個設置很重要,在Chapter 5, 數據模型中有描述,因爲HBase不會覆蓋已有的值,而是按時間戳在後面追加寫入,過早的版本會在執行主壓縮的時候刪除。最大版本數可以根據具體應用增加或減少。

不推薦將版本最大值設到一個很高的水平 (如, 成百或更多),除非老數據對你很重要。因爲這會導致存儲文件變得極大。

6.4.2.  最小版本數

和最大版本數一樣,最小版本數也是通過HColumnDescriptor 在每個列族中設置的。最小版本數缺省值是0,表示該特性禁用。 最小版本數參數和存活時間(TTL)一起使用,可以實現諸如“保存最近T秒內的數據,最多N個版本,但至少保留M個版本”(M是最小版本數,M<N)這樣的配置。 該參數僅在列族啓用了存活時間(TTL)時有效,且必須小於最大版本數。
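下面是通過 HColumnDescriptor 設置最大/最小版本數和存活時間的簡要示意(假設 admin 爲已創建的 HBaseAdmin,列族名 "cf"、表名 "myTable" 僅作示例;setMinVersions 需要所用版本支持最小版本數特性):

HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setMaxVersions(5);               // 最大版本數,缺省爲 3
cf.setTimeToLive(7 * 24 * 3600);    // 存活時間(TTL),單位爲秒
cf.setMinVersions(1);               // 最小版本數,僅在啓用 TTL 時有意義

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(cf);
admin.createTable(desc);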

6.5.  支持數據類型

HBase 通過 Put 和 Result支持 "bytes-in/bytes-out" 接口,所以任何可被轉爲字節數組的東西可以作爲值存入。輸入可以是字符串,數字,複雜對象,甚至圖像,只要他們能轉爲字節。

存在值的實際長度限制 (如 保存 10-50MB 對象到 HBase 可能對查詢來說太長); 搜索郵件列表獲取本話題的對話。 HBase的所有行都遵循 Chapter 5, 數據模型, 包括版本化。 設計時需考慮到這些,以及列族的塊大小。

6.5.1. 計數器

一種支持的數據類型,值得一提的是“計數器”(如, 具有原子遞增能力的數值)。參考 HTable的 Increment .

同步計數器在區域服務器中完成,不是客戶端。
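下面是計數器用法的簡要示意(假設 conf 爲已有配置,表 "counters"、列族 "cf"、列 "hits" 均爲示例):

HTable table = new HTable(conf, "counters");
long newValue = table.incrementColumnValue(
    Bytes.toBytes("row1"),   // 行鍵
    Bytes.toBytes("cf"),     // 列族
    Bytes.toBytes("hits"),   // 列名
    1L);                     // 原子遞增的增量
table.close();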

6.6. 聯合

如果有多個表,不要在模式設計中忘了 Section 5.11, “Joins” 的潛在因素。

6.7. 存活時間 (TTL)

列族可以設置TTL秒數,HBase 在超時後將自動刪除數據。影響 全部 行的全部版本 - 甚至當前版本。HBase裏面TTL 時間時區是 UTC.

參考 HColumnDescriptor 獲取更多信息。

6.8.  保留刪除的單元

列族可以選擇是否保留已刪除的單元。這就是說,只要 Get 或 Scan 操作指定的時間範圍在刪除生效的時間點之前結束,仍然可以讀到已刪除的單元。這樣即便存在刪除,也可以進行“時間點(point in time)”查詢。

已刪除的單元仍然受TTL控制,並且已刪除單元的數量不會超過“最大版本數”。新的 "raw" scan 選項可以返回所有已刪除的單元和刪除標記。

參考 HColumnDescriptor 獲取更多信息
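下面是一個簡要示意(假設所用版本的 HColumnDescriptor 提供 setKeepDeletedCells 方法、Scan 提供 setRaw 方法,0.92 之後大體如此):

// 建表時讓列族保留已刪除的單元
HColumnDescriptor cf = new HColumnDescriptor("cf");
cf.setKeepDeletedCells(true);

// "raw" scan:返回所有已刪除的單元及刪除標記
Scan scan = new Scan();
scan.setRaw(true);
scan.setMaxVersions();   // 同時取回所有版本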

6.9.  第二索引和替代查詢路徑

This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.

There is no single answer on the best way to handle this because it depends on...

  • Number of users
  • Data size and data arrival rate
  • Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
  • Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)

... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.

It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.

Pay attention to Chapter 11, Performance Tuning when implementing any of these approaches.

Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase

6.9.1.  過濾查詢

根據具體應用,採用 Section 9.4, “Client Request Filters” 也許就足夠了。在這種情況下,不會創建任何第二索引。但是,不要從應用(如單線程客戶端)對大表執行這種全表掃描。

6.9.2.  定期更新第二索引

第二索引可以在另一個表中創建,並通過MapReduce任務定期更新。任務可以在一天內執行多次,但依賴於加載策略,仍可能與主表失去同步。

參考 Section 7.2.2, “HBase MapReduce Read/Write Example” 獲取更多信息.

6.9.3.  雙寫第二索引

Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this is approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see Section 6.9.2, “ Periodic-Update Secondary Index ”).

6.9.4.  總結表(Summary Tables)

對於時間跨度很長(e.g., 年度報表)且數據量巨大的場景,總結表(Summary Table)是常用的做法。可以通過MapReduce任務把總結結果生成到另一個表中。

參考 Section 7.2.4, “HBase MapReduce Summary to HBase Example” 獲取更多信息。

6.9.5.  協處理第二索引

協處理器的行爲類似 RDBMS 的觸發器。該特性在 0.92 中添加。更多信息參考 Section 9.6.3, “Coprocessors”

6.10. 模式(schema)設計對決

This section will describe common schema design questions that appear on the dist-list. These are general guidelines and not laws - each application must consider its own needs.

6.10.1. Rows vs. Versions

A common question is whether one should prefer rows or HBase's built-in-versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.

Preference: Rows (generally speaking).

6.10.2. Rows vs. Columns

Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.

Preference: Rows (generally speaking). To be clear, this guideline is in the context of extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns.

6.11. 操作和性能配置選項

參考性能章節的 Section 11.6, “Schema Design”,獲取更多與操作和性能相關的模式設計選項,如布隆過濾器(Bloom Filters)、按表配置的分區大小(region size)、壓縮和塊大小(blocksize)等。

6.12. 限制

HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (eg. make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found at: Constraint since version 0.94.

 

 

 

Chapter 7. HBase 和 MapReduce

關於 HBase 和 MapReduce詳見 javadocs. 下面是一些附加的幫助文檔. MapReduce的更多信息 (如,通用框架), 參考 Hadoop MapReduce Tutorial.

7.1. Map-Task 分割

7.1.1. 默認 HBase MapReduce 分割器(Splitter)

當 MapReduce 任務的HBase 表使用TableInputFormat爲數據源格式的時候,他的splitter會給這個table的每個region一個map。因此,如果一個table有100個region,就有100個map-tasks,不論需要scan多少個列族 。

7.1.2. 自定義分割器

For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.

 

7.2. HBase MapReduce 例子

7.2.1. HBase MapReduce 讀取例子

下面是使用HBase 作爲源的MapReduce讀取示例。特別是僅有Mapper實例,沒有Reducer。Mapper什麼也不產生。

如下所示...

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
null, // mapper output key 
null, // mapper output value
job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}

...mapper需要繼承於TableMapper...

public class MyMapper extends TableMapper<Text, LongWritable> {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}

7.2.2. HBase MapReduce 讀/寫 示例

下面是使用HBase 作爲源和目標的MapReduce示例. 本示例簡單從一個表複製到另一個表。

Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleReadWrite"); job.setJarByClass(MyReadWriteJob.class); // class that contains mapper Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan,	 // Scan instance to control CF and attribute selection MyMapper.class, // mapper class null,	 // mapper output key null,	 // mapper output value job); TableMapReduceUtil.initTableReducerJob( targetTable, // output table null, // reducer class job); job.setNumReduceTasks(0); boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); }

需要說明 TableMapReduceUtil 在這裏做了什麼,特別是對 reducer:它把 TableOutputFormat 設置爲 outputFormat 類,並在 config 中設置了若干參數(e.g., TableOutputFormat.OUTPUT_TABLE),同時把 reducer 的輸出 key 設置爲 ImmutableBytesWritable,輸出 value 設置爲 Writable。這些都可以由程序員在 job 和 conf 中自行設置,但 TableMapReduceUtil 讓這件事變得更簡單。

下面是 mapper示例, 創建一個 Put,匹配輸入的 Result 並提交. Note: 這是 CopyTable 工具做的.

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row, value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}

這裏並沒有真正的 reducer 步驟,由 TableOutputFormat 負責把 Put 發送到目標表。

這僅是示例, 開發者可以選擇不使用TableOutputFormat並自己連接到目標表。

7.2.3. HBase MapReduce Read/Write 多表輸出示例

TODO: MultiTableOutputFormat 示例.

7.2.4. HBase MapReduce 總結到 HBase 示例

下面是使用HBase 作爲源和目標的MapReduce示例,具有總結步驟。本示例計算一個表中值的個數,並將總結的計數輸出到另一個表。

Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummary"); job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan, // Scan instance to control CF and attribute selection MyMapper.class, // mapper class Text.class, // mapper output key IntWritable.class, // mapper output value job); TableMapReduceUtil.initTableReducerJob( targetTable, // output table MyTableReducer.class, // reducer class job); job.setNumReduceTasks(1); // at least one, adjust as required boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); }

在本示例的mapper中,選取某一列的字符串值作爲需要總結的值。該值作爲 mapper 輸出的 key,IntWritable 則表示一次實例計數。

public static class MyMapper extends TableMapper<Text, IntWritable> {

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    text.set(val);        // we can only emit Writables...
    context.write(text, ONE);
  }
}

在 reducer, "ones" 被統計 (和其他 MR 示例一樣), 產生一個 Put.

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));

    context.write(null, put);
  }
}

7.2.5. HBase MapReduce 總結到文件示例

This is very similar to the summary example above, with the exception that this example uses HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.

Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummaryToFile"); job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer Scan scan = new Scan(); scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs // set other scan attrs TableMapReduceUtil.initTableMapperJob( sourceTable, // input table scan, // Scan instance to control CF and attribute selection MyMapper.class, // mapper class Text.class, // mapper output key IntWritable.class, // mapper output value job); job.setReducerClass(MyReducer.class); // reducer class job.setNumReduceTasks(1); // at least one, adjust as required FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required boolean b = job.waitForCompletion(true); if (!b) { throw new IOException("error with job!"); }
As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}

7.2.6. HBase MapReduce 無Reducer總結到 HBase

It is also possible to perform summaries without a reducer - if you use HBase as the reducer.

An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of values with their values to be incremented for each map-task, and make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and unique keys.
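下面是在 mapper 中做本地聚合、並在 cleanup 中通過 incrementColumnValue 一次性寫入的簡要示意(目標表 "summaryTable"、列族 "cf"、計數列 "count"、源列 "attr1" 均爲假設,需配合 job.setNumReduceTasks(0) 使用):

public static class MySummingMapper extends TableMapper<Text, IntWritable> {

  private Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable;

  @Override
  public void setup(Context context) throws IOException {
    summaryTable = new HTable(HBaseConfiguration.create(), "summaryTable");  // 假設的目標表
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // 在內存 Map 中累加,避免每行都發一次 RPC
    String key = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1);
  }

  @Override
  public void cleanup(Context context) throws IOException {
    // 每個 key 只做一次原子遞增
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()),
          Bytes.toBytes("cf"), Bytes.toBytes("count"), e.getValue());
    }
    summaryTable.close();
  }
}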

In the end, the summary results are in HBase.

7.2.7. HBase MapReduce 總結到 RDBMS

有時,把總結結果輸出到 RDBMS 更合適。這種情況下,可以通過一個自定義的 reducer 把總結直接寫入 RDBMS。setup 方法可以建立到 RDBMS 的連接(連接信息可以通過 context 的自定義參數傳遞),cleanup 方法可以關閉連接。

關鍵是要理解 job 的 reducer 數量會影響總結的實現,必須在 reducer 中對此進行設計:無論是一個 reducer 還是多個 reducer,這沒有對錯之分,取決於你的用例。要認識到,分配給 job 的 reducer 越多,建立的併發 RDBMS 連接就越多,這可以擴展,但只能擴展到一定程度。

public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection c = null;

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }
}

最後,總結的結果被寫入到 RDBMS 表.

7.3. 在一個MapReduce Job中訪問其他的HBase Tables

儘管現有的框架允許一個HBase table作爲一個MapReduce job的輸入,其他的HBase table可以同時作爲普通的表被訪問。例如在一個MapReduce的job中,可以在Mapper的setup方法中創建HTable實例。

public class MyMapper extends TableMapper<Text, LongWritable> {

  private HTable myOtherTable;

  @Override
  public void setup(Context context) {
    myOtherTable = new HTable("myOtherTable");
  }

  // ... 在 map() 中即可同時讀寫 myOtherTable ...
}

7.4. 預測執行

通常建議關閉針對HBase的MapReduce job的預測執行(speculative execution)功能。既可以按每個Job通過屬性配置,也可以對整個集羣配置。使用預測執行意味着重複的運算量,這通常不是你所希望的。

參考 Section 2.8.2.9, “Speculative Execution” for more information.

Chapter 8.  HBase安全

8.1. 安全客戶端訪問HBase

新版 HBase (>= 0.92) 支持客戶端可選 SASL 認證.

這裏描述如何設置HBase 和 HBase 客戶端,以安全連接到HBase 資源.

8.1.1. 先決條件

HBase 必須使用支持安全 Hadoop/HBase 的新 maven profile 來構建: -P security。Secure Hadoop dependent classes are separated under a pseudo-module in the security/ directory and are only included if built with the secure Hadoop profile.

You need to have a working Kerberos KDC.

A HBase configured for secure client access is expected to be running on top of a secured HDFS cluster. HBase must be able to authenticate to HDFS services. HBase needs Kerberos credentials to interact with the Kerberos-enabled HDFS daemons. Authenticating a service should be done using a keytab file. The procedure for creating keytabs for HBase service is the same as for creating keytabs for Hadoop. Those steps are omitted here. Copy the resulting keytab files to wherever HBase Master and RegionServer processes are deployed and make them readable only to the user account under which the HBase daemons will run.

A Kerberos principal has three parts, with the form username/fully.qualified.domain.name@YOUR-REALM.COM. We recommend using hbase as the username portion.

The following is an example of the configuration properties for Kerberos operation that must be added to the hbase-site.xml file on every server machine in the cluster. Required for even the most basic interactions with a secure Hadoop configuration, independent of HBase security.

<property>
  <name>hbase.regionserver.kerberos.principal</name>
  <value>hbase/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hbase.regionserver.keytab.file</name>
  <value>/etc/hbase/conf/keytab.krb5</value>
</property>
<property>
  <name>hbase.master.kerberos.principal</name>
  <value>hbase/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hbase.master.keytab.file</name>
  <value>/etc/hbase/conf/keytab.krb5</value>
</property>

Each HBase client user should also be given a Kerberos principal. This principal should have a password assigned to it (as opposed to a keytab file). The client principal's maxrenewlife should be set so that it can be renewed enough times for the HBase client process to complete. For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within kadmin with: addprinc -maxrenewlife 3days

Long running daemons with indefinite lifetimes that require client access to HBase can instead be configured to log in from a keytab. For each host running such daemons, create a keytab with kadmin or kadmin.local. The procedure for creating keytabs for HBase service is the same as for creating keytabs for Hadoop. Those steps are omitted here. Copy the resulting keytab files to where the client daemon will execute and make them readable only to the user account under which the daemon will run.

8.1.2. 安全操作的服務器端配置

在集羣中每臺服務器的 hbase-site.xml 文件中增加下列內容:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider</value>
</property>

A full shutdown and restart of HBase service is required when deploying these configuration changes.

8.1.3. 客戶端安全操作配置

每個客戶端增加下列內容到 hbase-site.xml:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>

The client environment must be logged in to Kerberos from KDC or keytab via the kinit command before communication with the HBase cluster will be possible.

Be advised that if the hbase.security.authentication and hbase.rpc.engine properties in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.

Once HBase is configured for secure RPC it is possible to optionally configure encrypted communication. To do so, 在每個客戶端的 hbase-site.xml 文件中增加下列內容:

<property>
  <name>hbase.rpc.protection</name>
  <value>privacy</value>
</property>

This configuration property can also be set on a per connection basis. Set it in the Configuration supplied to HTable:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rpc.protection", "privacy");
HTable table = new HTable(conf, tablename);

Expect a ~10% performance penalty for encrypted communication.

8.1.4. 客戶端安全操作配置 - Thrift 網關

每個Thrift網關增加下列內容到 hbase-site.xml:

<property>
  <name>hbase.thrift.keytab.file</name>
  <value>/etc/hbase/conf/hbase.keytab</value>
</property>
<property>
  <name>hbase.thrift.kerberos.principal</name>
  <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value>
</property>

Substitute the appropriate credential and keytab for $USER and $KEYTAB respectively.

The Thrift gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the Thrift gateway itself. All client access via the Thrift gateway will use the Thrift gateway's credential and have its privilege.

8.1.5. 客戶端安全操作配置 - REST 網關

每個REST網關增加下列內容到 hbase-site.xml:

<property>
  <name>hbase.rest.keytab.file</name>
  <value>$KEYTAB</value>
</property>
<property>
  <name>hbase.rest.kerberos.principal</name>
  <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value>
</property>

Substitute the appropriate credential and keytab for $USER and $KEYTAB respectively.

The REST gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the REST gateway itself. All client access via the REST gateway will use the REST gateway's credential and have its privilege.

It should be possible for clients to authenticate with the HBase cluster through the REST gateway in a pass-through manner via SPNEGO HTTP authentication. This is future work.

8.2. 訪問控制

Newer releases of HBase (>= 0.92) support optional access control list (ACL-) based protection of resources on a column family and/or table basis.

This describes how to set up Secure HBase for access control, with an example of granting and revoking user permission on table resources provided.

8.2.1. 先決條件

You must configure HBase for secure operation. Refer to the section "安全客戶端訪問HBase" and complete all of the steps described there.

You must also configure ZooKeeper for secure operation. Changes to ACLs are synchronized throughout the cluster using ZooKeeper. Secure authentication to ZooKeeper must be enabled or otherwise it will be possible to subvert HBase access control via direct client access to ZooKeeper. Refer to the section on secure ZooKeeper configuration and complete all of the steps described there.

8.2.2. 概述

With Secure RPC and Access Control enabled, client access to HBase is authenticated and user data is private unless access has been explicitly granted. Access to data can be granted at a table or per column family basis.

However, the following items have been left out of the initial implementation for simplicity:

  1. Row-level or per value (cell): This would require broader changes for storing the ACLs inline with rows. It is a future goal.

  2. Push down of file ownership to HDFS: HBase is not designed for the case where files may have different permissions than the HBase system principal. Pushing file ownership down into HDFS would necessitate changes to core code. Also, while HDFS file ownership would make applying quotas easy, and possibly make bulk imports more straightforward, it is not clear that it would offer a more secure setup.

  3. HBase managed "roles" as collections of permissions: We will not model "roles" internally in HBase to begin with. We instead allow group names to be granted permissions, which allows external modeling of roles via group membership. Groups are created and manipulated externally to HBase, via the Hadoop group mapping service.

Access control mechanisms are mature and fairly standardized in the relational database world. The HBase implementation approximates current convention, but HBase has a simpler feature set than relational databases, especially in terms of client operations. We don't distinguish between an insert (new record) and update (of existing record), for example, as both collapse down into a Put. Accordingly, the important operations condense to four permissions: READ, WRITE, CREATE, and ADMIN.

Operation To Permission Mapping

Permission    Operation
Read          Get, Exists, Scan
Write         Put, Delete, Lock/UnlockRow, IncrementColumnValue, CheckAndDelete/Put, Flush, Compact
Create        Create, Alter, Drop
Admin         Enable/Disable, Split, Major Compact, Grant, Revoke, Shutdown

Permissions can be granted in any of the following scopes, though CREATE and ADMIN permissions are effective only at table scope.

  • Table

    • Read: User can read from any column family in table

    • Write: User can write to any column family in table

    • Create: User can alter table attributes; add, alter, or drop column families; and drop the table.

    • Admin: User can alter table attributes; add, alter, or drop column families; and enable, disable, or drop the table. User can also trigger region (re)assignments or relocation.

  • Column Family

    • Read: User can read from the column family

    • Write: User can write to the column family

There is also an implicit global scope for the superuser.

The superuser is a principal, specified in the HBase site configuration file, that has equivalent access to HBase as the 'root' user would on a UNIX derived system. Normally this is the principal that the HBase processes themselves authenticate as. Although future versions of HBase Access Control may support multiple superusers, the superuser privilege will always include the principal used to run the HMaster process. Only the superuser is allowed to create tables, switch the balancer on or off, or take other actions with global consequence. Furthermore, the superuser has an implicit grant of all permissions to all resources.

Tables have a new metadata attribute: OWNER, the user principal who owns the table. By default this will be set to the user principal who creates the table, though it may be changed at table creation time or during an alter operation by setting or changing the OWNER table attribute. Only a single user principal can own a table at a given time. A table owner will have all permissions over a given table.

8.2.3. Server-side Configuration for Access Control

Enable the AccessController coprocessor in the cluster configuration and restart HBase. The restart can be a rolling one. Complete the restart of all Master and RegionServer processes before setting up ACLs.

To enable the AccessController, modify the hbase-site.xml file on every server machine in the cluster to look like:

<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,
  org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

8.2.4. Shell Enhancements for Access Control

The HBase shell has been extended to provide simple commands for editing and updating user permissions. The following commands have been added for access control list management:

Grant

grant <user> <permissions> <table> [ <column family> [ <column qualifier> ] ]

<permissions> is zero or more letters from the set "RWCA": READ('R'), WRITE('W'), CREATE('C'), ADMIN('A').

Note: Grants and revocations of individual permissions on a resource are both accomplished using the grant command. A separate revoke command is also provided by the shell, but this is for fast revocation of all of a user's access rights to a given resource only.

Revoke

revoke <user> <table> [ <column family> [ <column qualifier> ] ]

Alter

The alter command has been extended to allow ownership assignment:

alter 'tablename', {OWNER => 'username'}

User Permission

The user_permission command shows all access permissions for the current user for a given table:

user_permission <table>

Chapter 9. 架構

9.1. 概述

9.1.1. NoSQL?

HBase是一種 "NoSQL" 數據庫. "NoSQL"是一個通用詞表示數據庫不是RDBMS ,後者支持 SQL 作爲主要訪問手段。有許多種 NoSQL 數據庫: BerkeleyDB 是本地 NoSQL 數據庫例子, 而 HBase 是大型分佈式數據庫。 技術上來說, HBase 更像是"數據存儲(Data Store)" 多於 "數據庫(Data Base)"。因爲缺少很多RDBMS特性, 如列類型,第二索引,觸發器,高級查詢語言等.

然而, HBase 有許多特徵同時支持線性化和模塊化擴充。 HBase 集羣通過增加RegionServer進行擴充,而RegionServer可以部署在普通的商用服務器上。例如,如果集羣從10個擴充到20個RegionServer,存儲空間和處理容量都同時翻倍。 RDBMS 也能很好擴充,但只能擴展到一定程度 - 特別是單個數據庫服務器的大小 - 同時,爲了更好的性能,需要特殊的硬件和存儲設備。 HBase 特性:

  • 強一致性讀寫: HBase 不是 "最終一致性(eventually consistent)" 數據存儲. 這讓它很適合高速計數聚合類任務。
  • 自動分片(Automatic sharding): HBase 表通過region分佈在集羣中。數據增長時,region會自動分割並重新分佈。
  • RegionServer 自動故障轉移
  • Hadoop/HDFS 集成: HBase 開箱即用地支持 HDFS 作爲它的分佈式文件系統。
  • MapReduce: HBase 通過MapReduce支持大併發處理, HBase 可以同時做源和目標.
  • Java 客戶端 API: HBase 支持易於使用的 Java API 進行編程訪問.
  • Thrift/REST API: HBase 也支持Thrift 和 REST 作爲非Java 前端.
  • Block Cache 和 Bloom Filters: 對於大容量查詢優化, HBase支持 Block Cache 和 Bloom Filters。
  • 運維管理: HBase提供內置網頁用於運維視角和JMX 度量.

9.1.2. 什麼時候用 HBase?

HBase不適合所有問題.

首先,確信有足夠多數據,如果有上億或上千億行數據,HBase是很好的備選。 如果只有上千或上百萬行,則用傳統的RDBMS可能是更好的選擇。因爲所有數據可以在一兩個節點保存,集羣其他節點可能閒置。

其次,確信可以不依賴所有RDBMS的額外特性 (e.g., 列數據類型, 第二索引, 事務,高級查詢語言等)。 一個建立在RDBMS上的應用,不能僅通過更換一個JDBC驅動就移植到HBase。相對於移植,需要把從RDBMS 到 HBase當成一次完全的重新設計。

第三, 確信你有足夠硬件。甚至 HDFS 在小於5個數據節點時,幹不好什麼事情 (根據如 HDFS 塊複製具有缺省值 3), 還要加上一個 NameNode.

HBase 能在單獨的筆記本上運行良好。但這應僅當成開發配置。

9.1.3.  HBase 和 Hadoop/HDFS 的區別?

HDFS 是分佈式文件系統,適合保存大文件。官方宣稱它並非普通用途文件系統,不提供文件的個別記錄的快速查詢。 另一方面,HBase基於HDFS且提供大表的記錄快速查找(和更新)。這有時可能引起概念混亂。 HBase 內部將數據放到索引好的 "存儲文件(StoreFiles)" ,以便高速查詢。存儲文件位於 HDFS中。參考Chapter 5,數據模型 和該章其他內容獲取更多HBase如何歸檔的信息。

9.2. 目錄表(Catalog Tables)

目錄表 -ROOT- 和 .META. 作爲 HBase 表存在。他們被HBase shell的 list 命令過濾掉了, 但他們和其他表一樣存在。

9.2.1. ROOT

-ROOT- 保存 .META. 表存在哪裏的蹤跡. -ROOT- 表結構如下:

Key:

  • .META. region key (.META.,,1)

Values:

  • info:regioninfo (序列化.META.的 HRegionInfo 實例 )
  • info:server ( 保存 .META.的RegionServer的server:port)
  • info:serverstartcode ( 保存 .META.的RegionServer進程的啓動時間)

9.2.2. META

.META. 保存系統中所有region列表。 .META.表結構如下:

Key:

  • Region key 格式 ([table],[region start key],[region id])

Values:

  • info:regioninfo (本 region 的序列化 HRegionInfo 實例)
  • info:server ( 保存本 region 的RegionServer的server:port)
  • info:serverstartcode ( 保存本 region 的RegionServer進程的啓動時間)

當表在分割過程中,會創建額外的兩列, info:splitA 和 info:splitB,代表兩個子(daughter) region。這兩列的值同樣是序列化的HRegionInfo 實例。region分割完畢後,這行會被刪除。

HRegionInfo的備註: 空 key 用於指示表的開始和結束。具有空開始鍵值的region是表內的首region。 如果 region 同時有空起始和結束key,說明它是表內的唯一region。

在需要編程訪問(希望不要)目錄元數據時,參考 Writables 工具.

9.2.3. 啓動時序

META 地址首先在ROOT 中設置。META 會更新 server 和 startcode 的值.

需要 region-RegionServer 分配信息, 參考 Section 9.7.2, “Region-RegionServer 分配”.

9.3. 客戶端

HBase客戶端的 HTable類負責尋找相應的RegionServer來處理行。它會先查詢 .META. 和 -ROOT- 目錄表,然後確定region的位置。定位到所需要的region後,客戶端會直接去訪問相應的region(不經過master),發起讀寫請求。這些信息會緩存在客戶端,這樣後續請求就不必每次都去查詢。如果一個region已經失效(原因可能是master做了load balance或者RegionServer掛了),客戶端就會重新執行這個步驟,確定要去訪問的新的地址。

參考 Section 9.5.2, “Runtime Impact” for more information about the impact of the Master on HBase Client communication.

管理集羣操作是經由HBaseAdmin發起的

9.3.1. 連接

關於連接的配置信息,參見Section 3.7, “連接HBase集羣的客戶端配置和依賴”.

HTable不是線程安全的。建議使用同一個HBaseConfiguration實例來創建HTable實例。這樣可以共享ZooKeeper和socket實例。例如,最好這樣做:

HBaseConfiguration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");

而不是這樣:

HBaseConfiguration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
HBaseConfiguration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");

如果你想知道的更多的關於HBase客戶端connection的知識,可以參照: HConnectionManager.

9.3.1.1. 連接池

對需要高端多線程訪問的應用 (如網頁服務器或應用服務器需要在一個JVM服務很多應用線程),參考 HTablePool.
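下面是 HTablePool 用法的簡要示意(假設使用 0.92/0.94 的 API,conf 爲已有配置,表名 "myTable" 僅作示例):

HTablePool pool = new HTablePool(conf, 10);            // 池中最多緩存 10 個 HTable 實例
HTableInterface table = pool.getTable("myTable");
try {
  Result r = table.get(new Get(Bytes.toBytes("row1")));
  // ... 使用結果 ...
} finally {
  table.close();   // 0.92+ 中 close() 會把實例歸還到池中;較老版本則使用 pool.putTable(table)
}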

 

9.3.2. 寫緩衝和批量操作

若關閉了HTable中的 Section 11.7.4, “AutoFlush”,Put操作會在寫緩衝填滿的時候向RegionServer發起請求。默認情況下,寫緩衝是2MB。在HTable實例被丟棄之前,要調用close()或flushCommits()操作,這樣寫緩衝中的數據就不會丟失。

要想更好地細粒度控制 Put和Delete的批量操作,可以參考HTable中的batch 方法。
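下面是寫緩衝與顯式提交的簡要示意(假設 conf 爲已有配置,表 "myTable"、列族 "cf" 僅作示例):

HTable table = new HTable(conf, "myTable");
table.setAutoFlush(false);                   // 關閉自動 flush,啓用客戶端寫緩衝
table.setWriteBufferSize(4 * 1024 * 1024);   // 可選:把寫緩衝從默認 2MB 調到 4MB

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("value1"));
table.put(put);        // 此時可能只寫入了客戶端緩衝

table.flushCommits();  // 顯式把緩衝中的修改發送到 RegionServer
table.close();         // close() 同樣會先執行 flushCommits()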

9.3.3. 外部客戶端

關於非Java客戶端和定製協議信息,在 Chapter 10, 外部 API

9.3.4. 行鎖

行鎖 在客戶端 API中仍然存在, 但是 不鼓勵使用,因爲管理不好,會鎖定整個RegionServer.

參考 ticket HBASE-2332,該ticket旨在從客戶端移除此特性。

 

9.4. 客戶端請求過濾器

Get 和 Scan 實例可以用 filters 配置,以應用於 RegionServer.

過濾器可能會搞混,因爲有很多類型的過濾器, 最好通過理解過濾器功能組來了解他們。

9.4.1. 結構(Structural)過濾器

結構過濾器包含其他過濾器

9.4.1.1. FilterList

FilterList 代表一個過濾器列表,過濾器間具有 FilterList.Operator.MUST_PASS_ALL 或 FilterList.Operator.MUST_PASS_ONE 關係。下面示例展示兩個過濾器的'或'關係(檢查同一屬性的'my value' 或'my other value' ).

FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    Bytes.toBytes("my value")
    );
list.add(filter1);
SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    Bytes.toBytes("my other value")
    );
list.add(filter2);
scan.setFilter(list);

9.4.2. 列值

9.4.2.1. SingleColumnValueFilter

SingleColumnValueFilter 用於測試列值相等 (CompareOp.EQUAL ), 不等 (CompareOp.NOT_EQUAL),或範圍 (e.g., CompareOp.GREATER). 下面示例檢查列值和字符串'my value' 相等...

SingleColumnValueFilter filter = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    Bytes.toBytes("my value")
    );
scan.setFilter(filter);

9.4.3. 列值比較器

過濾器包內有好幾種比較器類需要特別提及。這些比較器和其他過濾器一起使用, 如 Section 9.4.2.1, “SingleColumnValueFilter”.

9.4.3.1. RegexStringComparator

RegexStringComparator 支持值比較的正則表達式 。

RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    comp
    );
scan.setFilter(filter);

參考 Oracle JavaDoc 瞭解 supported RegEx patterns in Java.

9.4.3.2. SubstringComparator

SubstringComparator 用於檢測一個子串是否存在於值中。大小寫不敏感。

SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    cf,
    column,
    CompareOp.EQUAL,
    comp
    );
scan.setFilter(filter);

9.4.3.3. BinaryPrefixComparator

參考 BinaryPrefixComparator.

9.4.3.4. BinaryComparator

參考 BinaryComparator.

9.4.4. 鍵值元數據

由於HBase 內部以鍵值對(KeyValue)的形式保存數據,鍵值元數據過濾器用於評估一行中某個鍵(即 ColumnFamily:Column qualifier)是否存在,而上一節的過濾器針對的是值。

9.4.4.1. FamilyFilter

FamilyFilter 用於過濾列族。 通常,在Scan中選擇ColumnFamily優於在過濾器中做。

9.4.4.2. QualifierFilter

QualifierFilter 用於基於列名(即 Qualifier)過濾.
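下面是 QualifierFilter 的簡要示意(列名 "myQualifier" 僅爲假設,scan 爲已有的 Scan 實例):

QualifierFilter filter = new QualifierFilter(
    CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("myQualifier")));
scan.setFilter(filter);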

9.4.4.3. ColumnPrefixFilter

ColumnPrefixFilter 可基於列名(即Qualifier)前綴過濾。

A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row and for each involved column family. It can be used to efficiently get a subset of the columns in very wide rows.

Note: The same column qualifier can be used in different column families. This filter returns all matching columns.

Example: Find all columns in a row and family that start with "abc"

HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] prefix = Bytes.toBytes("abc");
Scan scan = new Scan(row, row); // (optional) limit to one row
scan.addFamily(family);         // (optional) limit to one family
Filter f = new ColumnPrefixFilter(prefix);
scan.setFilter(f);
scan.setBatch(10);              // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
  for (KeyValue kv : r.raw()) {
    // each kv represents a column
  }
}
rs.close();

9.4.4.4. MultipleColumnPrefixFilter

MultipleColumnPrefixFilter 和 ColumnPrefixFilter 行爲差不多,但可以指定多個前綴。

Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the first column matching the lowest prefix and also seeks past ranges of columns between prefixes. It can be used to efficiently get discontinuous sets of columns from very wide rows.

Example: Find all columns in a row and family that start with "abc" or "xyz"

HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
Scan scan = new Scan(row, row); // (optional) limit to one row
scan.addFamily(family);         // (optional) limit to one family
Filter f = new MultipleColumnPrefixFilter(prefixes);
scan.setFilter(f);
scan.setBatch(10);              // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
  for (KeyValue kv : r.raw()) {
    // each kv represents a column
  }
}
rs.close();

9.4.4.5. ColumnRangeFilter

ColumnRangeFilter 可以對一行內的列進行高效掃描。

A ColumnRangeFilter can seek ahead to the first matching column for each involved column family. It can be used to efficiently get a 'slice' of the columns of a very wide row. i.e. you have a million columns in a row but you only want to look at columns bbbb-bbdd.

Note: The same column qualifier can be used in different column families. This filter returns all matching columns.

Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" (inclusive)

HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] startColumn = Bytes.toBytes("bbbb");
byte[] endColumn = Bytes.toBytes("bbdd");
Scan scan = new Scan(row, row); // (optional) limit to one row
scan.addFamily(family);         // (optional) limit to one family
Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
scan.setFilter(f);
scan.setBatch(10);              // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
  for (KeyValue kv : r.raw()) {
    // each kv represents a column
  }
}
rs.close();

Note: HBase 0.92 引入

9.4.5. RowKey

9.4.5.1. RowFilter

通常認爲行選擇時Scan採用 startRow/stopRow 方法比較好。然而 RowFilter 也可以用。
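下面是 RowFilter 配合 RegexStringComparator 的簡要示意(行鍵前綴 "user-" 僅爲假設,scan 爲已有的 Scan 實例):

RowFilter filter = new RowFilter(
    CompareOp.EQUAL,
    new RegexStringComparator("user-.*"));   // 匹配以 "user-" 開頭的行鍵
scan.setFilter(filter);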

9.4.6. Utility

9.4.6.1. FirstKeyOnlyFilter

This is primarily used for rowcount jobs. 參考 FirstKeyOnlyFilter.

9.5. 主服務器

HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the Section 9.9.1, “NameNode”.

9.5.1. Startup Behavior

If run in a multi-Master environment, all Masters compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to take over the Master role.

9.5.2. Runtime Impact

A common dist-list question is what happens to an HBase cluster when the Master goes down. Because the HBase client talks directly to the RegionServers, the cluster can still function in a "steady state." Additionally, per Section 9.2, “Catalog Tables” ROOT and META exist as HBase tables (i.e., are not resident in the Master). However, the Master controls critical functions such as RegionServer failover and completing region splits. So while the cluster can still run for a time without the Master, the Master should be restarted as soon as possible.

9.5.3. Interface

The methods exposed by HMasterInterface are primarily metadata-oriented methods:

  • Table (createTable, modifyTable, removeTable, enable, disable)
  • ColumnFamily (addColumn, modifyColumn, removeColumn)
  • Region (move, assign, unassign)

For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server.

9.5.4. 進程

Master 後臺運行幾種線程:

9.5.4.1. LoadBalancer

Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load. 參考Section 2.8.3.1, “Balancer” for configuring this property.

參考 Section 9.7.2, “Region-RegionServer Assignment” for more information on region assignment.

9.5.4.2. CatalogJanitor

Periodically checks and cleans up the .META. table. 參考 Section 9.2.2, “META” for more information on META.

9.6. RegionServer

HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a Section 9.9.2, “DataNode”.

9.6.1. 接口

The methods exposed by HRegionInterface contain both data-oriented and region-maintenance methods:

  • Data (get, put, delete, next, etc.)
  • Region (splitRegion, compactRegion, etc.)

For example, when the HBaseAdmin method majorCompact is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region.

9.6.2. 進程

RegionServer 後臺運行幾種線程:

9.6.2.1. CompactSplitThread

檢查分割並處理次(minor)壓縮。

9.6.2.2. MajorCompactionChecker

檢查主壓縮。

9.6.2.3. MemStoreFlusher

週期將寫到內存存儲的內容刷到文件存儲。

9.6.2.4. LogRoller

週期檢查RegionServer 的 HLog.

9.6.3. 協處理器

協處理器在0.92版添加。 有一個詳細帖子 Blog Overview of CoProcessors 供參考。文檔最終會放到本參考手冊,但該blog是當前能獲取的大部分信息。

9.6.4. 塊緩存

9.6.4.1. 設計

The Block Cache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:

  • Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions. The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.
  • Multi access priority: If a block in the previous priority group is accessed again, it upgrades to this priority. It is thus part of the second group considered during evictions.
  • In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed. Catalog tables are configured like this. This group is the last one considered during evictions.

For more information, see the LruBlockCache source

9.6.4.2. 使用

Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache. This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance. An important concept is the working set size, or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.

The way to calculate how much memory is available in HBase for caching is:

number of region servers * heap size * hfile.block.cache.size * 0.85

The default value for the block cache is 0.25 which represents 25% of the available heap. The last value (85%) is the default acceptable loading factor in the LRU cache after which eviction is started. The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks. Here are some examples:

  • One region server with the default heap size (1GB) and the default block cache size will have 217MB of block cache available.
  • 20 region servers with the heap size set to 8GB and a default block cache size will have 34GB of block cache.
  • 100 region servers with the heap size set to 24GB and a block cache size of 0.5 will have about 1TB of block cache.

Your data isn't the only resident of the block cache, here are others that you may have to take into account:

  • Catalog tables: The -ROOT- and .META. tables are forced into the block cache and have the in-memory priority which means that they are harder to evict. The former never uses more than a few hundreds of bytes while the latter can occupy a few MBs (depending on the number of regions).
  • HFiles indexes: HFile is the file format that HBase uses to store data in HDFS and it contains a multi-layered index in order to seek to the data without having to read the whole file. The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing. For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.
  • Keys: Taking into account only the values that are being stored is missing half the picture since every value is stored along with its keys (row key, family, qualifier, and timestamp). 參考 Section 6.3.2, “Try to minimize row and column sizes”.
  • Bloom filters: Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.

Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and checkout the relevant metrics. For keys, sampling can be done by using the HFile command line tool and look for the average key size metric.

It's generally bad to use block caching when the WSS doesn't fit in memory. This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:

  • Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time such that the chance of hitting a cached block is close to 0. Setting block caching on such a table is a waste of memory and CPU cycles, more so that it will generate more garbage to pick up by the JVM. For more information on monitoring GC, see Section 12.2.3, “JVM Garbage Collection Logs”.
  • Mapping a table: In a typical MapReduce job that takes a table in input, every row will be read only once so there's no need to put them into the block cache. The Scan object has the option of turning this off via the setCacheBlocks method (set it to false). You can still keep block caching turned on on this table if you need fast random read access. An example would be counting the number of rows in a table that serves live traffic, caching every block of that table would create massive churn and would surely evict data that's currently in use.

9.6.5. 預寫日誌 (WAL)

9.6.5.1. Purpose

每個RegionServer會將更新(Puts, Deletes)先記錄到預寫日誌(WAL)中,然後再將其寫入Section 9.7.5, “Store”的Section 9.7.5.1, “MemStore”裏面。這樣就保證了HBase寫操作的可靠性。如果沒有WAL,當RegionServer宕掉的時候,MemStore還沒有flush,StoreFile還沒有保存,數據就會丟失。HLog 是HBase的一個WAL實現,一個RegionServer有一個HLog實例。

WAL 保存在HDFS 的 /hbase/.logs/ 目錄裏面,按RegionServer劃分子目錄。

要想知道更多的信息,可以訪問維基百科 Write-Ahead Log 的文章.

 

9.6.5.2. WAL Flushing

TODO (describe).

9.6.5.3. WAL Splitting

9.6.5.3.1. 當RegionServer宕掉的時候,如何恢復

When a RegionServer crashes, it will lose its ephemeral lease in ZooKeeper...TODO

9.6.5.3.2. hbase.hlog.split.skip.errors

默認設置爲 true,在split執行中發生的任何錯誤會被記錄,有問題的WAL會被移動到HBase rootdir目錄下的.corrupt目錄,接着進行處理。如果設置爲 false,異常會被拋出,split會記錄錯誤。[23]

9.6.5.3.3. 如何處理一個髮生在當RegionServers' WALs 分割時候的EOFExceptions異常

如果我們在分割日誌的時候發生EOF,就是hbase.hlog.split.skip.errors設置爲 false,我們也會進行處理。一個EOF會發生在一行一行讀取Log,但是Log中最後一行似乎只寫了一半就停止了。如果在處理過程中發生了EOF,我們還會繼續處理,除非這個文件是要處理的最後一個文件。[24]



[23] 參考 HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it. We need to do more than just fail split if this flag is set.

[24] 要想知道背景知識, 參見HBASE-2643 Figure how to deal with eof splitting logs

9.7. 區域

區域(Region)是表可用性和分佈的基本元素,由每個列族的一個存儲(Store)組成。對象層級圖如下:

Table       (HBase table)
    Region       (Regions for the table)
        Store         (Store per ColumnFamily for each Region for the table)
            MemStore      (MemStore for each Store for each Region for the table)
            StoreFile     (StoreFiles for each Store for each Region for the table)
                Block         (Blocks within a StoreFile within a Store for each Region for the table)

關於HBase文件寫到HDFS的描述,參考 Section 12.7.2, “瀏覽 HDFS的 HBase 對象”.

9.7.1. Region 大小

Region的大小是一個棘手的問題,需要考量如下幾個因素。

  • Regions是可用性和分佈式的最基本單位

  • HBase通過將region切分在許多機器上實現分佈式。也就是說,你如果有16GB的數據,只分了2個region, 你卻有20臺機器,有18臺就浪費了。

  • region數目太多就會造成性能下降,現在比以前好多了。但是對於同樣大小的數據,700個region比3000個要好。

  • region數目太少就會妨礙可擴展性,降低並行能力。有的時候導致壓力不夠分散。這就是爲什麼,你向一個10節點的HBase集羣導入200MB的數據,大部分的節點是idle的。

  • RegionServer中1個region和10個region索引需要的內存量沒有太多的差別。

最好是使用默認的配置,可以把熱的表配小一點(或者受到split熱點的region把壓力分散到集羣中)。如果你的cell的大小比較大(100KB或更大),就可以把region的大小調到1GB。

參考 Section 2.5.2.6, “更大區域” 獲取配置更多信息.

9.7.2. 區域-區域服務器分配

本節描述區域如何分配到區域服務器。

9.7.2.1. 啓動

當HBase啓動時,區域分配如下(短版本):

  1. 啓動時主服務器調用AssignmentManager.
  2. AssignmentManager 在META 中查找已經存在的區域分配。
  3. 如果區域分配還有效(如 RegionServer 還在線) ,那麼分配繼續保持。
  4. 如果區域分配失效,LoadBalancerFactory 被調用來分配區域。 DefaultLoadBalancer 將隨機分配區域到RegionServer.
  5. META 隨 RegionServer 分配更新(如果需要) , RegionServer 啓動區域開啓代碼(RegionServer 啓動時進程)

9.7.2.2. 故障轉移

當區域服務器出故障退出時 (短版本):

  1. 區域立即不可獲取,因爲區域服務器退出。
  2. 主服務器會檢測到區域服務器退出。
  3. 區域分配會失效並被重新分配,如同啓動時序。

9.7.2.3. 區域負載均衡

區域可以定期移動,見 Section 9.5.4.1, “LoadBalancer”.

9.7.3. 區域-區域服務器本地化

Over time, Region-RegionServer locality is achieved via HDFS block replication. The HDFS client does the following by default when choosing locations to write replicas:

  1. First replica is written to local node
  2. Second replica is written to another node in same rack
  3. Third replica is written to a node in another rack (if sufficient nodes)

Thus, HBase eventually achieves locality for a region after a flush or a compaction. In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local), however as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer.

For more information, see HDFS Design on Replica Placement and also Lars George's blog on HBase and HDFS locality.

9.7.4. 區域分割

區域服務器的分割操作是不可見的,因爲Master不會參與其中。區域服務器切割region的步驟是,先將該region下線,然後切割,將其子region加入到META元信息中,再將他們加入到原本的區域服務器中,最後彙報Master.參見Section 2.8.2.7, “管理 Splitting”來手動管理切割操作(以及爲何這麼做)。

9.7.4.1. 自定義分割策略

缺省分割策略可以被重寫,採用自定義RegionSplitPolicy (HBase 0.94+).一般自定義分割策略應該擴展HBase的缺省分割策略: ConstantSizeRegionSplitPolicy.

策略可以HBaseConfiguration 全局使用,或基於每張表:

HTableDescriptor myHtd = ...;
myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());

9.7.5. 存儲

一個存儲(Store)包含一個內存存儲(MemStore)和若干文件存儲(StoreFile,即HFile)。一個Store對應某個region中的一個列族。

9.7.5.1. MemStore

MemStore是Store中的內存存儲,保存待寫入的修改內容,修改的內容是KeyValue。當flush的時候,現有的MemStore會生成快照,然後清空。在執行快照期間,HBase會繼續接收修改操作,寫入新的MemStore,直到flush完成。

9.7.5.2. StoreFile (HFile)

文件存儲是數據存在的地方。

9.7.5.2.1. HFile Format

hfile文件格式是基於BigTable [2006]論文中的SSTable。構建在Hadoop的tfile上面(直接使用了tfile的單元測試和壓縮工具)。 Schubert Zhang 的博客HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs詳細介紹了HBases的hfile。Matteo Bertozzi也做了詳細的介紹HBase I/O: HFile

For more information, see the HFile source code. Also see Appendix E, HFile format version 2 for information about the HFile v2 format that was included in 0.92.

9.7.5.2.2. HFile 工具

要想看到hfile內容的文本化版本,你可以使用org.apache.hadoop.hbase.io.hfile.HFile 工具。可以這樣用:

$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile

例如,你想看文件 hdfs://10.81.47.41:9000/hbase/TEST/1418428042/DSMP/4759508618286845475的內容, 就執行如下的命令:

$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:9000/hbase/TEST/1418428042/DSMP/4759508618286845475

如果你沒有輸入-v,就僅僅能看到一個hfile的彙總信息。其他功能的用法可以看HFile的文檔。

9.7.5.2.3. StoreFile Directory Structure on HDFS

For more information of what StoreFiles look like on HDFS with respect to the directory structure, see Section 12.7.2, “Browsing HDFS for HBase Objects”.

9.7.5.3. Blocks

StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.

Compression happens at the block level within StoreFiles. For more information on compression, see Appendix C, Compression In HBase.

For more information on blocks, see the HFileBlock source code.

9.7.5.4. KeyValue

The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array and takes offsets and lengths into passed array at where to start interpreting the content as KeyValue.

The KeyValue format inside a byte array is:

  • keylength
  • valuelength
  • key
  • value

The Key is further decomposed as:

  • rowlength
  • row (i.e., the rowkey)
  • columnfamilylength
  • columnfamily
  • columnqualifier
  • timestamp
  • keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)

KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.

9.7.5.4.1. Example

To emphasize the points above, examine what happens with two Puts for two different columns for the same row:

  • Put #1: rowkey=row1, cf:attr1=value1
  • Put #2: rowkey=row1, cf:attr2=value2

Even though these are for the same row, a KeyValue is created for each column:

Key portion for Put #1:

  • rowlength ------------> 4
  • row -----------------> row1
  • columnfamilylength ---> 2
  • columnfamily --------> cf
  • columnqualifier ------> attr1
  • timestamp -----------> server time of Put
  • keytype -------------> Put

Key portion for Put #2:

  • rowlength ------------> 4
  • row -----------------> row1
  • columnfamilylength ---> 2
  • columnfamily --------> cf
  • columnqualifier ------> attr2
  • timestamp -----------> server time of Put
  • keytype -------------> Put

It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is.

9.7.5.5. 壓縮

有兩種類型的壓縮:次(minor)壓縮和主(major)壓縮。minor壓縮通常會將數個小的相鄰的存儲文件合併成一個大的。minor壓縮不會刪除打上刪除標記的數據,也不會刪除過期的數據,major壓縮纔會刪除這些數據。有些時候minor壓縮會選中一個store中的全部文件,這時它實際上就是一次major壓縮。關於minor壓縮如何選擇文件,可以參見ascii diagram in the Store source code.

在執行一次major壓縮之後,一個store只會有一個storefile,通常情況下這樣可以提升性能。注意:major壓縮會將store中的數據全部重寫,在一個負載很大的系統中,這個開銷是很大的。所以在大型系統中,通常會自己Section 2.8.2.8, “管理 Compaction”。

壓縮 不會 進行分區合併。參考 Section 14.2.2, “Merge” 獲取更多合併的信息。

9.7.5.5.1. Compaction File Selection

To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that will serve as useful reference. It has been copied below:

/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | | */

Important knobs:

  • hbase.store.compaction.ratio Ratio used in compaction file selection algorithm (default 1.2f).
  • hbase.hstore.compaction.min (.90 hbase.hstore.compactionThreshold) (files) Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
  • hbase.hstore.compaction.max (files) Maximum number of StoreFiles to compact per minor compaction (default 10).
  • hbase.hstore.compaction.min.size (bytes) Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 mb).
  • hbase.hstore.compaction.max.size (.92) (bytes) Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).

The minor compaction StoreFile selection logic is size based, and selects a file for compaction when the file <= sum(smaller_files) * hbase.hstore.compaction.ratio.

9.7.5.5.2. Minor Compaction File Selection - Example #1 (Basic Example)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f
  • hbase.hstore.compaction.min = 3 (files)
  • hbase.hstore.compaction.max = 5 (files)
  • hbase.hstore.compaction.min.size = 10 (bytes)
  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.

Why?

  • 100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.
  • 50 --> No, because sum(23, 12, 12) * 1.0 = 47.
  • 23 --> Yes, because sum(12, 12) * 1.0 = 24.
  • 12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5
  • 12 --> Yes, because the previous file had been included, and because this does not exceed the max-file limit of 5.

9.7.5.5.3. Minor Compaction File Selection - Example #2 (Not Enough Files To Compact)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f
  • hbase.hstore.compaction.min = 3 (files)
  • hbase.hstore.compaction.max = 5 (files)
  • hbase.hstore.compaction.min.size = 10 (bytes)
  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to newest). With the above parameters, no compaction will be started.

Why?

  • 100 --> No, because sum(25, 12, 12) * 1.0 = 47
  • 25 --> No, because sum(12, 12) * 1.0 = 24
  • 12 --> No. Candidate because sum(12) * 1.0 = 12, there are only 2 files to compact and that is less than the threshold of 3
  • 12 --> No. Candidate because the previous StoreFile was, but there are not enough files to compact

9.7.5.5.4. Minor Compaction File Selection - Example #3 (Limiting Files To Compact)

This example mirrors an example from the unit test TestCompactSelection.

  • hbase.store.compaction.ratio = 1.0f
  • hbase.hstore.compaction.min = 3 (files)
  • hbase.hstore.compaction.max = 5 (files)
  • hbase.hstore.compaction.min.size = 10 (bytes)
  • hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 7, 6, 5, 4, 3.

Why?

  • 7 --> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21. Also, 7 is less than the min-size
  • 6 --> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15. Also, 6 is less than the min-size.
  • 5 --> Yes, because sum(4, 3, 2, 1) * 1.0 = 10. Also, 5 is less than the min-size.
  • 4 --> Yes, because sum(3, 2, 1) * 1.0 = 6. Also, 4 is less than the min-size.
  • 3 --> Yes, because sum(2, 1) * 1.0 = 3. Also, 3 is less than the min-size.
  • 2 --> No. Candidate because previous file was selected and 2 is less than the min-size, but the max-number of files to compact has been reached.
  • 1 --> No. Candidate because previous file was selected and 1 is less than the min-size, but max-number of files to compact has been reached.

9.7.5.5.5. Impact of Key Configuration Options

hbase.hstore.compaction.ratio. A large ratio (e.g., 10) will produce a single giant file. Conversely, a value of .25 will produce behavior similar to the BigTable compaction algorithm, resulting in 4 StoreFiles.

hbase.hstore.compaction.min.size. Because this limit represents the "automatic include" limit for all StoreFiles smaller than this value, this value may need to be adjusted downwards in write-heavy environments where many 1 or 2 mb StoreFiles are being flushed, because every file will be targeted for compaction and the resulting files may still be under the min-size and require further compaction, etc.




9.8. 批量裝載(Bulk Loading)

9.8.1. 概述

HBase 有好幾種方法將數據裝載到表。最直接的方式既可以通過MapReduce任務,也可以使用普通的客戶端API。但這些都不是高效的方法。

批量裝載特性採用 MapReduce 任務,將表數據輸出爲HBase的內部數據格式,然後可以將產生的存儲文件直接裝載到運行的集羣中。批量裝載比簡單使用 HBase API 消耗更少的CPU和網絡資源。

9.8.2. 批量裝載架構

HBase 批量裝載過程包含兩個主要步驟。

9.8.2.1. 通過MapReduce 任務準備數據

批量裝載第一步,從MapReduce任務通過HFileOutputFormat產生HBase數據文件(StoreFiles) 。輸出數據爲HBase的內部數據格式,以便隨後裝載到集羣更高效。

爲了高效裝載,HFileOutputFormat 必須被配置爲使每個輸出的HFile都落在單個分區(region)之內。爲此,輸出將被批量裝載到HBase的作業會使用Hadoop 的TotalOrderPartitioner 類,把map輸出劃分爲互不相交的鍵空間區間,對應於表中各個分區(region)的鍵範圍。

HFileOutputFormat 包含一個方便的函數 configureIncrementalLoad(),可以基於表當前的分區邊界自動設置 TotalOrderPartitioner。
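下面是一個示意性的作業配置片段(僅爲說明用法;其中的 MyBulkLoadMapper、輸入/輸出路徑和表名 mytable 均爲假設,需按實際情況替換):

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "prepare-bulkload");              // 作業名爲假設
job.setJarByClass(MyBulkLoadMapper.class);
job.setMapperClass(MyBulkLoadMapper.class);               // 假設的 Mapper,輸出 (ImmutableBytesWritable, KeyValue)
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
FileInputFormat.addInputPath(job, new Path("/user/todd/input"));       // 輸入路徑(假設)
FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));  // 產生 StoreFiles 的輸出目錄
HTable table = new HTable(conf, "mytable");               // 目標表
// 基於表當前的分區邊界,自動設置 TotalOrderPartitioner 及相應的輸出格式
HFileOutputFormat.configureIncrementalLoad(job, table);
job.waitForCompletion(true);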

9.8.2.2. 完成數據裝載

After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.

If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.

9.8.3. 採用completebulkload 工具導入準備的數據

After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.

The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:

$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase).

Note: If the target table does not already exist in HBase, this tool will create the table automatically.

This tool will run quickly, after which point the new data will be visible in the cluster.

9.8.4. 參考

For more information about the referenced utilities, see Section 14.1.9, “ImportTsv” and Section 14.1.10, “CompleteBulkLoad”.

9.8.5. 高級使用

Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.

The import step of the bulk load can also be done programmatically. 參考 the LoadIncrementalHFiles class for more information.
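For illustration, a minimal sketch of the programmatic import step, assuming the LoadIncrementalHFiles API of the 0.92/0.94 line (the table name and output path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");              // target table (placeholder name)
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    // Walks the prepared HFiles under the output directory and hands each one
    // to the RegionServer hosting the corresponding region.
    loader.doBulkLoad(new Path("/user/todd/myoutput"), table);
    table.close();
  }
}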

9.9. HDFS

由於 HBase 在 HDFS 上運行(每個存儲文件也被寫爲HDFS的文件),必須理解 HDFS 結構,特別是它如何存儲文件、如何處理故障轉移以及如何複製數據塊。

參考 Hadoop 文檔 HDFS Architecture 獲取更多信息。

9.9.1. NameNode

NameNode 負責維護文件系統元數據。參考上述HDFS結構鏈接獲取更多信息。

9.9.2. DataNode

DataNode 負責存儲HDFS 塊。 參考上述HDFS結構鏈接獲取更多信息。

 


[23] 參考 HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it. We need to do more than just fail split if this flag is set.

[25] For description of the development process -- why static blooms rather than dynamic -- and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBase-1200.

[26] The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.

Chapter 10. 外部 API

This chapter will cover access to HBase either through non-Java languages, or through custom protocols.

10.1. 非Java 語言和 JVM 交互

當前本話題大部分文檔在 HBase Wiki. 參考 Thrift API Javadoc.

10.2. REST

當前 REST大部分文檔在 HBase Wiki on REST.

10.3. Thrift

當前 Thrift大部分文檔在 HBase Wiki on Thrift.

10.3.1. 過濾器語言

10.3.1.1. 用例

注意: 本特性在 HBase 0.92 中加入。

This allows the user to perform server-side filtering when accessing HBase over Thrift. The user specifies a filter via a string. The string is parsed on the server to construct the filter

10.3.1.2. 通用過濾字符串語法

A simple filter expression is expressed as: “FilterName (argument, argument, ... , argument)”

You must specify the name of the filter followed by the argument list in parentheses. Commas separate the individual arguments.

If the argument represents a string, it should be enclosed in single quotes.

If it represents a boolean, an integer or a comparison operator like <, >, != etc. it should not be enclosed in quotes

The filter name must be one word. All ASCII characters are allowed except for whitespace, single quotes and parentheses.

The filter’s arguments can contain any ASCII character. If single quotes are present in the argument, they must be escaped by a preceding single quote

10.3.1.3. Compound Filters and Operators

Currently, two binary operators – AND/OR and two unary operators – WHILE/SKIP are supported.

Note: the operators are all in uppercase

AND – as the name suggests, if this operator is used, the key-value must pass both the filters

OR – as the name suggests, if this operator is used, the key-value must pass at least one of the filters

SKIP – For a particular row, if any of the key-values don’t pass the filter condition, the entire row is skipped

WHILE - For a particular row, it continues to emit key-values until a key-value is reached that fails the filter condition

Compound Filters: Using these operators, a hierarchy of filters can be created. For example: “(Filter1 AND Filter2) OR (Filter3 AND Filter4)”

10.3.1.4. Order of Evaluation

Parentheses have the highest precedence. The SKIP and WHILE operators are next and have the same precedence. The AND operator has the next highest precedence, followed by the OR operator.

For example:

A filter string of the form:“Filter1 AND Filter2 OR Filter3” will be evaluated as:“(Filter1 AND Filter2) OR Filter3”

A filter string of the form:“Filter1 AND SKIP Filter2 OR Filter3” will be evaluated as:“(Filter1 AND (SKIP Filter2)) OR Filter3”

10.3.1.5. 比較運算符

比較運算符可以是下面之一:

  1. LESS (<)

  2. LESS_OR_EQUAL (<=)

  3. EQUAL (=)

  4. NOT_EQUAL (!=)

  5. GREATER_OR_EQUAL (>=)

  6. GREATER (>)

  7. NO_OP (no operation)

客戶端應該使用 (<, <=, =, !=, >, >=) 來表達比較操作.

10.3.1.6. 比較器(Comparator)

比較器可以是下面之一:

  1. BinaryComparator - This lexicographically compares against the specified byte array using Bytes.compareTo(byte[], byte[])

  2. BinaryPrefixComparator - This lexicographically compares against a specified byte array. It only compares up to the length of this byte array.

  3. RegexStringComparator - This compares against the specified byte array using the given regular expression. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator

  4. SubStringComparator - This tests if the given substring appears in a specified byte array. The comparison is case insensitive. Only EQUAL and NOT_EQUAL comparisons are valid with this comparator

The general syntax of a comparator is: ComparatorType:ComparatorValue

The ComparatorType for the various comparators is as follows:

  1. BinaryComparator - binary

  2. BinaryPrefixComparator - binaryprefix

  3. RegexStringComparator - regexstring

  4. SubStringComparator - substring

The ComparatorValue can be any value.

Example1: >, 'binary:abc' will match everything that is lexicographically greater than "abc"

Example2: =, 'binaryprefix:abc' will match everything whose first 3 characters are lexicographically equal to "abc"

Example3: !=, 'regexstring:ab*yz' will match everything that doesn't begin with "ab" and ends with "yz"

Example4: =, 'substring:abc123' will match everything that begins with the substring "abc123"

10.3.1.7.  PHP 客戶端編程使用過濾器示例

<?
$_SERVER['PHP_ROOT'] = realpath(dirname(__FILE__).'/..');
require_once $_SERVER['PHP_ROOT'].'/flib/__flib.php';
flib_init(FLIB_CONTEXT_SCRIPT);
require_module('storage/hbase');
$hbase = new HBase('<server_name_running_thrift_server>', <port on which thrift server is running>);
$hbase->open();
$client = $hbase->getClient();
$result = $client->scannerOpenWithFilterString('table_name',
    "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))");
$to_print = $client->scannerGetList($result,1);
while ($to_print) {
  print_r($to_print);
  $to_print = $client->scannerGetList($result,1);
}
$client->scannerClose($result);
?>

10.3.1.8. 過濾字符串示例

  • “PrefixFilter (‘Row’) AND PageFilter (1) AND FirstKeyOnlyFilter ()” will return all key-value pairs that match the following conditions:

    1) The row containing the key-value should have prefix “Row”

    2) The key-value must be located in the first row of the table

    3) The key-value pair must be the first key-value in the row

  • “(RowFilter (=, ‘binary:Row 1’) AND TimeStampsFilter (74689, 89734)) OR ColumnRangeFilter (‘abc’, true, ‘xyz’, false)” will return all key-value pairs that match both the following conditions:

    1) The key-value is in a row having row key “Row 1”

    2) The key-value must have a timestamp of either 74689 or 89734.

    Or it must match the following condition:

    1) The key-value pair must be in a column that is lexicographically >= abc and < xyz 

  • “SKIP ValueFilter (0)” will skip the entire row if any of the values in the row is not 0

10.3.1.9. 獨有過濾器語法

  1. KeyOnlyFilter

    Description: This filter doesn’t take any arguments. It returns only the key component of each key-value.

    Syntax: KeyOnlyFilter ()

    Example: "KeyOnlyFilter ()"

  2. FirstKeyOnlyFilter

    Description: This filter doesn’t take any arguments. It returns only the first key-value from each row.

    Syntax: FirstKeyOnlyFilter ()

    Example: "FirstKeyOnlyFilter ()"

  3. PrefixFilter

    Description: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix

    Syntax: PrefixFilter (‘<row_prefix>’)

    Example: "PrefixFilter (‘Row’)"

  4. ColumnPrefixFilter

    Description: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”

    Syntax:ColumnPrefixFilter(‘<column_prefix>’)

    Example: "ColumnPrefixFilter(‘Col’)"

  5. MultipleColumnPrefixFilter

    Description: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”

    Syntax:MultipleColumnPrefixFilter(‘<column_prefix>’, ‘<column_prefix>’, …, ‘<column_prefix>’)

    Example: "MultipleColumnPrefixFilter(‘Col1’, ‘Col2’)"

  6. ColumnCountGetFilter

    Description: This filter takes one argument – a limit. It returns the first limit number of columns in the table

    Syntax: ColumnCountGetFilter (‘<limit>’)

    Example: "ColumnCountGetFilter (4)"

  7. PageFilter

    Description: This filter takes one argument – a page size. It returns page size number of rows from the table.

    Syntax: PageFilter (‘<page_size>’)

    Example: "PageFilter (2)"

  8. ColumnPaginationFilter

    Description: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows

    Syntax: ColumnPaginationFilter(‘<limit>’, ‘<offset>’)

    Example: "ColumnPaginationFilter (3, 5)"

  9. InclusiveStopFilter

    Description: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row

    Syntax: InclusiveStopFilter(‘<stop_row_key>’)

    Example: "InclusiveStopFilter ('Row2')"

  10. TimeStampsFilter

    Description: This filter takes a list of timestamps. It returns those key-values whose timestamps matches any of the specified timestamps

    Syntax: TimeStampsFilter (<timestamp>, <timestamp>, ... ,<timestamp>)

    Example: "TimeStampsFilter (5985489, 48895495, 58489845945)"

  11. RowFilter

    Description: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row

    Syntax: RowFilter (<compareOp>, ‘<row_comparator>’)

    Example: "RowFilter (<=, ‘xyz)"

  12. FamilyFilter

    Description: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column family

    Syntax: FamilyFilter (<compareOp>, ‘<family_comparator>’)

    Example: "FamilyFilter (=, ‘FamilyA’)"

  13. QualifierFilter

    Description: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column

    Syntax: QualifierFilter (<compareOp>,‘<qualifier_comparator>’)

    Example: "QualifierFilter (=,‘Column1’)"

  14. ValueFilter

    Description: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value

    Syntax: ValueFilter (<compareOp>,‘<value_comparator>’)

    Example: "ValueFilter (!=, ‘Value’)"

  15. DependentColumnFilter

    Description: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.

    The filter can also take an optional boolean argument – dropDependentColumn. If set to true, the column we were depending on doesn’t get returned.

    The filter can also take two more additional optional arguments – a compare operator and a value comparator, which are further checks in addition to the family and qualifier. If the dependent column is found, its value should also pass the value check and then only is its timestamp taken into consideration

    Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>, <compare operator>, ‘<value comparator>’)

    Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’, <boolean>)

    Syntax: DependentColumnFilter (‘<family>’, ‘<qualifier>’)

    Example: "DependentColumnFilter (‘conf’, ‘blacklist’, false, >=, ‘zebra’)"

    Example: "DependentColumnFilter (‘conf’, 'blacklist', true)"

    Example: "DependentColumnFilter (‘conf’, 'blacklist')"

  16. SingleColumnValueFilter

    Description: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.

    This filter also takes two additional optional boolean arguments – filterIfColumnMissing and setLatestVersionOnly

    If the filterIfColumnMissing flag is set to true the columns of the row will not be emitted if the specified column to check is not found in the row. The default value is false.

    If the setLatestVersionOnly flag is set to false, it will test previous versions (timestamps) too. The default value is true.

    These flags are optional; if used, you must set either both or neither.

    Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>’,<filterIfColumnMissing_boolean>, <latest_version_boolean>)

    Syntax: SingleColumnValueFilter(<compare operator>, ‘<comparator>’, ‘<family>’, ‘<qualifier>)

    Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’, true, false)"

    Example: "SingleColumnValueFilter (<=, ‘abc’,‘FamilyA’, ‘Column1’)"

  17. SingleColumnValueExcludeFilter

    Description: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.

    Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>',<latest_version_boolean>, <filterIfColumnMissing_boolean>)

    Syntax: SingleColumnValueExcludeFilter(<compare operator>, '<comparator>', '<family>', '<qualifier>')

    Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’,‘FamilyA’, ‘Column1’, ‘false’, ‘true’)"

    Example: "SingleColumnValueExcludeFilter (‘<=’, ‘abc’, ‘FamilyA’, ‘Column1’)"

  18. ColumnRangeFilter

    Description: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.

    If you don’t want to set the minColumn or the maxColumn – you can pass in an empty argument.

    Syntax: ColumnRangeFilter (‘<minColumn>’, <minColumnInclusive_bool>, ‘<maxColumn>’, <maxColumnInclusive_bool>)

    Example: "ColumnRangeFilter (‘abc’, true, ‘xyz’, false)"

10.4. C/C++ Apache HBase Client

Facebook的 Chip Turner 寫了個純 C/C++ 客戶端。 Check it out.

Chapter 11. 性能調優

11.1. 操作系統

11.1.1. 內存

RAM, RAM, RAM. 不要餓着 HBase.

11.1.2. 64-bit

使用 64-bit 平臺(和64-bit JVM).

11.1.3. 交換區

小心交換(swap)。建議將 swappiness 設爲 0。

11.2. 網絡

要避免網絡問題降低Hadoop和HBase的性能,最重要的因素也許就是所使用的交換機硬件。在項目範圍內,集羣規模翻倍或三倍甚至更多時,早期的決定可能導致嚴重的問題。

要考慮的重要事項:

  • 設備交換機容量
  • 系統連接數量
  • 上行容量

11.2.1. 單交換機

單交換機配置最重要的因素,是硬件的交換容量,即處理所有連接到交換機的系統所產生流量的能力。一些低價的商用硬件,相對於全交換(full backplane)的交換機,交換容量較低。

11.2.2. 多交換機

多交換機在系統結構中是潛在陷阱。低價硬件最常見的配置是用 1Gbps 上行鏈路連接到另一個交換機。這個常被忽略的窄帶寬點很容易成爲集羣通訊的瓶頸。特別是MapReduce任務通過該上行鏈路同時讀寫大量數據時,會導致鏈路飽和。

緩解該問題很簡單,可以通過多種途徑完成:

  • 針對要創建的集羣容量,採用合適硬件。
  • 採用更大單交換機配置,如單48 端口相較2x 24 端口爲優。
  • 配置上行端口聚合(port trunking)來利用多網絡接口增加交換機帶寬。(譯者注:port trunk:
    將交換機上的多個端口在物理上連接起來,在邏輯上捆綁在一起,形成一個擁有較大帶寬的端口,組成一個幹路,以達到平衡負載和提供備份線路,擴充帶寬的目的。 )

11.2.3. 多機架

多機架配置帶來多交換機同樣的潛在問題。導致性能降低的原因主要來自兩個方面:

  • 較低的交換機容量性能
  • 到其他機架的上行鏈路不足

如果機架上的交換機有合適的交換容量,可以處理所有主機全速通信,那麼下一個最可能出現的問題,就是集羣跨機架部署帶來的機架間通信問題。避免跨機架問題最簡單的辦法,是採用端口聚合來創建到其他機架的捆綁上行鏈路。然而該方法的缺點,是佔用了本可以接主機的端口。舉例:從機架A到機架B創建 8Gbps 端口通道,要佔用24個端口中的8個來做機架互連,ROI(投資回報率)很低;而採用太少的端口又意味着無法充分發揮集羣的能力。

機架間採用 10GbE 鏈接將極大提高性能。確保交換機支持 10GbE 上行鏈路或支持擴展卡,後者可以讓你把端口留給機器使用,而不是都用在上行鏈路上。

11.2.4. 網絡接口

所有網絡接口功能正常嗎?你確定?參考故障診斷用例:Section 13.3.1, “Case Study #1 (Performance Issue On A Single Node)”.

 

可以從 wiki Performance Tuning看起。這個文檔講了一些主要的影響性能的方面:RAM, 壓縮, JVM 設置, 等等。然後,可以看看下面的補充內容。

打開RPC-level日誌

在區域服務器打開RPC-level的日誌對於深度的優化是有好處的。一旦打開,日誌將噴涌而出。所以不建議長時間打開,只能看一小段時間。要想啓用RPC-level的日誌,可以使用區域服務器 UI點擊Log Level。將 org.apache.hadoop.ipc 的日誌級別設爲DEBUG。然後tail 區域服務器的日誌,進行分析。

要想關閉,只要把日誌級別設爲INFO就可以了.

11.3. Java

11.3.1. 垃圾收集和HBase

11.3.1.1. 長時間GC停頓

在這個PPT Avoiding Full GCs with MemStore-Local Allocation Buffers 中,Todd Lipcon描述了在HBase中常見的兩種stop-the-world的GC操作,尤其是在loading的時候。一種是CMS失敗的模式(譯者注:CMS是一種GC的算法),另一種是老一代的堆碎片導致的。要想解決第一種問題,只要將CMS執行的時間提前就可以了,加入-XX:CMSInitiatingOccupancyFraction參數,把值調低。可以先從60%和70%開始(這個值調的越低,觸發的GC次數就越多,消耗的CPU時間就越長)。要想解決第二種問題,Todd加入了一個實驗性的功能,在HBase 0.90.x中這個是要明確指定的(在0.92.x中,這個是默認項),將你的Configuration中的hbase.hregion.memstore.mslab.enabled設置爲true。詳細信息,可以看這個PPT.

11.4. 配置

參見Section 2.8.2, “推薦的配置”.

11.4.1. Regions的數目

HBase中region的數目可以根據Section 3.6.5, “更大的 Regions”調整.也可以參見 Section 12.3.1, “Region大小”

11.4.2. 管理壓縮

對於大型的系統,你需要考慮管理壓縮和分割

11.4.3. hbase.regionserver.handler.count

參見hbase.regionserver.handler.count.這個參數的本質是設置一個RegionServer可以同時處理多少請求。 如果定的太高,吞吐量反而會降低;如果定的太低,請求會被阻塞,得不到響應。你可以打開RPC-level日誌讀Log,來決定對於你的集羣什麼值是合適的。(請求隊列也是會消耗內存的)

11.4.4. hfile.block.cache.size

參見 hfile.block.cache.size. 對於區域服務器進程的內存設置。

11.4.5. hbase.regionserver.global.memstore.upperLimit

參見 hbase.regionserver.global.memstore.upperLimit. 這個內存設置是根據區域服務器的需要來設定。

11.4.6. hbase.regionserver.global.memstore.lowerLimit

參見 hbase.regionserver.global.memstore.lowerLimit. 這個內存設置是根據區域服務器的需要來設定。

11.4.7. hbase.hstore.blockingStoreFiles

參見hbase.hstore.blockingStoreFiles. 如果區域服務器的日誌中出現因StoreFile過多而阻塞(blocking)的信息,提高這個值是有幫助的。

11.4.8. hbase.hregion.memstore.block.multiplier

參見 hbase.hregion.memstore.block.multiplier. 如果有足夠的RAM,提高這個值。

11.5. ZooKeeper

配置ZooKeeper信息,請參考 Section 2.5, “ZooKeeper”  , 參看關於使用專用磁盤部分。

11.6. 模式設計

11.6.1.  列族的數目

參見 Section 6.2, “ On the number of column families ”.

11.6.2. 鍵和屬性長度

參考 Section 6.3.2, “Try to minimize row and column sizes”. 參考 also Section 11.6.7.1, “However...” for compression caveats.

11.6.3. 表的區域大小

The regionsize can be set on a per-table basis via setFileSize on HTableDescriptor in the event where certain tables require different regionsizes than the configured default regionsize.

參考 Section 11.4.1, “Number of Regions” for more information.

11.6.4. 布隆過濾

Bloom Filters can be enabled per-ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per Column Family. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.

參考 HColumnDescriptor and Section 9.7.6, “Bloom Filters” for more information.
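For illustration, a short fragment that enables a ROWCOL bloom when creating a table (the table and family names are placeholders; conf is an existing HBaseConfiguration, and the fragment assumes the 0.92/0.94-era StoreFile.BloomType enum):

HBaseAdmin admin = new HBaseAdmin(conf);
HColumnDescriptor cf = new HColumnDescriptor("cf");        // placeholder family name
cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);         // NONE | ROW | ROWCOL
HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder table name
desc.addFamily(cf);
admin.createTable(desc);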

11.6.5. 列族塊大小

The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).

參考 HColumnDescriptor and Section 9.7.5, “Store”for more information.

11.6.6. 內存中的列族

ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the Section 9.6.4, “Block Cache”, but it is not a guarantee that the entire table will be in memory.

參考 HColumnDescriptor for more information.

11.6.7. 壓縮

Production systems should use compression with their ColumnFamily definitions. 參考 Appendix C, Compression In HBase for more information.

11.6.7.1. 然而...

Compression deflates data on disk. When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. So while using ColumnFamily compression is a best practice, it's not going to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.

參考 Section 6.3.2, “Try to minimize row and column sizes” for schema design tips, and Section 9.7.5.4, “KeyValue” for more information on how HBase stores data internally.

 

11.7. 寫到 HBase

11.7.1. 批量裝載

如果可以的話,儘量使用批量導入工具,參見 Section 9.8, “批量裝載”.否則就要詳細看看下面的內容。

11.7.2. 表創建: 預創建區域(Region)

默認情況下HBase創建表會新建一個區域。執行批量導入,意味着所有的client會寫入這個區域,直到這個區域足夠大,以至於分裂。一個有效的提高批量導入的性能的方式,是預創建空的區域。最好稍保守一點,因爲過多的區域會實實在在的降低性能。下面是一個預創建區域的例子。 (注意:這個例子裏需要根據應用的key進行調整。):

public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
    throws IOException {
  try {
    admin.createTable(table, splits);
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for (int i=0; i < numRegions-1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}

11.7.3.  表創建: 延遲log刷寫

Puts的缺省行爲使用 Write Ahead Log (WAL),會導致 HLog 編輯立即寫盤。如果採用延遲刷寫,WAL編輯會保留在內存中,直到刷寫週期來臨。好處是集中和異步寫HLog,潛在問題是如果RegionServer退出,沒有刷寫的日誌將丟失。但這也比Puts時不使用WAL安全多了。

延遲log刷寫可以通過 HTableDescriptor 在表上設置,hbase.regionserver.optionallogflushinterval缺省值是1000ms.
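下面是一個示意片段(表名、列族名均爲假設,conf 爲已有的 HBaseConfiguration;假定使用 0.90/0.94 時代的 setDeferredLogFlush 接口),在建表時打開延遲刷寫:

HTableDescriptor desc = new HTableDescriptor("mytable");
desc.addFamily(new HColumnDescriptor("cf"));
desc.setDeferredLogFlush(true);             // 該表的 Put 採用延遲 WAL 刷寫
new HBaseAdmin(conf).createTable(desc);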

11.7.4. HBase 客戶端: 自動刷寫

 

當你進行大量的Put的時候,要確認你的HTable的setAutoFlush是關閉着的。否則的話,每執行一個Put就要向區域服務器發一個請求。通過 htable.put(Put) 和 htable.put(List<Put>) 將Put添加到寫緩衝中。如果 autoFlush = false,要等到寫緩衝都填滿的時候纔會發起請求。要想顯式地發起請求,可以調用flushCommits。在HTable實例上進行的close操作也會發起flushCommits。
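下面是一個示意片段(conf、表名和 puts 列表均爲假設),展示寫緩衝的典型用法:

HTable htable = new HTable(conf, "mytable");
htable.setAutoFlush(false);                    // 關閉自動刷寫
htable.setWriteBufferSize(1024 * 1024 * 12);   // 寫緩衝大小,示例值 12MB
for (Put put : puts) {
  htable.put(put);                             // 先進入客戶端寫緩衝
}
htable.flushCommits();                         // 顯式把緩衝中剩餘的 Put 發送出去
htable.close();                                // close 也會觸發 flushCommits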

11.7.5. HBase 客戶端: 在Puts上關閉WAL

一個經常討論的在Puts上增加吞吐量的選項是調用 writeToWAL(false)。關閉它意味着 RegionServer 不再將 Put 寫到 Write Ahead Log,僅寫到內存。然而後果是如果出現 RegionServer 失敗,將導致數據丟失。如果調用 writeToWAL(false),需保持高度警惕。如果你的負載已經很好地分佈到集羣中,你會發現關閉WAL帶來的提升其實並不明顯。

通常而言,最好對Puts使用WAL;如果更關注裝載吞吐量,應該改用批量裝載(bulk loading)等替代技術,而不是關閉WAL。
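如果確實要嘗試該選項,下面是一個示意片段(行鍵、列族、列名均爲假設,htable 爲已有的 HTable 實例),請務必清楚其數據丟失風險:

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
put.setWriteToWAL(false);   // 跳過 WAL:吞吐可能提高,但 RegionServer 故障時該數據會丟失
htable.put(put);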

11.7.6. HBase 客戶端: Group Puts by RegionServer

In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own verison for those still on 0.90.x or earlier.

11.7.7. MapReduce: Skip The Reducer

When writing a lot of data to an HBase table from a MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.

For summary jobs where HBase is used as a source and a sink, then writes will be coming from the Reducer step (e.g., summarize values then write out result). This is a different processing problem than the above case.

11.7.8. Anti-Pattern: One Hot Region

If all your data is being written to one region at a time, then re-read the section on processing timeseries data.

Also, if you are pre-splitting regions and all your data is still winding up in a single region even though your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a variety of reasons that regions may appear "well split" but won't work with your data. As the HBase client communicates directly with the RegionServers, this can be obtained via HTable.getRegionLocation.

參考 Section 11.7.2, “ Table Creation: Pre-Creating Regions ”, as well as Section 11.4, “HBase Configurations”

11.8. 從HBase讀

11.8.1. Scan 緩存

如果HBase的輸入源是一個MapReduce Job,要確保輸入的Scan的setCaching值要比默認值(1)大。使用默認值就意味着map-task每一行都會去請求一下region-server。可以把這個值設爲500,這樣就可以一次傳輸500行。當然這也是需要權衡的,過大的值會同時消耗客戶端和服務端很大的內存,不是越大越好。
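下面是一個示意片段(表名和 MyMapper 均爲假設,job 爲已創建的 MapReduce Job),在把 Scan 交給作業之前設置緩存:

Scan scan = new Scan();
scan.setCaching(500);          // 每次 RPC 取回 500 行
scan.setCacheBlocks(false);    // MapReduce 作業通常關閉塊緩存(參見 11.8.5)
TableMapReduceUtil.initTableMapperJob("mytable", scan,
    MyMapper.class,                              // 假設的 TableMapper 實現
    ImmutableBytesWritable.class, Result.class,  // map 輸出類型,按作業需要調整
    job);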

11.8.1.1. Scan Caching in MapReduce Jobs

Scan settings in MapReduce jobs deserve special attention. Timeouts can result (e.g., UnknownScannerException) in Map tasks if it takes longer to process a batch of records before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.

Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.

11.8.2. Scan 屬性選擇

當Scan用來處理大量的行的時候(尤其是作爲MapReduce的輸入),要注意的是選擇了什麼字段。如果調用了 scan.addFamily,這個列族的所有屬性都會返回。如果只是想過濾其中的一小部分,就指定那幾個column,否則就會造成很大浪費,影響性能。
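下面是一個示意片段(列族、列名均爲假設),只選取真正需要的列:

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));   // 只取需要的列
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col2"));
// 相比之下,scan.addFamily(Bytes.toBytes("cf")) 會返回該列族的所有列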

11.8.3. MapReduce - Input Splits

For MapReduce jobs that use HBase tables as a source, if there a pattern where the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in Section 13.3.1, “Case Study #1 (Performance Issue On A Single Node)”.

11.8.4. 關閉 ResultScanners

這與其說是提高性能,倒不如說是避免發生性能問題。如果你忘記了關閉ResultScanners,會導致RegionServer出現問題。所以一定要把ResultScanner的處理包含在try/finally塊中,保證它被關閉...

Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();

11.8.5. 塊緩存

Scan實例可以在RegionServer中使用塊緩存,可以由setCacheBlocks方法控制。如果Scan是MapReduce的輸入源,要將這個值設置爲 false。對於經常讀到的行,就建議使用塊緩衝。

11.8.6.  行鍵的負載優化

scan一個表的時候,如果僅僅需要行鍵(不需要families, qualifiers, values 和 timestamps),可以用Scan的setFilter方法設置一個採用MUST_PASS_ALL操作符(譯者注:相當於And操作符)的FilterList,該FilterList要包含一個 FirstKeyOnlyFilter 和一個 KeyOnlyFilter。通過這樣的filter組合,就算在最壞的情況下,RegionServer也只會從磁盤讀一個值,同時最小化客戶端的網絡帶寬佔用。
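下面是一個示意片段(htable 爲已有的 HTable 實例),組合這兩個過濾器只讀取行鍵:

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());   // 每行只返回第一個 KeyValue
filters.addFilter(new KeyOnlyFilter());        // 只返回鍵,不返回值
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);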

11.8.7. Concurrency: Monitor Data Spread

When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions then the reads could likely be served from too few nodes.

參考 Section 11.7.2, “ Table Creation: Pre-Creating Regions ”, as well as Section 11.4, “HBase Configurations”

11.8.8. Bloom Filters

Enabling Bloom Filters can save you having to go to disk and can help improve read latency.

Bloom filters were developed over in HBase-1200 Add bloomfilters.[28][29]

See also Section 11.6.4, “Bloom Filters”.

11.8.8.1. Bloom StoreFile footprint

Bloom filters add an entry to the StoreFile general FileInfo data structure and then two extra entries to the StoreFile metadata section.

11.8.8.1.1. BloomFilter in the StoreFile FileInfo data structure

FileInfo has a BLOOM_FILTER_TYPE entry which is set to NONE, ROW or ROWCOL.

11.8.8.1.2. BloomFilter entries in StoreFile metadata

BLOOM_FILTER_META holds Bloom Size, Hash Function used, etc. It's small in size and is cached on StoreFile.Reader load.

BLOOM_FILTER_DATA is the actual bloomfilter data. Obtained on-demand. Stored in the LRU cache, if it is enabled (it's enabled by default).

 

11.8.8.2. 布隆過濾(Bloom Filter) 配置

11.8.8.2.1. io.hfile.bloom.enabled 全局開關

配置文件中的 io.hfile.bloom.enabled 是一個在出問題時用來禁用布隆過濾的全局開關(kill switch)。Default = true.

11.8.8.2.2. io.hfile.bloom.error.rate

io.hfile.bloom.error.rate = 平均假陽性率(average false positive rate). 缺省 = 1%. 錯誤率每降低一半(如降到 0.5%),每個布隆條目就多佔用 1 位。

11.8.8.2.3. io.hfile.bloom.max.fold

io.hfile.bloom.max.fold = 保證最小摺疊速率(guaranteed minimum fold rate). 大多時候不要管. Default = 7, 或壓縮到原來大小的至少 1/128. 想獲取更多本選項的意義,參看本文檔 開發進程 節 BloomFilters in HBase 

 

 

11.9. 從HBase刪除

11.9.1. 將 HBase 表當 Queues

HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in this manner. As is documented in Chapter 5, Data Model, marking rows as deleted creates additional StoreFiles which then need to be processed on reads. Tombstones only get cleaned up with major compactions.

參考 also Section 9.7.5.5, “Compaction” and HBaseAdmin.majorCompact.

11.9.2. 刪除的 RPC 行爲

Be aware that htable.delete(Delete) doesn't use the writeBuffer. It will execute a RegionServer RPC with each invocation. For a large number of deletes, consider htable.delete(List).

參考 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29
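For illustration, a minimal sketch of batching deletes (the htable instance and the rowsToDelete collection are placeholders):

List<Delete> deletes = new ArrayList<Delete>();
for (byte[] row : rowsToDelete) {
  deletes.add(new Delete(row));
}
htable.delete(deletes);    // batches the deletes instead of issuing one RPC per Delete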

11.10. HDFS

由於 HBase 在 Section 9.9, “HDFS” 上運行,it is important to understand how it works and how it affects HBase.

11.10.1. Current Issues With Low-Latency Reads

The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority. With the increased adoption of HBase this is changing, and several improvements are already in development. 參考 the Umbrella Jira Ticket for HDFS Improvements for HBase.

11.10.2. Performance Comparisons of HBase vs. HDFS

A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS will always be faster in this use-case.

11.11. Amazon EC2

Performance questions are common on Amazon EC2 environments because it is a shared environment. You will not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same reason (i.e., it's a shared environment and you don't know what else is happening on the server).

If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front, because EC2 issues are practically a separate class of performance issues.

11.12. Case Studies

For Performance and Troubleshooting Case Studies, see Chapter 13, Case Studies.

Chapter 12. HBase的故障排除和Debug

12.1. 一般準則

首先可以看看master的log。通常情況下,他總是一行一行的重複信息。如果不是這樣,說明有問題,可以Google或是用search-hadoop.com來搜索遇到的exception。

一個錯誤通常不是單獨出現在HBase中的,通常是某一個地方發生了異常,然後對其他的地方發生影響。到處都是exception和stack traces。遇到這樣的錯誤,最好的辦法是查日誌,找到最初的異常。例如Region會在abort的時候打印一些信息。Grep這個Dump就有可能找到最初的異常信息。

RegionServer的自殺是很“正常”的。當一些事情發生錯誤的,他們就會自殺。如果ulimit和xcievers(最重要的兩個設定,詳見Section 2.2.5, “ ulimit 和 nproc)沒有修改,HDFS將無法運轉正常,在HBase看來,HDFS死掉了。假想一下,你的MySQL突然無法訪問它的文件系統,他會怎麼做。同樣的事情會發生在HBase和HDFS上。還有一個造成RegionServer切腹(譯者注:竟然用日文詞)自殺的常見的原因是,他們執行了一個長時間的GC操作,這個時間超過了ZooKeeper的session timeout。關於GC停頓的詳細信息,參見Todd Lipcon的 3 part blog post by Todd Lipcon 和上面的 Section 11.3.1.1, “長時間GC停頓”.

12.2. Logs

重要日誌的位置( <user>是啓動服務的用戶,<hostname> 是機器的名字)

NameNode: $HADOOP_HOME/logs/hadoop-<user>-namenode-<hostname>.log

DataNode: $HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log

JobTracker: $HADOOP_HOME/logs/hadoop-<user>-jobtracker-<hostname>.log

TaskTracker: $HADOOP_HOME/logs/hadoop-<user>-tasktracker-<hostname>.log

HMaster: $HBASE_HOME/logs/hbase-<user>-master-<hostname>.log

RegionServer: $HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log

ZooKeeper: TODO

12.2.1. Log 位置

對於單節點模式,Log都會在一臺機器上,但是對於生產環境,都會運行在一個集羣上。

12.2.1.1. NameNode

NameNode的日誌在NameNode server上。HBase Master 通常也運行在NameNode server上,ZooKeeper通常也是這樣。

對於小一點的機器,JobTracker也通常運行在NameNode server上面。

12.2.1.2. DataNode

每一臺DataNode server有一個HDFS的日誌,同時每臺RegionServer有一個HBase日誌。

每個DataNode server還有一份TaskTracker的日誌,來記錄MapReduce的Task信息。

12.2.2. Log Levels

12.2.2.1. Enabling RPC-level logging

Enabling the RPC-level logging on a RegionServer can often given insight on timings at the server. Once enabled, the amount of log spewed is voluminous. It is not recommended that you leave this logging on for more than short bursts of time. To enable RPC-level logging, browse to the RegionServer UI and click on Log Level. Set the log level to DEBUG for the package org.apache.hadoop.ipc (That's right, for hadoop.ipc, NOT hbase.ipc). Then tail the RegionServers log. Analyze.

To disable, set the logging level back to INFO level.

12.2.3. JVM Garbage Collection Logs

HBase is memory intensive, and using the default GC you can see long pauses in all threads including the Juliet Pause aka "GC of Death". To help debug this or confirm this is happening GC logging can be turned on in the Java virtual machine.

To enable, in hbase-env.sh add:

export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"

Adjust the log directory to wherever you log. Note: The GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk.

At this point you should see logs like so:

64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64898.953: [CMS-concurrent-mark-start]
64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]

In this section, the first line indicates a 0.0007360 second pause for the CMS to initially mark. This pauses the entire VM, all threads for that period of time.

The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds - aka 10 milliseconds. It has reduced the "ParNew" from about 5.5m to 576k. Later on in this cycle we see:

64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs]
64901.445: [CMS-concurrent-preclean-start]
64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs]
64901.578: [CMS-concurrent-abortable-preclean-start]
64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
64901.621: [CMS-concurrent-sweep-start]

The first line indicates that the CMS concurrent mark (finding garbage) has taken 2.4 seconds. But this is a _concurrent_ 2.4 seconds, Java has not been paused at any point in time.

There are a few more minor GCs, then there is a pause at the 2nd last line:

64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]

The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap.

At this point the sweep starts, and you can watch the heap size go down:

64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
... lines removed ...
64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs]

At this point, the CMS sweep took 3.332 seconds, and heap went from about ~ 2.8 GB to 1.3 GB (approximate).

The key points here is to keep all these pauses low. CMS pauses are always low, but if your ParNew starts growing, you can see minor GC pauses approach 100ms, exceed 100ms and hit as high at 400ms.

This can be due to the size of the ParNew, which should be relatively small. If your ParNew is very large after running HBase for a while, in one example a ParNew was about 150MB, then you might have to constrain the size of ParNew (The larger it is, the longer the collections take but if its too small, objects are promoted to old gen too quickly). In the below we constrain new gen size to 64m.

Add this to HBASE_OPTS:

export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above> <gc logging options from above>"

For more information on GC pauses, see the 3 part blog post by Todd Lipcon and Section 11.3.1.1, “Long GC pauses” above.

12.3. Resources

12.3.1. search-hadoop.com

search-hadoop.com indexes all the mailing lists and is great for historical searches. Search here first when you have an issue as it's more than likely someone has already had your problem.

12.3.2. Mailing Lists

Ask a question on the HBase mailing lists. The 'dev' mailing list is aimed at the community of developers actually building HBase and for features currently under development, and 'user' is generally used for questions on released versions of HBase. Before going to the mailing list, make sure your question has not already been answered by searching the mailing list archives first. Use Section 12.3.1, “search-hadoop.com”. Take some time crafting your question[28]; a quality question that includes all context and exhibits evidence the author has tried to find answers in the manual and out on lists is more likely to get a prompt response.

12.3.3. IRC

#hbase on irc.freenode.net

12.3.4. JIRA

JIRA is also really helpful when looking for Hadoop/HBase-specific issues.



[28] 參考 Getting Answers

12.4. 工具

12.4.1. Builtin Tools

12.4.1.1. Master Web Interface

The Master starts a web-interface on port 60010 by default.

The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally, the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap). The Master web UI allows navigation to each RegionServer's web UI.

12.4.1.2. RegionServer Web Interface

RegionServers starts a web-interface on port 60030 by default.

The RegionServer web UI lists online regions and their start/end keys, as well as point-in-time RegionServer metrics (requests, regions, storeFileIndexSize, compactionQueueSize, etc.).

參考 Section 14.4, “HBase Metrics” for more information in metric definitions.

12.4.1.3. zkcli

zkcli is a very useful tool for investigating ZooKeeper-related issues. To invoke:

./hbase zkcli -server host:port <cmd> <args>

The commands (and arguments) are:

connect host:port
get path [watch]
ls path [watch]
set path data [version]
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
sync path
redo cmdno
addauth scheme auth
delete path [version]
setquota -n|-b val path

12.4.2. External Tools

12.4.2.1 tail

tail是一個命令行工具,可以用來看日誌的尾巴。加上"-f"參數後,就會在有新數據寫入的時候自動刷新。用它來看日誌很方便。例如,一臺機器需要花很多時間來啓動或關閉,你可以tail它的master log(也可以是region server的log)。

12.4.2.2 top

top是一個很重要的工具來看你的機器各個進程的資源佔用情況。下面是一個生產環境的例子:

top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84
Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers
Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached

  PID USER   PR NI VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
15558 hadoop 18 -2 3292m 2.4g 3556 S   79 10.4 6523:52 java
13268 hadoop 18 -2 8967m 8.2g 4104 S   21 35.1 5170:30 java
 8895 hadoop 18 -2 1581m 497m 3420 S   11  2.1 4002:32 java
…

這裏你可以看到系統的load average在最近5分鐘是3.75,意思就是說這5分鐘裏面平均有3.75個線程在CPU時間的等待隊列裏面。通常來說,最完美的情況是這個值和CPU核數相等,比這個值低意味着資源閒置,比這個值高就是過載了。這是一個重要的概念,要想理解的更多,可以看這篇文章http://www.linuxjournal.com/article/9001.

再看內存,我們可以看到系統已經幾乎使用了它的全部RAM,其中大部分都是用於OS cache(這是一件好事)。Swap只使用了一點點KB,這正是我們期望的,如果數值很高的話,就意味着在進行交換,這對Java程序的性能是致命的。另一種檢測交換的方法是看Load average是否過高(load average過高還可能是磁盤損壞或者其它什麼原因導致的)。

默認情況下進程列表不是很有用,我們可以看到3個Java進程使用了111%的CPU。要想知道哪個進程是什麼,可以輸入"c",每一行就會擴展信息。輸入“1”可以顯示CPU的每個核的具體狀況。

12.4.2.3  jps

jps是JDK集成的一個工具,可以用來看當前用戶的Java進程id。(如果是root,可以看到所有用戶的id),例如:

hadoop@sv4borg12:~$ jps
1322 TaskTracker
17789 HRegionServer
27862 Child
1158 DataNode
25115 HQuorumPeer
2950 Jps
19750 ThriftServer
18776 jmx

按順序看

  • Hadoop TaskTracker,管理本地的Task
  • HBase RegionServer,提供region的服務
  • Child, 一個 MapReduce task,無法看出詳細類型
  • Hadoop DataNode, 管理blocks
  • HQuorumPeer, ZooKeeper集羣的成員
  • Jps, 就是這個進程
  • ThriftServer, 當thrift server啓動後,就會有這個進程
  • jmx, 這個是本地監控平臺的進程。你可以不用這個。

你還可以看到啓動這個進程時的全部命令行信息:

hadoop@sv4borg12:~$ ps aux | grep HRegionServer hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start

12.4.2.4  jstack

jstack 是一個最重要(除了看Log)的java工具,可以看到具體的Java進程的在做什麼。可以先用Jps看到進程的Id,然後就可以用jstack。他會按線程的創建順序顯示線程的列表,還有這個線程在做什麼。下面是例子:

這個主線程是一個RegionServer正在等master返回什麼信息。

"regionserver60020" prio=10 tid=0x0000000040ab4000 nid=0x45cf waiting on condition [0x00007f16b6a96000..0x00007f16b6a96a70] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f16cd5c2f30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647) at java.lang.Thread.run(Thread.java:619) The MemStore flusher thread that is currently flushing to a file: "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:803) - locked <0x00007f16cb14b3a8> (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221) at $Proxy1.complete(Unknown Source) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.complete(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3390) - locked <0x00007f16cb14b470> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3304) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:650) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:853) at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:467) - locked <0x00007f16d00e6f08> (a java.lang.Object) at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:427) at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:80) at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1359) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:907) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)

一個處理線程是在等一些東西(例如put, delete, scan...):

"IPC Server handler 16 on 60020" daemon prio=10 tid=0x00007f16b011d800 nid=0x4a5e waiting on condition [0x00007f16afefd000..0x00007f16afefd9f0] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f16cd3f8dd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1013)

有一個線程正在忙,在遞增一個counter(這個階段是正在創建一個scanner來讀最新的值):

"IPC Server handler 66 on 60020" daemon prio=10 tid=0x00007f16b006e800 nid=0x4a90 runnable [0x00007f16acb77000..0x00007f16acb77cf0] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.regionserver.KeyValueHeap.<init>(KeyValueHeap.java:56) at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:79) at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2209) at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1063) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1055) at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1039) at org.apache.hadoop.hbase.regionserver.HRegion.getLastIncrement(HRegion.java:2875) at org.apache.hadoop.hbase.regionserver.HRegion.incrementColumnValue(HRegion.java:2978) at org.apache.hadoop.hbase.regionserver.HRegionServer.incrementColumnValue(HRegionServer.java:2433) at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:560) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1027)

還有一個線程在從HDFS獲取數據。

"IPC Client (47) connection to sv4borg9/10.4.24.40:9000 from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked <0x00007f17d5b68c00> (a sun.nio.ch.Util$1) - locked <0x00007f17d5b68be8> (a java.util.Collections$UnmodifiableSet) - locked <0x00007f1877959b50> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:304) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00007f1808539178> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:569) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:477)

這裏是一個RegionServer死了,master正在試着恢復。

"LeaseChecker" daemon prio=10 tid=0x00000000407ef800 nid=0x76cd waiting on condition [0x00007f6d0eae2000..0x00007f6d0eae2a70] -- java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:726) - locked <0x00007f6d1cd28f80> (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.recoverBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2636) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2832) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:529) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530) at org.apache.hadoop.hbase.util.FSUtils.recoverFileLease(FSUtils.java:619) at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1322) at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1210) at org.apache.hadoop.hbase.master.HMaster.splitLogAfterStartup(HMaster.java:648) at org.apache.hadoop.hbase.master.HMaster.joinCluster(HMaster.java:572) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:503)

12.4.2.5  OpenTSDB

OpenTSDB是一個Ganglia的很好的替代品,因爲他使用HBase來存儲所有的時序而不需要採樣。使用OpenTSDB來監控你的HBase是一個很好的實踐

這裏有一個例子,集羣正在同時進行上百個compaction,嚴重影響了IO性能。(TODO: 在這裏插入compactionQueueSize的圖片)(譯者注:囧)

給集羣構建一個圖表監控是一個很好的實踐。包括集羣和每臺機器。這樣就可以快速定位到問題。例如,在StumbleUpon,每個機器有一個圖表監控,包括OS和HBase,涵蓋所有的重要的信息。你也可以登錄到機器上,獲取更多的信息。

12.4.2.6  clusterssh+top

clusterssh+top,感覺是一個窮人用的監控系統,但是它確實很有效,當你只有幾臺機器的時候,很好設置。啓動clusterssh後,每臺機器都會有一個終端,另外還有一個主終端,你在主終端的操作都會反映到其他每一個終端上。這就意味着,你在一臺機器上執行"top",集羣中的所有機器都會給你全部的top信息。你還可以這樣tail全部的log,等等。

12.5. 客戶端

 

HBase 客戶端的更多信息, 參考 Section 9.3, “Client”.

12.5.1. ScannerTimeoutException 或 UnknownScannerException

當從客戶端到RegionServer的RPC請求超時。例如如果Scan.setCaching的值設置爲500,RPC請求就要去獲取500行的數據,每500次.next()操作獲取一次。因爲數據是以大塊的形式傳到客戶端的,就可能造成超時。將這個setCaching的值調小是一個解決辦法,但是這個值要是設的太小就會影響性能。

參考 Section 11.8.1, “Scan Caching”.

12.5.2. 普通操作時,Shell 或客戶端應用拋出很多不太重要的異常

Since 0.20.0 the default log level for org.apache.hadoop.hbase.* is DEBUG.

On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: log4j.logger.org.apache.hadoop.hbase=DEBUG to this:log4j.logger.org.apache.hadoop.hbase=INFO, or even log4j.logger.org.apache.hadoop.hbase=WARN.

12.5.3. 壓縮時客戶端長時暫停

這是 HBase 郵件列表(dist-list)上經常被問到的問題。該場景一般發生在客戶端正在向相對未經優化的 HBase 集羣插入大量數據時。壓縮會加重暫停,儘管它並不是問題的源頭。

參考 Section 11.7.2, “Table Creation: Pre-Creating Regions”,關於預先創建區域的模式部分,並確認表沒有在單個區域中啓動。
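
下面是一個建表時預創建分區(pre-splitting)的簡單示意(表名 "myTable"、列族 "cf" 和分割點都只是假設,實際分割點應根據你的 rowkey 分佈確定):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("myTable");   // 示例表名
    desc.addFamily(new HColumnDescriptor("cf"));               // 示例列族
    // 手工給出分割點,建表時即產生 4 個 region,避免所有寫入都壓在單個 region 上
    byte[][] splits = new byte[][] {
        Bytes.toBytes("25"), Bytes.toBytes("50"), Bytes.toBytes("75") };
    admin.createTable(desc, splits);
    admin.close();
  }
}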

參考 Section 11.4, “HBase Configurations” for cluster configuration, particularly hbase.hstore.blockingStoreFiles, hbase.hregion.memstore.block.multiplier, MAX_FILESIZE (region size), and MEMSTORE_FLUSHSIZE.

A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores which are blocked by the flusher thread which is blocked because there are too many files to compact because the compactor is given too many small files to compact and has to compact the same data repeatedly. This situation can occur even with minor compactions. Compounding this situation, HBase doesn't compress data in memory. Thus, the 64MB that lives in the MemStore could become a 6MB file after compression - which results in a smaller StoreFile. The upside is that more data is packed into the same region, but performance is achieved by being able to write larger files - which is why HBase waits until the flush size before writing a new StoreFile. And smaller StoreFiles become targets for compaction. Without compression the files are much bigger and don't need as much compaction, however this is at the expense of I/O.

For additional information, see this thread on Long client pauses with compression.

12.5.4. ZooKeeper 客戶端連接錯誤

錯誤類似於...

11/07/05 11:26:41 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 11/07/05 11:26:43 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 11/07/05 11:26:44 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181

……這要麼是因爲 ZooKeeper 掛掉了,要麼是網絡問題導致其不可達。

工具 Section 12.4.1.3, “zkcli” 可以幫助調查 ZooKeeper 問題。

12.5.5. 客戶端內存耗盡,但堆大小看起來不太變化( off-heap/direct heap 在增長)

You are likely running into the issue that is described and worked through in the mail thread HBase, mail # user - Suspected memory leak and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. A workaround is passing your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By default, the MaxDirectMemorySize is equal to your -Xmx max heapsize setting (if -Xmx is set). Try setting it to something smaller (for example, one user had success setting it to 1g when they had a client-side heap of 12g). If you set it too small, it will bring on FullGCs so keep it a bit hefty. You want to make this setting client-side only, especially if you are running the new experimental server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep separate client-side and server-side config dirs).

12.5.6. 客戶端變慢,在調用管理方法(flush, compact, 等)時發生

該客戶端問題已由 HBASE-5073 在 0.90.6 版中修復。客戶端存在 ZooKeeper 泄露,每額外調用一次管理 API,客戶端就會被更多的 ZooKeeper 事件持續衝擊。

12.5.7. 安全客戶端不能連接 ([由 GSSException 引起: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])

There can be several causes that produce this symptom.

First, check that you have a valid Kerberos ticket. One is required in order to set up communication with a secure HBase cluster. Examine the ticket currently in the credential cache, if any, by running the klist command line utility. If no ticket is listed, you must obtain a ticket by running the kinit command with either a keytab specified, or by interactively entering a password for the desired principal.

Then, consult the Java Security Guide troubleshooting section. The most common problem addressed there is resolved by setting the javax.security.auth.useSubjectCredsOnly system property value to false.

Because of a change in the format in which MIT Kerberos writes its credentials cache, there is a bug in the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher. If you have this problematic combination of components in your environment, to work around this problem, first log in with kinit and then immediately refresh the credential cache with kinit -R. The refresh will rewrite the credential cache without the problematic formatting.

Finally, depending on your Kerberos configuration, you may need to install the Java Cryptography Extension, or JCE. Ensure the JCE jars are on the classpath on both server and client systems.

You may also need to download the unlimited strength JCE policy files. Uncompress and extract the downloaded file, and install the policy jars into <java-home>/lib/security.

12.6. MapReduce

12.6.1. 你認爲自己在用集羣, 實際上你在用本地(Local)

如下的調用棧在使用 ImportTsv時發生,但同樣的事可以在錯誤配置的任何任務中發生。

WARN mapred.LocalJobRunner: job_local_0001 java.lang.IllegalArgumentException: Can't read partitions file at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:111) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:776) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419) at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:296)

.. 看到調用棧的關鍵部分了嗎?就是...

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)

LocalJobRunner 意思就是任務跑在本地,不在集羣。

參考 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath for more information on HBase MapReduce jobs and classpaths.
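
下面是一個讀取 HBase 表的 MapReduce 作業骨架,僅作示意(表名 "myTable" 和 MyMapper 均爲假設)。TableMapReduceUtil.initTableMapperJob 會把 HBase 依賴的 jar 加入作業;而作業最終跑在集羣上還是退化爲 LocalJobRunner,取決於提交作業時 classpath 上是否有完整的集羣配置。

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class MyReadJob {
  static class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      // 僅作示例:不輸出任何內容
    }
  }

  public static void main(String[] args) throws Exception {
    // HBaseConfiguration.create() 讀取 classpath 上的 hbase-site.xml;
    // 作業提交相關的集羣配置(mapred-site.xml 等)同樣需要在 classpath 上,
    // 否則作業會退化爲 LocalJobRunner 在本地運行
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "my-hbase-read-job");
    job.setJarByClass(MyReadJob.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);   // MapReduce 作業一般建議關閉 block cache
    // initTableMapperJob 內部會調用 addDependencyJars,把 HBase 依賴加入分佈式緩存
    TableMapReduceUtil.initTableMapperJob("myTable", scan, MyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}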

12.7. NameNode

NameNode 更多信息, 參考 Section 9.9, “HDFS”.

12.7.1. 表和區域的HDFS 工具

要確定HBase 用了HDFS多大空間,可在NameNode使用 hadoop shell命令,例如...

hadoop fs -dus /hbase/

...返回全部HBase對象磁盤佔用的情況。

hadoop fs -dus /hbase/myTable

...返回HBase表'myTable'磁盤佔用的情況。

hadoop fs -du /hbase/myTable

...返回HBase的'myTable'表的各區域列表的磁盤佔用情況。

更多關於 HDFS shell 命令的信息,參考 HDFS 文件系統 Shell 文檔.

12.7.2. 瀏覽 HDFS ,查看 HBase 對象

有時需要瀏覽HDFS上的 HBase對象 。對象包括WALs (Write Ahead Logs), 表,區域,存儲文件等。最簡易的方法是在NameNode web應用中查看,端口 50070。NameNode web 應用提供到集羣中所有 DataNode 的鏈接,可以無縫瀏覽。

存儲在HDFS集羣中的HBase表的目錄結構是...

/hbase
    /<Table>                    (Tables in the cluster)
        /<Region>               (Regions for the table)
            /<ColumnFamily>     (ColumnFamilies for the Region for the table)
                /<StoreFile>    (StoreFiles for the ColumnFamily for the Regions for the table)

HDFS 中的HBase WAL目錄結構是..

/hbase
    /.logs
        /<RegionServer>    (RegionServers)
            /<HLog>        (WAL HLog files for the RegionServer)

參考HDFS User Guide 獲取其他非Shell診斷工具如fsck.

12.7.2.1. 用例

查詢 HDFS 中的 HBase 對象有兩個常見用例,都與表的壓縮(compaction)狀況有關:如果每個列族有大量存儲文件(StoreFile),表示可能需要進行主壓縮;另外,如果主壓縮之後得到的存儲文件仍然很小,可能意味着應該減少該表的列族數量。
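
下面是一段按照上文目錄結構統計每個列族存儲文件數量的簡單示意(表目錄 /hbase/myTable 只是假設,並且爲簡化起見跳過了以 "." 開頭的特殊子目錄):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileCounter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // 需要 classpath 上有指向集羣的 core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path tableDir = new Path("/hbase/myTable"); // 示例:HBase 根目錄下的表目錄
    for (FileStatus region : fs.listStatus(tableDir)) {
      if (!region.isDir() || region.getPath().getName().startsWith(".")) continue;
      for (FileStatus family : fs.listStatus(region.getPath())) {
        if (!family.isDir() || family.getPath().getName().startsWith(".")) continue;
        int count = fs.listStatus(family.getPath()).length;   // 該列族目錄下的存儲文件個數
        System.out.println(region.getPath().getName() + "/"
            + family.getPath().getName() + " : " + count + " storefile(s)");
      }
    }
  }
}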

12.8. 網絡

12.8.1. 網絡峯值(Network Spikes)

如果看到週期性網絡峯值,你可能需要檢查compactionQueues,是不是主壓縮正在進行。

參考 Section 2.8.2.8, “Managed Compactions” ,獲取更多管理壓縮的信息。

12.8.2. 迴環IP(Loopback IP)

HBase 希望迴環 IP 地址是 127.0.0.1. 參考開始章節 Section 2.2.3, “Loopback IP”.

12.8.3. 網絡接口

所有網絡接口是否正常?你確定嗎?參考故障診斷用例研究 Section 12.14, “Case Studies”.

12.9. 區域服務器

RegionServer 的更多信息,參考 Section 9.6, “RegionServer”.

12.9.1. 啓動錯誤

12.9.1.1. 主服務器啓動了,但區域服務器沒有啓動

主服務器相信區域服務器有IP地址127.0.0.1 - 這是 localhost 並被解析到主服務器自己的localhost.

區域服務器錯誤地向主服務器報告它們的IP地址是127.0.0.1。

修改區域服務器的 /etc/hosts 從...

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               fully.qualified.regionservername regionservername localhost.localdomain localhost
::1                     localhost6.localdomain6 localhost6

... 改到 (將主名稱從localhost中移掉)...

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1                     localhost6.localdomain6 localhost6

12.9.1.2. Compression Link Errors

由於 LZO 壓縮算法需要在集羣中的每臺機器上安裝,這是一個常見的啓動失敗原因。如果你看到了如下信息

11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1028)

就意味着你的壓縮庫出現了問題。參見配置章節的 LZO compression configuration.

12.9.2. 運行時錯誤

12.9.2.1. RegionServer Hanging

Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock all are blocked on? 參考 HBASE 3622 Deadlock in HBaseServer (JVM bug?). Adding -XX:+UseMembar to the HBase HBASE_OPTS in conf/hbase-env.sh may fix it.

Also, are you using Section 9.3.4, “RowLocks”? These are discouraged because they can lock up the RegionServers if not managed properly.

12.9.2.2. java.io.IOException...(Too many open files)

If you see log messages like this...

2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:883)

... 參見快速入門的章節 ulimit and nproc configuration.

12.9.2.3. xceiverCount 258 exceeds the limit of concurrent xcievers 256

這個時常會出現在DataNode的日誌中。

參見快速入門章節的 xceivers configuration.

 

12.9.2.4. 系統不穩定,DataNode或者其他系統進程有 "java.lang.OutOfMemoryError: unable to create new native thread in exceptions"的錯誤

參見快速入門章節的 ulimit and nproc configuration. The default on recent Linux distributions is 1024 - which is far too low for HBase.

12.9.2.5. DFS不穩定或者RegionServer租期超時

如果你收到了如下的消息:

2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000 2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying

……或者看到了 full GC 操作,那麼你可能正在經歷一次長時間的 full GC。

12.9.2.6. "No live nodes contain current block" and/or YouAreDeadException

這個錯誤有可能是OS的文件句柄用盡,也可能是網絡故障導致節點無法訪問。

參見快速入門章節 ulimit and nproc configuration,檢查你的網絡。

12.9.2.7. ZooKeeper SessionExpired events

Master or RegionServers shutting down with messages like those in the logs:

WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT] INFO org.apache.zookeeper.ClientCnxn: Server connection successful WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

The JVM is doing a long running garbage collecting which is pausing every threads (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.

  • Make sure you give plenty of RAM (in hbase-env.sh), the default of 1GB won't be able to sustain long running imports.
  • Make sure you don't swap, the JVM never behaves well under swapping.
  • Make sure you are not CPU starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
  • Increase the ZooKeeper session timeout

If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds.

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>

Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provision your cluster in order to lower the memory load on each machine (hence having less garbage to collect per machine).

If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.

參考 Section 12.11.2, “ZooKeeper, The Cluster Canary” for other general information about ZooKeeper troubleshooting.

12.9.2.8. NotServingRegionException

This exception is "normal" when found in the RegionServer logs at DEBUG level. This exception is returned back to the client and then the client goes back to .META. to find the new location of the moved region.

However, if the NotServingRegionException is logged at ERROR, then the client ran out of retries and something is probably wrong.

12.9.2.9. Regions listed by domain name, then IP

Fix your DNS. In versions of HBase before 0.92.x, reverse DNS needs to give the same answer as forward lookup. 參考 HBASE 3431 RegionServer is not using the name given it by the master; double entry in master listing of servers for gory details.

12.9.2.10. Logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages

We are not using the native versions of compression libraries. 參考 HBASE-1900 Put back native support when hadoop 0.21 is released. Copy the native libs from hadoop under hbase lib dir or symlink them into place and the message should go away.

12.9.2.11. Server handler X on 60020 caught: java.nio.channels.ClosedChannelException

If you see this type of message it means that the region server was trying to read/send data from/to a client but it already went away. Typical causes for this are if the client was killed (you see a storm of messages like this when a MapReduce job is killed or fails) or if the client receives a SocketTimeoutException. It's harmless, but you should consider digging in a bit more if you aren't doing something to trigger them.

12.9.3. 終止錯誤

12.10. Master

For more information on the Master, see Section 9.5, “Master”.

12.10.1. 啓動錯誤

12.10.1.1. Master says that you need to run the hbase migrations script

Upon running that, the hbase migrations script says no files in root directory.

HBase expects the root directory to either not exist, or to have already been initialized by hbase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur. Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. Sure fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.

12.10.2. 終止錯誤

12.11. ZooKeeper

12.11.1. 啓動錯誤

12.11.1.1. Could not find my address: xyz in list of ZooKeeper quorum servers

A ZooKeeper server wasn't able to start, throws that error. xyz is the name of your server.

This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration.

Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN.

12.11.2. ZooKeeper, The Cluster Canary

ZooKeeper is the cluster's "canary in the mineshaft". It'll be the first to notice issues if any so making sure its happy is the short-cut to a humming cluster.

參考 the ZooKeeper Operating Environment Troubleshooting page. It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in.

Additionally, the utility Section 12.4.1.3, “zkcli” may help investigate ZooKeeper issues.

 

12.12. Amazon EC2

12.12.1. ZooKeeper 在 Amazon EC2上看起來不工作?

HBase does not start when deployed as Amazon EC2 instances. Exceptions like the below appear in the Master and/or RegionServer logs:

2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181 2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861 java.net.ConnectException: Connection refused

Security group policy is blocking the ZooKeeper port on a public address. Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.

12.12.2. Instability on Amazon EC2

Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using Search Hadoop

12.12.3. Remote Java Connection into EC2 Cluster Not Working

參考 Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance.

12.13. HBase and Hadoop version issues

12.13.1. NoClassDefFoundError when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)

HBase 0.90.x does not ship with hadoop-0.20.205.x, etc. To make it run, you need to replace the hadoop jars that HBase shipped with in its lib directory with those of the Hadoop you want to run HBase on. If even after replacing Hadoop jars you get the below exception:

sv4r6s38: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37) sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34) sv4r6s38: at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:229) sv4r6s38: at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:83) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:202) sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)

you need to copy under hbase/lib, the commons-configuration-X.jar you find in your Hadoop's lib directory. That should fix the above complaint.

12.14. 案例研究

For Performance and Troubleshooting Case Studies, see Chapter 13, Case Studies.


[29] 參考 Getting Answers

Chapter 13. 案例研究

13.1. 概述

This chapter will describe a variety of performance and troubleshooting case studies that can provide a useful blueprint on diagnosing cluster issues.

For more information on Performance and Troubleshooting, see Chapter 11, Performance Tuning and Chapter 12, Troubleshooting and Debugging HBase.

13.2. 模式設計

13.2.1. 數據列表

The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in HBase.

*** QUESTION ***

We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is store the majority of the data in a key, so we could have something like:

<FixedWidthUserName><FixedWidthValueId1>:"" (no value) <FixedWidthUserName><FixedWidthValueId2>:"" (no value) <FixedWidthUserName><FixedWidthValueId3>:"" (no value)
The other option we had was to do this entirely using:
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>... <FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values. So in one case reading the first thirty values would be:

scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)

The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?

My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).

Thanks for help / suggestions / follow-up questions.

*** ANSWER ***

If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

"user123, firstname, Paul", "user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).

And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.

Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here #schema.smackdown).
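
As a rough sketch of the "tall" access pattern (the table name "lists", the single column family and the fixed-width user id below are assumptions for illustration, not part of the original question), reading the first page of a user's list is just a short scan starting at that user's prefix:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TallListRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "lists");              // hypothetical table name
    byte[] user = Bytes.toBytes("user0000001");            // fixed-width user name (assumption)
    Scan scan = new Scan(user);                            // start at this user's first row
    scan.setStopRow(Bytes.add(user, new byte[] { (byte) 0xFF }));  // stay within this user's rows
    scan.setCaching(30);                                   // one RPC fetches the whole "page"
    ResultScanner scanner = table.getScanner(scan);
    try {
      int n = 0;
      for (Result r : scanner) {
        // the valueid is the tail of the row key; the cell value is empty in this design
        System.out.println(Bytes.toString(r.getRow()));
        if (++n >= 30) break;
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}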

Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
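
For example, one way to read a fixed slice of columns out of a "wide" row is a Get with a ColumnPaginationFilter; the table name "lists" and the column family "v" are again just placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WideListRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "lists");               // hypothetical table name
    Get get = new Get(Bytes.toBytes("user0000001"));        // one row per user (assumption)
    get.addFamily(Bytes.toBytes("v"));                      // hypothetical column family
    get.setFilter(new ColumnPaginationFilter(30, 0));       // first 30 columns of the row
    Result result = table.get(get);
    System.out.println("columns returned: " + result.size());
    table.close();
  }
}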

Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)

13.3. 性能/故障診斷

13.3.1. 用例 #1 (單節點性能問題)

13.3.1.1.場景

Following a scheduled reboot, one data node began exhibiting unusual behavior. Routine MapReduce jobs run against HBase tables which regularly completed in five or six minutes began taking 30 or 40 minutes to finish. These jobs were consistently found to be waiting on map and reduce tasks assigned to the troubled data node (e.g., the slow map tasks all had the same Input Split). The situation came to a head during a distributed copy, when the copy was severely prolonged by the lagging node.

13.3.1.2. 硬件

Datanodes:

  • Two 12-core processors
  • Six Enterprise SATA disks
  • 24GB of RAM
  • Two bonded gigabit NICs

Network:

  • 10 Gigabit top-of-rack switches
  • 20 Gigabit bonded interconnects between racks.

13.3.1.3. 假設

13.3.1.3.1. HBase "熱點" 區域

We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer process and cause slow response time. Examination of the HBase Master status page showed that the number of HBase requests to the troubled node was almost zero. Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions in progress. This effectively ruled out a "hot spot" as the root cause of the observed slowness.

13.3.1.3.2. HBase 分區具有非本地數據

Our next hypothesis was that one of the MapReduce tasks was requesting data from HBase that was not local to the datanode, thus forcing HDFS to request data blocks from other servers over the network. Examination of the datanode logs showed that there were very few blocks being requested over the network, indicating that the HBase region was correctly assigned, and that the majority of the necessary data was located on the node. This ruled out the possibility of non-local data causing a slowdown.

13.3.1.3.3. Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk

After concluding that the Hadoop and HBase were not likely to be the culprits, we moved on to troubleshooting the datanode's hardware. Java, by design, will periodically scan its entire memory space to do garbage collection. If system memory is heavily overcommitted, the Linux kernel may enter a vicious cycle, using up all of its resources swapping Java heap back and forth from disk to RAM as Java tries to run garbage collection. Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error. This can manifest as high iowait, as running processes wait for reads and writes to complete. Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory. However, using vmstat(1) and free(1), we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.

13.3.1.3.4. Slowness Due To High Processor Usage

Next, we checked to see whether the system was performing slowly simply due to very high computational load. top(1) showed that the system load was higher than normal, but vmstat(1) and mpstat(1) showed that the amount of processor being used for actual computation was low.

13.3.1.3.5. Network Saturation (The Winner)

Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces. The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. ifconfig(8) showed some unusual anomalies, namely interface errors, overruns, framing errors. While not unheard of, these kinds of errors are exceedingly rare on modern hardware which is operating as it should:

$ /sbin/ifconfig bond0 bond0 Link encap:Ethernet HWaddr 00:00:00:00:00:00 inet addr:10.x.x.x Bcast:10.x.x.255 Mask:255.255.255.0 UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6 <--- Look Here! Errors! TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2416328868676 (2.4 TB) TX bytes:3464991094001 (3.4 TB)

These errors immediately lead us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed. This was confirmed both by running an ICMP ping from an external host and observing round-trip-time in excess of 700ms, and by running ethtool(8) on the members of the bond interface and discovering that the active interface was operating at 100Mb/s, full duplex.

$ sudo ethtool eth0 Settings for eth0: Supported ports: [ TP ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Advertised pause frame use: No Advertised auto-negotiation: Yes Link partner advertised link modes: Not reported Link partner advertised pause frame use: No Link partner advertised auto-negotiation: No Speed: 100Mb/s <--- Look Here! Should say 1000Mb/s! Duplex: Full Port: Twisted Pair PHYAD: 1 Transceiver: internal Auto-negotiation: on MDI-X: Unknown Supports Wake-on: umbg Wake-on: g Current message level: 0x00000003 (3) Link detected: yes

In normal operation, the ICMP ping round trip time should be around 20ms, and the interface speed and duplex should read, "1000MB/s", and, "Full", respectively.

13.3.1.4. 結論

After determining that the active ethernet adapter was at the incorrect speed, we used the ifenslave(8) command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput:

On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.

13.3.2. 用例 #2 (性能研究 2012)

Investigation results of a self-described "we're not sure what's wrong, but it seems slow" problem. http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html

13.3.3. Case Study #3 (Performance Research 2010))

Investigation results of general cluster performance from 2010. Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. http://hstack.org/hbase-performance-testing/

13.3.4. Case Study #4 (xcievers Config)

Case study of configuring xceivers, and diagnosing errors from mis-configurations. http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html

參考 also Section 2.3.2, “dfs.datanode.max.xcievers”.

Chapter 14. HBase 運維管理

This chapter will cover operational tools and practices required of a running HBase cluster. The subject of operations is related to the topics of Chapter 12, Troubleshooting and Debugging HBase, Chapter 11, Performance Tuning, and Chapter 2, Configuration, but is a distinct topic in itself.

14.1. HBase 工具和實用程序

Here we list HBase tools for administration, analysis, fixup, and debugging.

14.1.1. Driver

There is a Driver class executed by the HBase jar that can be used to invoke frequently accessed utilities. For example,

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar

... will return...

An example program must be given as the first argument. Valid program names are:
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan

... for allowable program names.

14.1.2. HBase hbck

An fsck for your HBase install

To run hbck against your HBase cluster run

$ ./bin/hbase hbck

At the end of the commands output it prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail emitted. If inconsistencies, run hbck a few times because the inconsistency may be transient (e.g. cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (This latter is an experimental feature).

For more information, see Appendix B, hbck In Depth.

14.1.3. HFile 工具

參考 Section 9.7.5.2.2, “HFile Tool”.

14.1.4. WAL 工具

14.1.4.1. HLog tool

The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits. directory.

You can get a textual dump of a WAL file content by doing the following:

$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012

The return code will be non-zero if there are issues with the file, so you can test the health of the file by redirecting STDOUT to /dev/null and testing the program return code.

Similarly you can force a split of a log file directory by doing:

$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
14.1.4.1.1. HLogPrettyPrinter

HLogPrettyPrinter is a tool with configurable options to print the contents of an HLog.

14.1.5. Compression Tool

參考 Section C.1, “CompressionTest Tool”.

14.1.6. CopyTable

CopyTable is a utility that can copy part or all of a table, either to the same cluster or another cluster. The usage is as follows:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename

Options:

  • starttime Beginning of the time range. Without endtime means starttime to forever.
  • endtime End of the time range.
  • versions Number of cell versions to copy.
  • new.name New table's name.
  • peer.adr Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
  • families Comma-separated list of ColumnFamilies to copy.
  • all.cells Also copy delete markers and uncollected deleted cells (advanced option).

Args:

  • tablename Name of table to copy.

Example of copying 'TestTable' to a cluster that uses replication for a 1 hour window:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase TestTable

Scanner Caching

Caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

參考 Jonathan Hsieh's Online HBase Backups with CopyTable blog post for more on CopyTable.

14.1.7. 導出

導出實用工具可以將表的內容輸出成HDFS的序列化文件,如下調用:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

14.1.8. 導入

導入實用工具可以加載導出的數據回到HBase,如下調用:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

14.1.9. ImportTsv

ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload.

To load data via Puts (i.e., non-bulk loading):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>

To generate StoreFiles for bulk-loading:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>

These generated StoreFiles can be loaded into HBase via Section 14.1.10, “CompleteBulkLoad”.

14.1.9.1. ImportTsv 選項

Running ImportTsv with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns option.
This option takes the form of comma-separated column names, where each column name
is either a simple column family, or a columnfamily:qualifier. The special column name
HBASE_ROW_KEY is used to designate that this column should be used as the row key for
each imported record. You must specify exactly one column to be the row key, and you
must specify a column name for every column that exists in the input data.

By default importtsv will load data directly into HBase. To instead generate HFiles of
data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper

14.1.9.2. ImportTsv 示例

For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".

Assume that an input file exists as follows:

row1	c1	c2 row2	c1	c2 row3	c1	c2 row4	c1	c2 row5	c1	c2 row6	c1	c2 row7	c1	c2 row8	c1	c2 row9	c1 c2 row10	c1	c2

For ImportTsv to use this input file, the command line needs to look like this:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile

... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.

14.1.9.3. ImportTsv Warning

If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.

14.1.9.4. 參考

For more information about bulk-loading HFiles into HBase, see Section 9.8, “Bulk Loading”

14.1.10. CompleteBulkLoad

completebulkload 實用工具可以將產生的存儲文件移動到HBase表。該工具經常和Section 14.1.9, “ImportTsv” 的輸出聯合使用。

兩種方法調用該工具,帶顯式類名或通過驅動:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>

.. 通過驅動..

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>

批量導入 HFiles 到 HBase的更多信息 ,參考 Section 9.8, “Bulk Loading”.

14.1.11. WALPlayer

WALPlayer 實用工具可以重放 WAL 文件到 HBase.

The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables.

WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.

Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]>

For example:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2

14.1.12. RowCounter

RowCounter 實用工具可以統計表的行數。這是一個好工具,如果擔心元數據可能存在不一致,可以用於確認HBase可以讀取表的所有分塊。

$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename> [<column1> <column2>...]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

14.2. 區域管理

14.2.1. 主壓縮

主壓縮可以通過HBase shell 或 HBaseAdmin.majorCompact 進行。

注意:主壓縮並不進行區域合併。更多關於壓縮的信息,參考 Section 9.7.5.5, “Compaction”
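
下面是通過客戶端 API 觸發主壓縮的一個簡單示意(表名 "myTable" 僅爲示例;在 HBase shell 中等價的命令是 major_compact 'myTable'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MajorCompactExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // 對整張表觸發主壓縮;也可以傳入單個 region 的名字
    admin.majorCompact("myTable");   // "myTable" 僅爲示例表名
    admin.close();
  }
}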

14.2.2. 合併

Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).

$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>

If you feel you have too many regions and want to consolidate them, Merge is the utility you need. Merge must be run when the cluster is down. 參考 the O'Reilly HBase Book for an example of usage.

Additionally, there is a Ruby script attached to HBASE-1621 for region merging.

14.3. 節點管理

14.3.1. 節點下線

你可以在HBase的特定的節點上運行下面的腳本來停止RegionServer:

$ ./bin/hbase-daemon.sh stop regionserver

RegionServer會首先關閉所有的region,然後把它自己關閉。在停止的過程中,RegionServer會向ZooKeeper報告說它已經過期了。Master會發現RegionServer已經死了,把它當作崩潰的server來處理,並將其上的region分配到其他的節點上去。

在下線節點之前要停止Load Balancer

如果在運行load balancer的時候,一個節點要關閉, 則Load Balancer和Master的recovery可能會爭奪這個要下線的Regionserver。爲了避免這個問題,先將load balancer停止,參見下面的 Load Balancer.

RegionServer下線有一個缺點,就是其中的Region會有好一會兒離線。Region是按順序關閉的。如果一個server上有很多region,從第一個region被下線,到最後一個region被關閉,並且Master確認該RegionServer已經死亡,第一個region纔可以重新上線,整個過程要花很長時間。在HBase 0.90.2中,我們加入了一個功能,可以讓節點逐漸擺脫它的負載,最後關閉。HBase 0.90.2加入了 graceful_stop.sh腳本,可以這樣用:

$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop

要下線一臺RegionServer可以這樣做

$ ./bin/graceful_stop.sh HOSTNAME

這裏的HOSTNAME是你想下線的那臺RegionServer的hostname。

On HOSTNAME

傳遞給graceful_stop.sh的HOSTNAME必須和HBase使用的hostname一致,HBase用它來區分RegionServer。可以用Master的UI來檢查RegionServer的id,通常是hostname,也可能是FQDN。不管HBase使用的是哪一個,你都應該把它傳給graceful_stop.sh腳本;目前腳本還不支持用IP地址來推斷hostname,所以傳IP會讓它認爲該server不在運行,也就沒有辦法下線了。

graceful_stop.sh 腳本會一個一個地將region從RegionServer中移出去,以減少該RegionServer的負載。它會先移除一個region,把這個region安置到一個新的地方,再移除下一個,直到全部移除。最後graceful_stop.sh腳本會讓RegionServer停止。Master會注意到RegionServer已經下線,而這時所有的region已經重新部署好,RegionServer就可以乾乾淨淨地結束,沒有WAL日誌需要分割。

Load Balancer

當執行graceful_stop腳本的時候,要將Region Load Balancer關掉(否則balancer和下線腳本會在region部署的問題上存在衝突):

hbase(main):001:0> balance_switch false true 0 row(s) in 0.3590 seconds

上面是將balancer關掉,要想開啓:

hbase(main):001:0> balance_switch true false 0 row(s) in 0.3590 seconds

14.3.2. 依次重啓

你還可以讓這個腳本重啓一個RegionServer,不改變上面的Region的位置。要想保留數據的位置,你可以依次重啓(Rolling Restart),就像這樣:

$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &

Tail /tmp/log.txt來看腳本的運行過程.上面的腳本只對RegionServer進行操作。要確認load balancer已經關掉。還需要在之前更新master。下面是一段依次重啓的僞腳本,你可以借鑑它:

  1. 確認你的版本,保證配置已經rsync到整個集羣中。如果版本是0.90.2,需要打上HBASE-3744 和 HBASE-3756兩個補丁。

  2. 運行hbck確保你的集羣是一致的

    $ ./bin/hbase hbck

    當發現不一致的時候,可以修復他。

  3. 重啓Master:

    $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

  4. 關閉region balancer:

    $ echo "balance_switch false" | ./bin/hbase

  5. 在每個RegionServer上運行graceful_stop.sh

    $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &

    如果你在RegionServer還開起來thrift和rest server。還需要加上--thrift or --rest 選項 (參見 graceful_stop.sh 腳本的用法).

  6. 再次重啓Master.這會把已經死亡的server列表清空,重新開啓balancer.

  7. 運行 hbck 保證集羣是一致的

14.4. HBase 度量

14.4.1. 度量安裝

參見 Metrics,可以獲得如何啓用 Metrics 輸出(metrics emission)的指導。

14.4.2. 區域服務器度量

14.4.2.1. hbase.regionserver.blockCacheCount

內存中的Block cache item數量。這個是存儲文件(HFiles)的緩存中的數量。

14.4.2.2. hbase.regionserver.blockCacheEvictedCount

Number of blocks that had to be evicted from the block cache due to heap size constraints.

14.4.2.3. hbase.regionserver.blockCacheFree

內存中的Block cache memory 剩餘 (單位 bytes).

14.4.2.4. hbase.regionserver.blockCacheHitCachingRatio

Block cache hit caching ratio (0 to 100). The cache-hit ratio for reads configured to look in the cache (i.e., cacheBlocks=true).

14.4.2.5. hbase.regionserver.blockCacheHitCount

Number of blocks of StoreFiles (HFiles) read from the cache.

14.4.2.6. hbase.regionserver.blockCacheHitRatio

Block cache 命中率(0 到 100). Includes all read requests, although those with cacheBlocks=false will always read from disk and be counted as a "cache miss"

 

14.4.2.7. hbase.regionserver.blockCacheMissCount

StoreFiles (HFiles)請求的未在緩存中的分塊數量。

14.4.2.8. hbase.regionserver.blockCacheSize

內存中的Block cache 大小 (單位 bytes). i.e., memory in use by the BlockCache

14.4.2.9. hbase.regionserver.compactionQueueSize

compaction隊列的大小. 這個值是需要進行compaction的region數目

14.4.2.10. hbase.regionserver.flushQueueSize

Number of enqueued regions in the MemStore awaiting flush.

14.4.2.11. hbase.regionserver.fsReadLatency_avg_time

文件系統延遲 (ms). 這個值是平均讀HDFS的延遲時間

14.4.2.12. hbase.regionserver.fsReadLatency_num_ops

文件系統讀操作。

14.4.2.13. hbase.regionserver.fsSyncLatency_avg_time

文件系統同步延遲(ms). Latency to sync the write-ahead log records to the filesystem.

14.4.2.14. hbase.regionserver.fsSyncLatency_num_ops

Number of operations to sync the write-ahead log records to the filesystem.

14.4.2.15. hbase.regionserver.fsWriteLatency_avg_time

文件系統寫延遲(ms). Total latency for all writers, including StoreFiles and write-head log.

14.4.2.16. hbase.regionserver.fsWriteLatency_num_ops

Number of filesystem write operations, including StoreFiles and write-ahead log.

14.4.2.17. hbase.regionserver.memstoreSizeMB

所有的RegionServer的memstore大小 (MB)

14.4.2.18. hbase.regionserver.regions

RegionServer服務的regions數量

14.4.2.19. hbase.regionserver.requests

讀寫請求的全部數量。請求是指RegionServer的RPC數量,因此一次Get一個請求,但一個緩存設爲1000的Scan也會在每次調用'next'時導致一個請求。一個批量load是一個Hfile一個請求。

14.4.2.20. hbase.regionserver.storeFileIndexSizeMB

當前RegionServer的storefile索引的總大小(MB)

14.4.2.21. hbase.regionserver.stores

RegionServer打開的stores數量。一個stores對應一個列族。例如,一個包含列族的表有3個region在這個RegionServer上,對應一個 列族就會有3個store.

14.4.2.22. hbase.regionserver.storeFiles

RegionServer打開的存儲文件(HFile)數量。這個值一定大於等於store的數量。

14.5. HBase 監控

14.5.1. 概述

實踐證明,下面這組度量對每個區域服務器的宏觀監控最爲重要,特別是在像 OpenTSDB 這樣的系統中。如果你的集羣有性能問題,你很可能需要參考這組信息。

HBase:

  • Requests
  • Compactions queue

OS:

  • IO Wait
  • User CPU

Java:

  • GC

HBase度量的更多信息,參考 Section 14.4, “HBase Metrics”.

14.5.2. 查詢太慢的日誌

HBase 的慢查詢日誌由可分析的 JSON 結構描述,記錄那些運行時間太長或輸出太多的客戶端操作(Gets、Puts、Deletes 等)的屬性。“運行太久”和“輸出太多”的門限可以配置,如後面所述。輸出產生在主區域服務器日誌中,以便和其他日誌事件一起發現更多細節。日誌前面還會加上 (responseTooSlow)、(responseTooLarge)、(operationTooSlow)、(operationTooLarge) 這類區分標籤,以便當用戶只希望看到慢查詢時用 grep 過濾。

14.5.2.1. 配置

有兩個配置節可用於調整查詢太慢日誌的門限。

  • hbase.ipc.warn.response.time 不被記錄太慢日誌的查詢執行的最大毫秒數(millisecond) 。缺省10000, 即 10 秒。可設 -1 禁止通過時間長短記入日誌。
  • hbase.ipc.warn.response.size 不被記錄日誌的查詢可返回的最大字節數。缺省 100 MB,可設爲 -1 禁止通過大小記入日誌。

14.5.2.2. 度量

查詢太慢日誌暴露給了JMX 度量。

  • hadoop.regionserver_rpc_slowResponse 是一個全局度量,反映所有超過門限而被記入日誌的響應。
  • hadoop.regionserver_rpc_methodName.aboveOneSec 度量反映所有耗時超過一秒的響應。

14.5.2.3. 輸出

輸出以操作做標籤,如 (operationTooSlow)。如果調用是客戶端操作,如 Put、Get 或 Delete,會暴露詳細的指紋信息;否則標籤爲 (responseTooSlow),同樣提供可分析的 JSON 輸出,但細節較少,完全依賴 RPC 自身的超時和超量設置。如果是響應大小觸發了日誌記錄,則標籤中用 TooLarge 代替 TooSlow;在大小和時長同時觸發時,也使用 TooLarge。

14.5.2.4. 示例

2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}

注意,在"tables"結構裏的所有東西,是MultiPut的指紋打印的輸出。其餘的信息是RPC相關的,如處理時間和客戶端IP/port。 客戶端的其他操作的模式和通用結構與此相同,但根據單個操作的類型會有一定的不同。如果調用不是客戶端操作,則指紋細節信息將完全沒有。

對本示例而言,指出緩慢的原因很可能只是一個超大(約 100MB)的 multiput:multiPut 中每個 put 的 “vlen”(即 value length)域給出了該信息。

14.6. 集羣複製

參見 集羣複製.

14.7. HBase 備份

有兩種常用策略進行 HBase 備份:停止整個集羣再備份,以及在正在使用的集羣上備份。每一種途徑都有優缺點。

更多信息,參考Sematext的Blog HBase Backup Options.

14.7.1. 全停止備份

一些環境可以容忍暫時停止 HBase 集羣,例如集羣只用於後臺容量分析,並不提供前臺頁面。好處是 NameNode/Master 和 RegionServer 都是停止的,不會有任何機會丟失正在寫入的存儲文件或元數據變更。明顯的壞處是集羣被關閉。步驟包括:

14.7.1.1. 停止 HBase

14.7.1.2. Distcp

Distcp 既用於將HDFS裏面的HBase 目錄下的內容拷貝到當前集羣的另一個目錄,也可以拷貝到另一個集羣。

注意: Distcp 工作的情形是集羣關閉,沒有正在改變的文件。Distcp 不推薦用於正工作着的集羣。

14.7.1.3. 恢復 (如有必要)

通過distcp,從 HDFS備份的數據被拷貝到 '真實' 的hbase 目錄。 複製動作產生新的 HDFS 元數據, 所以並不需要從備份的NameNode 元數據恢復, 因爲是通過distcp從一個特定的 HDFS 目錄 (如, HBase 部分)複製, 不是整個HDFS 文件系統。

14.7.2. 工作集羣備份 - Replication

這種方法假設有另一個集羣。參考HBase的 replication 。

14.7.3. 工作集羣備份 - CopyTable

14.1.6節, “CopyTable” 工具,既可用於將一個表複製到同集羣的另一個表,也可將表複製到另一個集羣的表。

由於集羣在工作,有丟失正在改變的數據的風險。

14.7.4. 工作集羣備份 - Export

14.1.7節, “Export” 是一種將表的內容導出到同一集羣 HDFS 的方法。恢復數據時,可以使用 14.1.8節, “Import” 工具。

由於集羣在工作,有丟失正在改變的數據的風險。

14.8. 容量計劃

14.8.1.存儲

一個常見問題是 HBase 管理員需要估算一個 HBase 集羣要用多大存儲量。可以從幾個方面去考慮,最重要的是集羣要加載什麼數據。先從可靠地瞭解 HBase 內部如何存儲數據(KeyValue)開始。

14.8.1.1. KeyValue

HBase storage will be dominated by KeyValues. 參考 Section 9.7.5.4, “KeyValue” and Section 6.3.2, “Try to minimize row and column sizes” for how HBase stores data internally.

It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other factor.
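
As a back-of-the-envelope sketch (the byte counts below follow the KeyValue layout described in Section 9.7.5.4 and ignore compression, block overhead and HDFS replication), the per-cell storage cost can be estimated like this:

public class KeyValueSizeEstimate {
  // Rough size of one KeyValue as stored in a StoreFile (approximation only).
  static long keyValueSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
    long keyLen = 2 + rowLen      // 2-byte row length + rowkey
                + 1 + familyLen   // 1-byte family length + ColumnFamily name
                + qualifierLen    // column qualifier
                + 8               // timestamp
                + 1;              // key type
    return 4 + 4 + keyLen + valueLen;   // 4-byte key length + 4-byte value length prefixes
  }

  public static void main(String[] args) {
    // Example: 16-byte rowkey, 2-byte family name, 8-byte qualifier, 50-byte value
    System.out.println(keyValueSize(16, 2, 8, 50) + " bytes per cell (approx.)");
  }
}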

14.8.1.2. StoreFiles and Blocks

KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis. Blocks are aggregated into StoreFile's. 參考 Section 9.7, “Regions”.

14.8.1.3. HDFS Block Replication

Because HBase runs on top of HDFS, factor in HDFS block replication into storage calculations.
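
For example, with the default HDFS replication factor of 3, a table whose StoreFiles total roughly 2 TB will occupy on the order of 6 TB of raw HDFS capacity (before any compression is taken into account).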

14.8.2. 區域

Another common question for HBase administrators is determining the right number of regions per RegionServer. This affects both storage and hardware planning. 參考 Section 11.4.1, “Number of Regions”.

15.2. IDEs

15.2.1. Eclipse

15.2.1.1. Code Formatting

參考 HBASE-3678 Add Eclipse-based Apache Formatter to HBase Wiki for an Eclipse formatter to help ensure your code conforms to HBase'y coding convention. The issue includes instructions for loading the attached formatter.

In addition to the automatic formatting, make sure you follow the style guidelines explained in Section 15.10.5, “Common Patch Feedback”

Also, no @author tags - that's a rule. Quality Javadoc comments are appreciated. And include the Apache license.

15.2.1.2. Subversive Plugin

Download and install the Subversive plugin.

Set up an SVN Repository target from Section 15.1.1, “SVN”, then check out the code.

15.2.1.3. Git Plugin

If you cloned the project via git, download and install the Git plugin (EGit). Attach to your local git repo (via the Git Repositories window) and you'll be able to see file revision history, generate patches, etc.

15.2.1.4. HBase Project Setup in Eclipse

The easiest way is to use the m2eclipse plugin for Eclipse. Eclipse Indigo or newer has m2eclipse built-in, or it can be found here: http://www.eclipse.org/m2e/. M2Eclipse provides Maven integration for Eclipse - it even lets you use the direct Maven commands from within Eclipse to compile and test your project.

To import the project, you merely need to go to File->Import...Maven->Existing Maven Projects and then point Eclipse at the HBase root directory; m2eclipse will automatically find all the hbase modules for you.

If you install m2eclipse and import HBase in your workspace, you will have to fix your eclipse Build Path. Remove the target folder, add the target/generated-jamon and target/generated-sources/java folders. You may also remove from your Build Path the exclusions on the src/main/resources and src/test/resources to avoid error message in the console 'Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (default) on project hbase: 'An Ant BuildException has occured: Replace: source file .../target/classes/hbase-default.xml doesn't exist'. This will also reduce the eclipse build cycles and make your life easier when developing.

15.2.1.5. Import into eclipse with the command line

For those not inclined to use m2eclipse, you can generate the Eclipse files from the command line. First, run (you should only have to do this once):

mvn clean install -DskipTests

and then close Eclipse and execute...

mvn eclipse:eclipse

... from your local HBase project directory in your workspace to generate some new .project and .classpath files. Then reopen Eclipse, and import the .project file in the HBase directory to a workspace.

15.2.1.6. Maven Classpath Variable

The M2_REPO classpath variable needs to be set up for the project. This needs to be set to your local Maven repository, which is usually ~/.m2/repository

If this classpath variable is not configured, you will see compile errors in Eclipse like this...
Description	Resource	Path	Location	Type The project cannot be built until build path errors are resolved	hbase	 Unknown	Java Problem Unbound classpath variable: 'M2_REPO/asm/asm/3.1/asm-3.1.jar' in project 'hbase'	hbase	 Build path	Build Path Problem Unbound classpath variable: 'M2_REPO/com/github/stephenc/high-scale-lib/high-scale-lib/1.1.1/high-scale-lib-1.1.1.jar' in project 'hbase'	hbase	 Build path	Build Path Problem Unbound classpath variable: 'M2_REPO/com/google/guava/guava/r09/guava-r09.jar' in project 'hbase'	hbase	 Build path	Build Path Problem Unbound classpath variable: 'M2_REPO/com/google/protobuf/protobuf-java/2.3.0/protobuf-java-2.3.0.jar' in project 'hbase'	hbase	 Build path	Build Path Problem Unbound classpath variable:

15.2.1.7. Eclipse Known Issues

Eclipse will currently complain about Bytes.java. It is not possible to turn these errors off.

Description	Resource	Path	Location	Type Access restriction: The method arrayBaseOffset(Class) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar	Bytes.java /hbase/src/main/java/org/apache/hadoop/hbase/util	line 1061	Java Problem Access restriction: The method arrayIndexScale(Class) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar Bytes.java	/hbase/src/main/java/org/apache/hadoop/hbase/util	line 1064	Java Problem Access restriction: The method getLong(Object, long) from the type Unsafe is not accessible due to restriction on required library /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Classes/classes.jar	Bytes.java /hbase/src/main/java/org/apache/hadoop/hbase/util	line 1111	Java Problem

15.2.1.8. Eclipse - More Information

For additional information on setting up Eclipse for HBase development on Windows, see Michael Morello's blog on the topic.

15.3. 創建HBase

This section will be of interest only to those building HBase from source.

15.3.1. Building in snappy compression support

Pass -Dsnappy to trigger the snappy maven profile for building snappy native libs into hbase.
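
For example, an illustrative invocation (the -Dsnappy flag can be appended to any of the build commands shown in this chapter):

mvn clean install -DskipTests -Dsnappy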

15.3.2. Building the HBase tarball

Do the following to build the HBase tarball. Passing the -Prelease will generate javadoc and run the RAT plugin to verify licenses on source.

% MAVEN_OPTS="-Xmx2g" mvn clean site install assembly:single -Dmaven.test.skip -Prelease

15.3.3. Adding an HBase release to Apache's Maven Repository

Follow the instructions at Publishing Maven Artifacts. The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly, making sure it is using the actual branch AND, before doing the mvn release:perform step, VERY IMPORTANT, checking and if necessary hand editing the release.properties file that was put under ${HBASE_HOME} by the previous step, release:prepare. You need to edit it to make it point at the right locations in SVN.

Use maven 3.0.x.

At the mvn release:perform step, before starting, if you are for example releasing hbase 0.92.0, you need to make sure the pom.xml version is 0.92.0-SNAPSHOT. This needs to be checked in. Since we do the maven release after actual release, I've been doing this checkin into a particular tag rather than into the actual release tag. So, say we released hbase 0.92.0 and now we want to do the release to the maven repository, in svn, the 0.92.0 release will be tagged 0.92.0. Making the maven release, copy the 0.92.0 tag to 0.92.0mvn. Check out this tag and change the version therein and commit.
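
An illustrative sketch of that tag copy (the repository URLs are assumptions; substitute the actual HBase SVN locations):

$ svn copy https://svn.apache.org/repos/asf/hbase/tags/0.92.0 \
    https://svn.apache.org/repos/asf/hbase/tags/0.92.0mvn -m "Copy 0.92.0 tag for the maven release"
$ svn checkout https://svn.apache.org/repos/asf/hbase/tags/0.92.0mvn
# then edit the pom.xml version therein to 0.92.0-SNAPSHOT and commit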

Here is how I'd answer the questions at release:prepare time:

What is the release version for "HBase"? (org.apache.hbase:hbase) 0.92.0: :
What is SCM release tag or label for "HBase"? (org.apache.hbase:hbase) hbase-0.92.0: : 0.92.0mvn
What is the new development version for "HBase"? (org.apache.hbase:hbase) 0.92.1-SNAPSHOT: :
[INFO] Transforming 'HBase'...

A strange issue I ran into was the one where the upload into the apache repository was being sprayed across multiple apache machines making it so I could not release. 參考 INFRA-4482 Why is my upload to mvn spread across multiple repositories?.

Here is my ~/.m2/settings.xml.

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                      http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <servers>
    <!-- To publish a snapshot of some part of Maven -->
    <server>
      <id>apache.snapshots.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
    <!-- To publish a website using Maven -->
    <!-- To stage a release of some part of Maven -->
    <server>
      <id>apache.releases.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_PASSWORD</password>
    </server>
  </servers>
  <profiles>
    <profile>
      <id>apache-release</id>
      <properties>
        <gpg.keyname>YOUR_KEYNAME</gpg.keyname>
        <!-- Keyname is something like this ... 00A5F21E ... do gpg --list-keys to find it -->
        <gpg.passphrase>YOUR_KEY_PASSWORD</gpg.passphrase>
      </properties>
    </profile>
  </profiles>
</settings>

When you run release:perform, pass -Papache-release else it will not 'sign' the artifacts it uploads.

If you run into the below, it's because you need to edit the version in the pom.xml and add -SNAPSHOT to it (and commit).

[INFO] Scanning for projects...
[INFO] Searching repository for plugin with prefix: 'release'.
[INFO] ------------------------------------------------------------------------
[INFO] Building HBase
[INFO]    task-segment: [release:prepare] (aggregator-style)
[INFO] ------------------------------------------------------------------------
[INFO] [release:prepare {execution: default-cli}]
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] You don't have a SNAPSHOT project in the reactor projects list.
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3 seconds
[INFO] Finished at: Sat Mar 26 18:11:07 PDT 2011
[INFO] Final Memory: 35M/423M
[INFO] ------------------------------------------------------------------------

15.3.4. Build Gotchas

If you see Unable to find resource 'VM_global_library.vm', ignore it. It's not an error. It is officially ugly though.

15.4. 發佈 hbase.apache.org 新版

Set up your apache credentials and the target site location locally in a place and form that maven can pick it up, in ~/.m2/settings.xml. See the ~/.m2/settings.xml example in Section 15.3.3, “Adding an HBase release to Apache's Maven Repository” above. Next, run the following:

$ mvn -DskipTests -Papache-release site site:deploy

You will be asked for your password. It can take a little time. Remember that it can take a few hours for your site changes to show up.

15.6. 測試

Developers, at a minimum, should familiarize themselves with the unit test detail; unit tests in HBase have a character not usually seen in other projects.

15.6.1. HBase 模塊

As of 0.96, HBase is split into multiple modules which creates "interesting" rules for how and where tests are written. If you are writing code for hbase-server, see Section 15.6.2, “Unit Tests” for how to write your tests; these tests can spin up a minicluster and will need to be categorized. For any other module, for example hbase-common, the tests must be strict unit tests and just test the class under test - no use of the HBaseTestingUtility or minicluster is allowed (or even possible given the dependency tree).

15.6.1.1. 在其他模塊中運行測試

If the module you are developing in has no other dependencies on other HBase modules, then you can cd into that module and just run:
mvn test
which will just run the tests IN THAT MODULE. If there are other dependencies on other modules, then you will have to run the command from the ROOT HBASE DIRECTORY. This will run the tests in the other modules, unless you specify to skip the tests in that module. For instance, to skip the tests in the hbase-server module, you would run:
mvn clean test -Dskip-server-tests
from the top level directory to run all the tests in modules other than hbase-server. Note that you can specify to skip tests in multiple modules as well as just for a single module. For example, to skip the tests in hbase-server and hbase-common, you would run:
mvn clean test -Dskip-server-tests -Dskip-common-tests

Also, keep in mind that if you are running tests in the hbase-server module you will need to apply the maven profiles discussed in Section 15.6.2.4, “Running tests” to get the tests to run properly.

15.6.2. 單元測試

HBase unit tests are subdivided into three categories: small, medium and large, with corresponding JUnit categories: SmallTests, MediumTests, LargeTests. JUnit categories are denoted using java annotations and look like this in your unit test code.

...
@Category(SmallTests.class)
public class TestHRegionInfo {
  @Test
  public void testCreateHRegionInfoName() throws Exception {
    // ...
  }
  ...
  @org.junit.Rule
  public org.apache.hadoop.hbase.ResourceCheckerJUnitRule cu =
    new org.apache.hadoop.hbase.ResourceCheckerJUnitRule();
}

The above example shows how to mark a test as belonging to the small category. The @org.junit.Rule lines on the end are also necessary. Add them to each new unit test file. They are needed by the categorization process. HBase uses a patched maven surefire plugin and maven profiles to implement its unit test characterizations.

15.6.2.1. Small Tests

Small tests are executed in a shared JVM. We put in this category all the tests that can be executed quickly in a shared JVM. The maximum execution time for a small test is 15 seconds, and small tests should not use a (mini)cluster.

15.6.2.2. Medium Tests

Medium tests represent tests that must be executed before proposing a patch. They are designed to run in less than 30 minutes altogether, and are quite stable in their results. They are designed to last less than 50 seconds individually. They can use a cluster, and each of them is executed in a separate JVM.

15.6.2.3. Large Tests

Large tests are everything else. They are typically integration-like tests, regression tests for specific bugs, timeout tests, performance tests. They are executed before a commit on the pre-integration machines. They can be run on the developer machine as well.

15.6.2.4. Running tests

Below we describe how to run the HBase junit categories.

15.6.2.4.1. Default: small and medium category tests

Running

mvn test

will execute all small tests in a single JVM (no fork) and then medium tests in a separate JVM for each test instance. Medium tests are NOT executed if there is an error in a small test. Large tests are NOT executed. There is one report for small tests, and one report for medium tests if they are executed. To run small and medium tests with the security profile enabled, do

mvn test -P security

15.6.2.4.2. Running all tests

Running

mvn test -P runAllTests

will execute small tests in a single JVM then medium and large tests in a separate JVM for each test. Medium and large tests are NOT executed if there is an error in a small test. Large tests are NOT executed if there is an error in a small or medium test. There is one report for small tests, and one report for medium and large tests if they are executed

15.6.2.4.3. Running a single test or all tests in a package

To run an individual test, e.g. MyTest, do

mvn test -P localTests -Dtest=MyTest

You can also pass multiple, individual tests as a comma-delimited list:

mvn test -P localTests -Dtest=MyTest1,MyTest2,MyTest3

You can also pass a package, which will run all tests under the package:

mvn test -P localTests -Dtest=org.apache.hadoop.hbase.client.*

To run a single test with the security profile enabled:

mvn test -P security,localTests -Dtest=TestGet

The -P localTests will remove the JUnit category effect (without this specific profile, the categories are taken into account). It will actually use the official release of surefire and the old connector (The HBase build uses a patched version of the maven surefire plugin). Each junit test class is executed in a separate JVM (a fork per test class). There is no parallelization when the localTests profile is set. You will see a new message at the end of the report: "[INFO] Tests are skipped". It's harmless.

15.6.2.4.4. Other test invocation permutations

Running

mvn test -P runSmallTests

will execute small tests only, in a single JVM.

Running

mvn test -P runMediumTests

will execute medium tests in a single JVM.

Running

mvn test -P runLargeTests

will execute large tests in a single JVM.

15.6.2.4.5. hbasetests.sh

It's also possible to use the script hbasetests.sh. This script runs the medium and large tests in parallel with two maven instances, and provides a single report. This script does not use the hbase version of surefire so no parallelization is being done other than the two maven instances the script sets up. It must be executed from the directory which contains the pom.xml.

For example running

./dev-support/hbasetests.sh

will execute small and medium tests. Running

./dev-support/hbasetests.sh runAllTests

will execute all tests. Running

./dev-support/hbasetests.sh replayFailed

will rerun the failed tests a second time, in a separate jvm and without parallelisation.

15.6.2.5. Writing Tests

15.6.2.5.1. General rules
  • As much as possible, tests should be written as category small tests.
  • All tests must be written to support parallel execution on the same machine, hence they should not use shared resources such as fixed ports or fixed file names.
  • Tests should not over-log. More than 100 lines/second makes the logs hard to read and uses I/O that is then not available to the other tests.
  • Tests can be written with HBaseTestingUtility. This class offers helper functions to create a temp directory and do the cleanup, or to start a cluster.
Categories and execution time
  • All tests must be categorized, if not they could be skipped.
  • All tests should be written to be as fast as possible.
  • Small category tests should last less than 15 seconds, and must not have any side effect.
  • Medium category tests should last less than 50 seconds.
  • Large category tests should last less than 3 minutes. This should ensure a good parallelization for people using it, and ease the analysis when the test fails.
15.6.2.5.2. Sleeps in tests

Whenever possible, tests should not use Thread.sleep, but rather wait for the real event they need. This is faster and clearer for the reader. Tests should not do a Thread.sleep without testing an ending condition, so that it is clear what the test is waiting for; moreover, the test will then work whatever the machine performance is. Sleeps should be minimal to be as fast as possible. Waiting for a variable should be done in a 40ms sleep loop. Waiting for a socket operation should be done in a 200 ms sleep loop.
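
As an illustration only, the wait-loop pattern described above might look like the following sketch (the helper and the condition are made up for the example, not an HBase API):

import static org.junit.Assert.fail;

// Sketch: poll the real ending condition in a short sleep loop instead of one
// long Thread.sleep. "condition" stands in for whatever event the test waits on.
private static void waitFor(java.util.concurrent.Callable<Boolean> condition, long timeoutMs)
    throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (!condition.call()) {
    if (System.currentTimeMillis() > deadline) {
      fail("condition not met within " + timeoutMs + " ms");
    }
    Thread.sleep(40);   // ~40 ms loop for a variable, ~200 ms for socket operations
  }
}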

15.6.2.5.3. Tests using a cluster

Tests using an HRegion do not have to start a cluster: a region can use the local file system. Starting/stopping a cluster costs around 10 seconds. Clusters should not be started per test method but per test class. A started cluster must be shut down using HBaseTestingUtility#shutdownMiniCluster, which cleans up the directories. As much as possible, tests should use the default settings for the cluster. When they don't, they should document it. This will make it possible to share the cluster later.

15.7. Maven Build Commands

All commands executed from the local HBase project directory.

Note: use Maven 3 (Maven 2 may work but we suggest you use Maven 3).

15.7.1. Compile

mvn compile

15.7.2. Running all or individual Unit Tests

參考 the Section 15.6.2.4, “Running tests” section above in Section 15.6.2, “Unit Tests”

15.7.3. Building against various hadoop versions.

As of 0.96, HBase supports building against hadoop versions: 1.0.3, 2.0.0-alpha and 3.0.0-SNAPSHOT. By default, we will build with Hadoop-1.0.3. To change the version to run with Hadoop-2.0.0-alpha, you would run:

mvn -Dhadoop.profile=2.0 ...

That is, pass hadoop.profile=2.0 to build against hadoop 2.0. Tests may not all pass as of this writing, so you may need to pass -DskipTests unless you are inclined to fix the failing tests.
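
For instance, a full build against hadoop 2.0 that skips the tests might look like this:

mvn clean install -DskipTests -Dhadoop.profile=2.0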

Similarly, for 3.0, you would just replace the profile value. Note that Hadoop-3.0.0-SNAPSHOT does not currently have a deployed maven artifact - you will need to build and install your own in your local maven repository if you want to run against this profile.

In earlier versions of HBase, you can build against older versions of hadoop, notably, Hadoop 0.22.x and 0.23.x. If you are running, for example, HBase 0.94 and want to build against Hadoop 0.23.x, you would run:

mvn -Dhadoop.profile=22 ...

15.8. Getting Involved

HBase gets better only when people contribute!

As HBase is an Apache Software Foundation project, see Appendix H, HBase and the Apache Software Foundation for more information about how the ASF functions.

15.8.1. Mailing Lists

Sign up for the dev-list and the user-list. 參考 the mailing lists page. Posing questions - and helping to answer other people's questions - is encouraged! There are varying levels of experience on both lists so patience and politeness are encouraged (and please stay on topic.)

15.8.2. Jira

Check for existing issues in Jira. If it's either a new feature request, enhancement, or a bug, file a ticket.

15.8.2.1. Jira Priorities

The following is a guideline on setting Jira issue priorities:

  • Blocker: Should only be used if the issue WILL cause data loss or cluster instability reliably.
  • Critical: The issue described can cause data loss or cluster instability in some cases.
  • Major: Important but not tragic issues, like updates to the client API that will add a lot of much-needed functionality or significant bugs that need to be fixed but that don't cause data loss.
  • Minor: Useful enhancements and annoying but not damaging bugs.
  • Trivial: Useful enhancements but generally cosmetic.

15.8.2.2. Code Blocks in Jira Comments

A commonly used macro in Jira is {code}. If you do this in a Jira comment...

{code} code snippet {code}

... Jira will format the code snippet like code, instead of a regular comment. It improves readability.

15.9. Developing

15.9.1. Codelines

Most development is done on TRUNK. However, there are branches for minor releases (e.g., 0.90.1, 0.90.2, and 0.90.3 are on the 0.90 branch).

If you have any questions on this just send an email to the dev dist-list.

15.9.2. Unit Tests

In HBase we use JUnit 4. If you need to run miniclusters of HDFS, ZooKeeper, HBase, or MapReduce testing, be sure to check out the HBaseTestingUtility. Alex Baranau of Sematext describes how it can be used in HBase Case-Study: Using HBaseTestingUtility for Local Testing and Development (2010).
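
For instance, a rough sketch of the per-test-class minicluster pattern (the class is invented for illustration; the category and import locations reflect the layout of this era and should be treated as assumptions):

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.MediumTests;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.experimental.categories.Category;

@Category(MediumTests.class)
public class TestWithMiniCluster {
  private static final HBaseTestingUtility UTIL = new HBaseTestingUtility();

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    UTIL.startMiniCluster();        // spins up HDFS, ZooKeeper and HBase in one JVM
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    UTIL.shutdownMiniCluster();     // also cleans up the temp directories it created
  }

  @Test
  public void testSomethingAgainstTheMiniCluster() throws Exception {
    // e.g. UTIL.createTable(...), then exercise the client API against it
  }
}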

15.9.2.1. Mockito

Sometimes you don't need a full running server to unit test. For example, some methods can make do with an org.apache.hadoop.hbase.Server instance or an org.apache.hadoop.hbase.master.MasterServices Interface reference rather than a full-blown org.apache.hadoop.hbase.master.HMaster. In these cases, you may be able to get away with a mocked Server instance. For example:

TODO...
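
Until a real example lands here, the following sketch illustrates the idea with Mockito (the test class is invented for illustration; treat the exact API details as assumptions):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.Server;
import org.junit.Test;

public class TestWithMockedServer {
  @Test
  public void testUsingMockedServer() throws Exception {
    // Hand the code under test a mocked Server rather than a full-blown HMaster.
    Server server = mock(Server.class);
    Configuration conf = HBaseConfiguration.create();
    when(server.getConfiguration()).thenReturn(conf);
    // ... pass "server" into the method under test and assert on the result ...
  }
}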

15.9.3. Code Standards

參考 Section 15.2.1.1, “Code Formatting” and Section 15.10.5, “Common Patch Feedback”.

Also, please pay attention to the interface stability/audience classifications that you will see all over our code base. They look like this at the head of the class:

@InterfaceAudience.Public @InterfaceStability.Stable

If the InterfaceAudience is Private, we can change the class (and we do not need to include an InterfaceStability mark). If a class is marked Public but its InterfaceStability is marked Unstable, we can change it. If it's marked Public/Evolving, we're allowed to change it but should try not to. If it's Public and Stable, we can't change it without a deprecation path or a really GREAT reason.

When you add new classes, mark them with the annotations above if publicly accessible. If you are not clear on how to mark your additions, ask up on the dev list.

This convention comes from our parent project Hadoop.

15.9.4. Running In-Situ

If you are developing HBase, frequently it is useful to test your changes against a more-real cluster than what you find in unit tests. In this case, HBase can be run directly from the source in local-mode. All you need to do is run:

${HBASE_HOME}/bin/start-hbase.sh

This will spin up a full local-cluster, just as if you had packaged up HBase and installed it on your machine.

Keep in mind that you will need to have installed HBase into your local maven repository for the in-situ cluster to work properly. That is, you will need to run:

mvn clean install -DskipTests

to ensure that maven can find the correct classpath and dependencies. Generally, the above command is just a good thing to try running first, if maven is acting oddly.

15.10. Submitting Patches

15.10.1. Create Patch

Patch files can be easily generated from Eclipse, for example by selecting "Team -> Create Patch". Patches can also be created by git diff and svn diff.

Please submit one patch-file per Jira. For example, if multiple files are changed make sure the selected resource when generating the patch is a directory. Patch files can reflect changes in multiple files.

Make sure you review Section 15.2.1.1, “Code Formatting” for code style.

15.10.2. Patch File Naming

The patch file should have the HBase Jira ticket in the name. For example, if a patch was submitted for Foo.java, then a patch file called Foo_HBASE_XXXX.patch would be acceptable where XXXX is the HBase Jira number.

If you are generating from a branch, then including the target branch in the filename is advised, e.g., HBASE-XXXX-0.90.patch.
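
For example, from a git checkout of the 0.90 branch, generating and naming the patch could look like this (XXXX is a placeholder for the Jira number):

$ git diff > HBASE-XXXX-0.90.patch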

15.10.3. Unit Tests

Yes, please. Please try to include unit tests with every code patch (and especially new classes and large changes). Make sure unit tests pass locally before submitting the patch.

Also, see Section 15.9.2.1, “Mockito”.

If you are creating a new unit test class, notice how other unit test classes have classification/sizing annotations at the top and a static method on the end. Be sure to include these in any new unit test files you generate. 參考 Section 15.6, “測試” for more on how the annotations work.

15.10.4. Attach Patch to Jira

The patch should be attached to the associated Jira ticket "More Actions -> Attach Files". Make sure you click the ASF license inclusion, otherwise the patch can't be considered for inclusion.

Once attached to the ticket, click "Submit Patch" and the status of the ticket will change. Committers will review submitted patches for inclusion into the codebase. Please understand that not every patch may get committed, and that feedback will likely be provided on the patch. Fear not, though, because the HBase community is helpful!

15.10.5. Common Patch Feedback

The following items are representative of common patch feedback. Your patch process will go faster if these are taken into account before submission.

參考 the Java coding standards for more information on coding conventions in Java.

15.10.5.1. Space Invaders

Rather than do this...

if ( foo.equals( bar ) ) { // don't do this

... do this instead...

if (foo.equals(bar)) {

Also, rather than do this...

foo = barArray[ i ]; // don't do this

... do this instead...

foo = barArray[i];

15.10.5.2. Auto Generated Code

Auto-generated code in Eclipse often looks like this...

public void readFields(DataInput arg0) throws IOException {   // don't do this
  foo = arg0.readUTF();                                        // don't do this

... do this instead ...

public void readFields(DataInput di) throws IOException {
  foo = di.readUTF();

參考 the difference? 'arg0' is what Eclipse uses for arguments by default.

15.10.5.3. Long Lines

Keep lines less than 100 characters.

Bar bar = foo.veryLongMethodWithManyArguments(argument1, argument2, argument3, argument4, argument5, argument6, argument7, argument8, argument9); // don't do this

... do something like this instead ...

Bar bar = foo.veryLongMethodWithManyArguments(
  argument1, argument2, argument3, argument4, argument5,
  argument6, argument7, argument8, argument9);

15.10.5.4. Trailing Spaces

This happens more than people would imagine.

Bar bar = foo.getBar(); <--- imagine there's an extra space(s) after the semicolon instead of a line break.

Make sure there's a line-break after the end of your code, and also avoid lines that have nothing but whitespace.

15.10.5.5. Implementing Writable

Every class returned by RegionServers must implement Writable. If you are creating a new class that needs to implement this interface, don't forget the default constructor.
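
For illustration, a minimal Writable with the required default constructor might look like this (the class and field are invented for the example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FooMessage implements Writable {
  private String foo = "";

  public FooMessage() {
    // default constructor: required so the RPC layer can instantiate the class and call readFields()
  }

  public FooMessage(String foo) {
    this.foo = foo;
  }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(foo);
  }

  public void readFields(DataInput in) throws IOException {
    foo = in.readUTF();
  }
}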

15.10.5.6. Javadoc

This is also a very common feedback item. Don't forget Javadoc!

15.10.5.7. Javadoc - Useless Defaults

Don't just leave the @param arguments the way your IDE generated them. Don't do this...

/**
 *
 * @param bar             <---- don't do this!!!!
 * @return                <---- or this!!!!
 */
public Foo getFoo(Bar bar);

... either add something descriptive to the @param and @return lines, or just remove them. But the preference is to add something descriptive and useful.
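
For example, the same signature with descriptive javadoc (the descriptions are invented for illustration):

/**
 * Looks up the Foo associated with the given Bar.
 * @param bar the bar to resolve; must not be null
 * @return the matching Foo, or null if none exists
 */
public Foo getFoo(Bar bar);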

15.10.5.8. One Thing At A Time, Folks

If you submit a patch for one thing, don't do auto-reformatting or unrelated reformatting of code on a completely different area of code.

Likewise, don't add unrelated cleanup or refactorings outside the scope of your Jira.

15.10.5.9. Ambiguous Unit Tests

Make sure that you're clear about what you are testing in your unit tests and why.

15.10.6. ReviewBoard

Larger patches should go through ReviewBoard.

For more information on how to use ReviewBoard, see the ReviewBoard documentation.

15.10.7. Committing Patches

Committers do this. 參考 How To Commit in the HBase wiki.

Committers will also resolve the Jira, typically after the patch passes a build.

Chapter 16. ZooKeeper

一個分佈式運行的HBase依賴一個zookeeper集羣。所有的節點和客戶端都必須能夠訪問zookeeper。默認的情況下HBase會管理一個zookeeper集羣。這個集羣會隨着HBase的啓動而啓動。當然,你也可以自己管理一個zookeeper集羣,但需要配置HBase。你需要修改conf/hbase-env.sh裏面的HBASE_MANAGES_ZK 來切換。這個值默認是true的,作用是讓HBase啓動的時候同時也啓動zookeeper.

當HBase管理zookeeper的時候,你可以通過修改zoo.cfg來配置zookeeper,一個更加簡單的方法是在 conf/hbase-site.xml裏面修改zookeeper的配置。Zookeeper的配置是作爲property寫在 hbase-site.xml裏面的,選項名以 hbase.zookeeper.property. 爲前綴。打個比方, clientPort 配置在xml裏面的名字是 hbase.zookeeper.property.clientPort。所有的默認值都是HBase決定的,包括zookeeper,參見 Section 2.6.1.1, “HBase 默認配置”。可以查找 hbase.zookeeper.property 前綴,找到關於zookeeper的配置。[33]

對於zookeeper的配置,你至少要在 hbase-site.xml中列出zookeeper的ensemble servers,具體的字段是 hbase.zookeeper.quorum。這個字段的默認值是 localhost,這個值對於分佈式應用顯然是不可以的(遠程連接無法使用)。

我需要運行幾個zookeeper?

你運行一個zookeeper也是可以的,但是在生產環境中,你最好部署3,5,7個節點。部署的越多,可靠性就越高,當然只能部署奇數個,偶數個是不可以的。你需要給每個zookeeper 1G左右的內存,如果可能的話,最好有獨立的磁盤。 (獨立磁盤可以確保zookeeper是高性能的。).如果你的集羣負載很重,不要把Zookeeper和RegionServer運行在同一臺機器上面。就像DataNodes 和 TaskTrackers一樣

舉個例子,HBase管理着的ZooKeeper集羣在節點 rs{1,2,3,4,5}.example.com, 監聽2222 端口(默認是2181),並確保conf/hbase-env.sh文件中 HBASE_MANAGES_ZK的值是true ,再編輯 conf/hbase-site.xml 設置 hbase.zookeeper.property.clientPort 和 hbase.zookeeper.quorum。你還可以設置 hbase.zookeeper.property.dataDir屬性來把ZooKeeper保存數據的目錄地址改掉。默認值是 /tmp ,這裏在重啓的時候會被操作系統刪掉,可以把它修改到 /usr/local/zookeeper.

<configuration>
  ...
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2222</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The port at which the clients will connect.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
    By default this is set to localhost for local and pseudo-distributed modes
    of operation. For a fully-distributed setup, this should be set to a full
    list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
    this is the list of servers which we will start/stop ZooKeeper on.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  ...
</configuration>

16.1. 使用現有的 ZooKeeper 集羣

讓HBase使用一個現有的不被HBase託管的Zookeeper集羣,需要設置 conf/hbase-env.sh文件中的HBASE_MANAGES_ZK 屬性爲 false

  ...
  # Tell HBase whether it should manage it's own instance of Zookeeper or not.
  export HBASE_MANAGES_ZK=false

接下來,指明Zookeeper的host和端口。可以在 hbase-site.xml中設置, 也可以在HBase的CLASSPATH下面加一個zoo.cfg配置文件。 HBase 會優先加載 zoo.cfg 裏面的配置,把hbase-site.xml裏面的覆蓋掉.

當HBase託管ZooKeeper的時候,Zookeeper集羣的啓動是HBase啓動腳本的一部分。但現在,你需要自己去運行。你可以這樣做

${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper

你也可以用這條命令單獨啓動ZooKeeper而不啓動HBase。如果你想讓ZooKeeper在HBase重啓時繼續運行(即HBase關閉時不把ZooKeeper一起停掉),請確保 HBASE_MANAGES_ZK 的值是 false。

對於獨立運行的Zookeeper集羣的問題,你可以在 ZooKeeper 入門指南(Getting Started Guide)得到幫助。

16.2.  通過ZooKeeper 的SASL 認證

新版 HBase (>= 0.92)將支持連接到 ZooKeeper Quorum 進行SASL 認證。 ( Zookeeper 3.4.0 以上可用).

這裏描述如何設置 HBase,以同 ZooKeeper Quorum實現互相認證. ZooKeeper/HBase 互相認證 (HBASE-2418) 是 HBase安全配置所必不可少的一部分 (HBASE-3025)。 爲簡化說明, 本節忽略所必需的額外配置 ( HDFS 安全和 Coprocessor 配置)。 推薦使用HBase內置 Zookeeper 配置 (相對獨立 Zookeeper quorum) 以簡化學習.

16.2.1. 操作系統預置

需要一個可用的 Kerberos KDC 配置。每個運行 ZooKeeper 服務器的 $HOST,都應該有一個 principal:zookeeper/$HOST。對每個這樣的主機,爲 zookeeper/$HOST 添加一個 service key(使用 kadmin 或 kadmin.local 工具的 ktadd 命令),將該 keytab 文件複製到 $HOST,並設置爲僅對在該 $HOST 上運行 zookeeper 的用戶可讀。記下文件位置,我們將在下面以 $PATH_TO_ZOOKEEPER_KEYTAB 引用它。
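
下面是一個示意性的操作過程(主機名與 keytab 路徑均爲假設值,僅作說明):

kadmin: addprinc -randkey zookeeper/host1.example.com
kadmin: ktadd -k /etc/zookeeper/zookeeper.keytab zookeeper/host1.example.com

生成的 zookeeper.keytab 即下文的 $PATH_TO_ZOOKEEPER_KEYTAB;複製到對應的 $HOST 後,將其設置爲僅運行 zookeeper 的用戶可讀。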

Similarly, for each $HOST that will run an HBase server (master or regionserver), you should have a principle: hbase/$HOST. For each host, add a keytab file called hbase.keytab containing a service key for hbase/$HOST, copy this file to $HOST, and make it readable only to the user that will run an HBase service on $HOST. Note the location of this file, which we will use below as $PATH_TO_HBASE_KEYTAB.

Each user who will be an HBase client should also be given a Kerberos principal. This principal should usually have a password assigned to it (as opposed to, as with the HBase servers, a keytab file) which only this user knows. The client's principal's maxrenewlife should be set so that it can be renewed enough so that the user can complete their HBase client processes. For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within kadmin with: addprinc -maxrenewlife 3days. The Zookeeper client and server libraries manage their own ticket refreshment by running threads that wake up periodically to do the refreshment.

On each host that will run an HBase client (e.g. hbase shell), add the following file to the HBase home directory's conf directory:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=false
  useTicketCache=true;
};

We'll refer to this JAAS configuration file as $CLIENT_CONF below.

16.2.2. HBase內置的 Zookeeper 配置

在每個要運行 zookeeper、master 或 regionserver 的節點上,在 HBASE_HOME 的 conf 目錄中創建如下所示的 JAAS 配置文件:

Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
  storeKey=true
  useTicketCache=false
  principal="zookeeper/$HOST";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="$PATH_TO_HBASE_KEYTAB"
  principal="hbase/$HOST";
};
其中 $PATH_TO_HBASE_KEYTAB 和 $PATH_TO_ZOOKEEPER_KEYTAB 是上面創建的 keytab 文件,$HOST 是該節點的主機名。

Server 節由 Zookeeper quorum 服務器使用,Client 節由 HBase master 和 regionserver 使用。The path to this file should be substituted for the text $HBASE_SERVER_CONF in the hbase-env.sh listing below.

The path to this file should be substituted for the text $CLIENT_CONF in the hbase-env.sh listing below.

Modify your hbase-env.sh to include the following:

export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=true
export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
where $HBASE_SERVER_CONF and $CLIENT_CONF are the full paths to the JAAS configuration files created above.

Modify your hbase-site.xml on each node that will run zookeeper, master or regionserver to contain:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>$ZK_NODES</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.authProvider.1</name>
    <value>org.apache.zookeeper.server.auth.SASLAuthenticationProvider</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.kerberos.removeHostFromPrincipal</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.kerberos.removeRealmFromPrincipal</name>
    <value>true</value>
  </property>
</configuration>

where $ZK_NODES is the comma-separated list of hostnames of the Zookeeper Quorum hosts.

Start your hbase cluster by running one or more of the following set of commands on the appropriate hosts:

bin/hbase zookeeper start
bin/hbase master start
bin/hbase regionserver start

16.2.3. 外部 Zookeeper 配置

增加 JAAS 配置文件:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="$PATH_TO_HBASE_KEYTAB"
  principal="hbase/$HOST";
};

$PATH_TO_HBASE_KEYTAB 是上面爲在本主機運行的 HBase 服務創建的 keytab,$HOST 是該節點的 hostname。將該配置文件放到 HBase home 的配置目錄,下面以 $HBASE_SERVER_CONF 引用它的全路徑。

修改 hbase-env.sh 增加如下項:

export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=false
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"

修改每個節點的 hbase-site.xml 包含下面內容:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>$ZK_NODES</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>

$ZK_NODES是逗號分隔的Zookeeper Quorum主機名列表。

每個Zookeeper Quorum節點增加 zoo.cfg 包含下列內容:

authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

在每個主機創建 JAAS 配置幷包含:

Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
  storeKey=true
  useTicketCache=false
  principal="zookeeper/$HOST";
};

$HOST 是每個 Quorum 節點的主機名。我們會在下面以 $ZK_SERVER_CONF 引用本文件的全路徑。

在每個 Zookeeper Quorum主機啓動 Zookeeper:

SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start

啓動HBase 集羣。在適當的節點運行下面的一到多個命令:

bin/hbase master start
bin/hbase regionserver start

16.2.4. Zookeeper 服務端認證 Log

如果上面配置成功, 你應該可以看到如下所示的Zookeeper 服務器 log:

11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011
..
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Successfully authenticated client: authenticationID=hbase/[email protected]; authorizationID=hbase/[email protected].
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase
11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase

16.2.5. Zookeeper 客戶端認證 Log

Zookeeper 客戶端側 (HBase master 或 regionserver), 應該可以看到如下所示的東西:

11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.166.175.249:2181
11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249
11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, initiating session
11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, sessionid = 0x134106594320000, negotiated timeout = 180000

16.2.6. 從頭開始配置

在當前標準的 Amazon Linux AMI 上測試通過。先按上面的描述配置 KDC 並創建 principal。然後獲取代碼並執行檢測:
git clone git://git.apache.org/hbase.git
cd hbase
mvn -Psecurity,localTests clean test -Dtest=TestZooKeeperACL
再按上面描述配置 HBase。手工編輯 target/cached_classpath.txt(參見下面 16.2.7.1 節),然後啓動:
bin/hbase zookeeper &
bin/hbase master &
bin/hbase regionserver &

16.2.7. 未來改進

16.2.7.1. 完善 target/cached_classpath.txt

必須在 target/cached_classpath.txt 重寫標準 hadoop-core jar 文件爲包含 HADOOP-7070 修改的版本。可以使用下列腳本完成:

echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt
mv target/tmp.txt target/cached_classpath.txt

16.2.7.2. 用程序配置 JAAS

這樣可以避免需要一個單獨的、包含 HADOOP-7070 修復的 Hadoop jar 文件。

16.2.7.3. 排除 kerberos.removeHostFromPrincipal 和 kerberos.removeRealmFromPrincipal

[33] For the full list of ZooKeeper configurations, see ZooKeeper's zoo.cfg. HBase does not ship with a zoo.cfg so you will need to browse the conf directory in an appropriate ZooKeeper download.

Chapter 17. Community

17.1. Decisions

17.1.1. Feature Branches

Feature Branches are easy to make. You do not have to be a committer to make one. Just request that the name of your branch be added to JIRA up on the developer's mailing list and a committer will add it for you. Thereafter you can file issues against your feature branch in Apache HBase (TM) JIRA. You keep your code elsewhere -- it should be public so it can be observed -- and you can update the dev mailing list on progress. When the feature is ready for commit, 3 +1s from committers will get your feature merged.[34]

17.1.2. Patch +1 Policy

The below policy is something we put in place 09/2012. It is a suggested policy rather than a hard requirement. We want to try it first to see if it works before we cast it in stone.

Apache HBase is made of components. Components have one or more Section 17.2.1, “Component Owner”s. See the 'Description' field on the components JIRA page for who the current owners are by component.

Patches that fit within the scope of a single Apache HBase component require, at least, a +1 by one of the component's owners before commit. If owners are absent -- busy or otherwise -- two +1s by non-owners will suffice.

Patches that span components need at least two +1s before they can be committed, preferably +1s by owners of components touched by the x-component patch (TODO: This needs tightening up but I think fine for first pass).

Any -1 on a patch by anyone vetoes a patch; it cannot be committed until the justification for the -1 is addressed.

17.2. Community Roles

17.2.1. Component Owner

Component owners are listed in the description field on this Apache HBase JIRA components page. The owners are listed in the 'Description' field rather than in the 'Component Lead' field because the latter only allows us to list one individual whereas it is encouraged that components have multiple owners.

Owners are volunteers who are (usually, but not necessarily) expert in their component domain and may have an agenda on how they think their Apache HBase component should evolve.

Duties include:

  1. Owners will try and review patches that land within their component's scope.

  2. If applicable, if an owner has an agenda, they will publish their goals or the design toward which they are driving their component

If you would like to volunteer as a component owner, just write the dev list and we'll sign you up. Owners do not need to be committers.

 

Appendix A. FAQ

A.1. General
When should I use HBase?
Are there other HBase FAQs?
Does HBase support SQL?
How can I find examples of NoSQL/HBase?
What is the history of HBase?
A.2. Architecture
How does HBase handle Region-RegionServer assignment and locality?
A.3. Configuration
How can I get started with my first cluster?
Where can I learn about the rest of the configuration options?
A.4. Schema Design / Data Access
How should I design my schema in HBase?
How can I store (fill in the blank) in HBase?
How can I handle secondary indexes in HBase?
Can I change a table's rowkeys?
What APIs does HBase support?
A.5. MapReduce
How can I use MapReduce with HBase?
A.6. Performance and Troubleshooting
How can I improve HBase cluster performance?
How can I troubleshoot my HBase cluster?
A.7. Amazon EC2
I am running HBase on Amazon EC2 and...
A.8. Operations
How do I manage my HBase cluster?
How do I back up my HBase cluster?
A.9. HBase in Action
Where can I find interesting videos and presentations on HBase?

A.1. General

When should I use HBase?
Are there other HBase FAQs?
Does HBase support SQL?
How can I find examples of NoSQL/HBase?
What is the history of HBase?

When should I use HBase?

 

參考 the Section 9.1, “概述” in the Architecture chapter.

Are there other HBase FAQs?

 

參考 the FAQ that is up on the wiki, HBase Wiki FAQ.

Does HBase support SQL?

 

Not really. SQL-ish support for HBase via Hive is in development, however Hive is based on MapReduce which is not generally suitable for low-latency requests. 參考 the Chapter 5, Data Model section for examples on the HBase client.

How can I find examples of NoSQL/HBase?

 

參考 the link to the BigTable paper in Appendix F, Other Information About HBase in the appendix, as well as the other papers.

What is the history of HBase?

 

參考 Appendix G, HBase History.

A.2. Architecture

How does HBase handle Region-RegionServer assignment and locality?

How does HBase handle Region-RegionServer assignment and locality?

 

參考 Section 9.7, “Regions”.

A.3. Configuration

How can I get started with my first cluster?
Where can I learn about the rest of the configuration options?

How can I get started with my first cluster?

 

參考 Section 1.2, “Quick Start”.

Where can I learn about the rest of the configuration options?

 

參考 Chapter 2, Configuration.

A.4. Schema Design / Data Access

How should I design my schema in HBase?
How can I store (fill in the blank) in HBase?
How can I handle secondary indexes in HBase?
Can I change a table's rowkeys?
What APIs does HBase support?

How should I design my schema in HBase?

 

參考 Chapter 5, Data Model and Chapter 6, HBase and Schema Design

How can I store (fill in the blank) in HBase?

 

參考 Section 6.5, “ Supported Datatypes ”.

How can I handle secondary indexes in HBase?

 

參考 Section 6.9, “ Secondary Indexes and Alternate Query Paths ”

Can I change a table's rowkeys?

 

This is a very common question. You can't. 參考 Section 6.3.5, “Immutability of Rowkeys”.

What APIs does HBase support?

 

參考 Chapter 5, Data Model, Section 9.3, “Client” and Section 10.1, “非Java 語言和 JVM 通話”.

A.5. MapReduce

How can I use MapReduce with HBase?

How can I use MapReduce with HBase?

 

參考 Chapter 7, HBase and MapReduce

A.6. Performance and Troubleshooting

How can I improve HBase cluster performance?
How can I troubleshoot my HBase cluster?

How can I improve HBase cluster performance?

 

參考 Chapter 11, Performance Tuning.

How can I troubleshoot my HBase cluster?

 

參考 Chapter 12, Troubleshooting and Debugging HBase.

A.7. Amazon EC2

I am running HBase on Amazon EC2 and...

I am running HBase on Amazon EC2 and...

 

EC2 issues are a special case. 參考 Troubleshooting Section 12.12, “Amazon EC2” and Performance Section 11.11, “Amazon EC2” sections.

A.8. Operations

How do I manage my HBase cluster?
How do I back up my HBase cluster?

How do I manage my HBase cluster?

 

參考 Chapter 14, HBase Operational Management

How do I back up my HBase cluster?

 

參考 Section 14.7, “HBase Backup”

A.9. HBase in Action

Where can I find interesting videos and presentations on HBase?

Where can I find interesting videos and presentations on HBase?

 

參考 Appendix F, Other Information About HBase

Appendix B. hbck In Depth

HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency identifying mode and a multi-phase read-write repair mode.

B.1. Running hbck to identify inconsistencies

To check to see if your HBase cluster has corruptions, run hbck against your HBase cluster:
$ ./bin/hbase hbck

At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES present. You may also want to run hbck a few times because some inconsistencies can be transient (e.g. cluster is starting up or a region is splitting). Operationally you may want to run hbck regularly and set up an alert (e.g. via nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected. Using the -details option will report more details including a representative listing of all the splits present in all the tables.

$ ./bin/hbase hbck -details

B.2. Inconsistencies

If after several runs, inconsistencies continue to be reported, you may have encountered a corruption. These should be rare, but in the event they occur newer versions of HBase include the hbck tool enabled with automatic repair options.

There are two invariants that when violated create inconsistencies in HBase:

  • HBase’s region consistency invariant is satisfied if every region is assigned and deployed on exactly one region server, and all places where this state is kept are in accordance.
  • HBase’s table integrity invariant is satisfied if for each table, every possible row key resolves to exactly one region.

Repairs generally work in three phases -- a read-only information gathering phase that identifies inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then finally a region consistency repair phase that restores the region consistency invariant. Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset of possible table integrity problems. It also included the ability to automatically fix the most common inconsistency, region assignment and deployment consistency problems. This repair could be done by using the -fix command line option. These fixes close regions if they are open on the wrong server or on multiple region servers and also assign regions to region servers if they are not open.

Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are introduced to aid repairing a corrupted HBase. This hbck sometimes goes by the nickname “uberhbck”. Each particular version of uber hbck is compatible with HBase releases of the same major version (the 0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions <=0.92.1 may require restarting the master or failing over to a backup master.

B.3. Localized repairs

When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first. These are generally region consistency repairs -- localized single region repairs, that only modify in-memory data, ephemeral zookeeper data, or patch holes in the META table. Region consistency requires that the HBase instance has the state of the region’s data in HDFS (.regioninfo files), the region’s row in the .META. table, and the region’s deployment/assignments on region servers and the master in accordance. Options for repairing region consistency include:

  • -fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly assigned or multiply assigned regions.
  • -fixMeta which removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if the regions are present in HDFS but not in META.

To fix deployment and assignment problems you can run this command:

$ ./bin/hbase hbck -fixAssignments
To fix deployment and assignment problems as well as repairing incorrect meta rows you can run this command:.
$ ./bin/hbase hbck -fixAssignments -fixMeta
There are a few classes of table integrity problems that are low risk repairs. The first two are degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is hdfs region holes. This can be repaired by using the:
  • -fixHdfsHoles option for fabricating new empty regions on the file system. If holes are detected you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent.
$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
Since this is a common operation, we’ve added the -repairHoles flag that is equivalent to the previous command:
$ ./bin/hbase hbck -repairHoles
If inconsistencies still remain after these steps, you most likely have table integrity problems related to orphaned or overlapping regions.

B.4. Region Overlap Repairs

Table integrity problems can require repairs that deal with overlaps. This is a riskier operation because it requires modifications to the file system, requires some decision making, and may require some manual steps. For these repairs it is best to analyze the output of a hbck -details run so that you isolate repair attempts to only the problems the checks identify. Because this is riskier, there are safeguards that should be used to limit the scope of the repairs. WARNING: These options are relatively new and have only been tested on online but idle HBase instances (no reads/writes). Use at your own risk in an active production environment! The options for repairing table integrity violations include:
  • -fixHdfsOrphans option for “adopting” a region directory that is missing a region metadata file (the .regioninfo file).
  • -fixHdfsOverlaps ability for fixing overlapping regions
When repairing overlapping regions, a region’s data can be modified on the file system in two ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to a “sideline” directory where data could be restored later. Merging a large number of regions is technically correct but could result in an extremely large region that requires a series of costly compactions and splitting operations. In these cases, it is probably better to sideline the regions that overlap with the most other regions (likely the largest ranges) so that merges can happen on a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native directory and HFile format, they can be restored by using HBase’s bulk load mechanism. The default safeguard thresholds are conservative. These options let you override the default thresholds and enable the large region sidelining feature.
  • -maxMerge <n> maximum number of overlapping regions to merge
  • -sidelineBigOverlaps if more than maxMerge regions are overlapping, attempt to sideline the regions overlapping with the most other regions.
  • -maxOverlapsToSideline <n> if sidelining large overlapping regions, sideline at most n regions.
Since often times you would just want to get the tables repaired, you can use this option to turn on all repair options:
  • -repair includes all the region consistency options and only the hole repairing table integrity options.
Finally, there are safeguards to limit repairs to only specific tables. For example the following command would only attempt to repair table TableFoo and TableBar.
$ ./bin/hbase hbck -repair TableFoo TableBar

B.4.1. Special cases: Meta is not properly assigned

There are a few special cases that hbck can handle as well. Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case there is a special -fixMetaOnly option that can try to fix meta assignments.
$ ./bin/hbase hbck -fixMetaOnly -fixAssignments

B.4.2. Special cases: HBase version file is missing

HBase’s data on the file system requires a version file in order to start. If this file is missing, you can use the -fixVersionFile option to fabricate a new HBase version file. This assumes that the version of hbck you are running is the appropriate version for the HBase cluster.

B.4.3. Special case: Root and META are corrupt.

The most drastic corruption scenario is the case where the ROOT or META is corrupted and HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT and META regions and tables. This tool assumes that HBase is offline. It then marches through the existing HBase home directory, loads as much information from region metadata files (.regioninfo files) as possible from the file system. If the region metadata has proper table integrity, it sidelines the original root and meta table directories, and builds new ones with pointers to the region directories and their data.
$ ./bin/hbase org.apache.hadoop.hbase.util.OfflineMetaRepair
NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck can complete. If the tool succeeds you should be able to start hbase and run online repairs if necessary.

B.2. Inconsistencies

If after several runs, inconsistencies continue to be reported, you may have encountered a corruption. These should be rare, but in the event they occur newer versions of HBase include the hbck tool enabled with automatic repair options.

There are two invariants that when violated create inconsistencies in HBase:

  • HBase’s region consistency invariant is satisfied if every region is assigned and deployed on exactly one region server, and all places where this state kept is in accordance.
  • HBase’s table integrity invariant is satisfied if for each table, every possible row key resolves to exactly one region.

Repairs generally work in three phases -- a read-only information gathering phase that identifies inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then finally a region consistency repair phase that restores the region consistency invariant. Starting from version 0.90.0, hbck could detect region consistency problems report on a subset of possible table integrity problems. It also included the ability to automatically fix the most common inconsistency, region assignment and deployment consistency problems. This repair could be done by using the -fix command line option. These problems close regions if they are open on the wrong server or on multiple region servers and also assigns regions to region servers if they are not open.

Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are introduced to aid repairing a corrupted HBase. This hbck sometimes goes by the nickname “uberhbck”. Each particular version of uber hbck is compatible with the HBase’s of the same major version (0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions <=0.92.1 may require restarting the master or failing over to a backup master.

B.3. Localized repairs

When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first. These are generally region consistency repairs -- localized single region repairs, that only modify in-memory data, ephemeral zookeeper data, or patch holes in the META table. Region consistency requires that the HBase instance has the state of the region’s data in HDFS (.regioninfo files), the region’s row in the .META. table., and region’s deployment/assignments on region servers and the master in accordance. Options for repairing region consistency include:

  • -fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly assigned or multiply assigned regions.
  • -fixMeta which removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if they regions are present in HDFS while not in META.

To fix deployment and assignment problems you can run this command:

$ ./bin/hbase hbck -fixAssignments
To fix deployment and assignment problems as well as repairing incorrect meta rows you can run this command:.
$ ./bin/hbase hbck -fixAssignments -fixMeta
There are a few classes of table integrity problems that are low risk repairs. The first two are degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is hdfs region holes. This can be repaired by using the:
  • -fixHdfsHoles option for fabricating new empty regions on the file system. If holes are detected you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent.
$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
Since this is a common operation, we’ve added a the -repairHoles flag that is equivalent to the previous command:
$ ./bin/hbase hbck -repairHoles
If inconsistencies still remain after these steps, you most likely have table integrity problems related to orphaned or overlapping regions.

B.4. Region Overlap Repairs

Table integrity problems can require repairs that deal with overlaps. This is a riskier operation because it requires modifications to the file system, requires some decision making, and may require some manual steps. For these repairs it is best to analyze the output of a hbck -details run so that you isolate repairs attempts only upon problems the checks identify. Because this is riskier, there are safeguard that should be used to limit the scope of the repairs. WARNING: This is a relatively new and have only been tested on online but idle HBase instances (no reads/writes). Use at your own risk in an active production environment! The options for repairing table integrity violations include:
  • -fixHdfsOrphans option for “adopting” a region directory that is missing a region metadata file (the .regioninfo file).
  • -fixHdfsOverlaps option for fixing overlapping regions
When repairing overlapping regions, a region’s data can be modified on the file system in two ways: 1) by merging regions into a larger region or 2) by sidelining regions, moving the data to a “sideline” directory from which it can be restored later. Merging a large number of regions is technically correct but could result in an extremely large region that requires a series of costly compactions and splitting operations. In these cases, it is probably better to sideline the regions that overlap with the most other regions (likely the largest ranges) so that merges can happen on a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native directory and HFile format, they can be restored by using HBase’s bulk load mechanism. The default safeguard thresholds are conservative. The following options let you override the default thresholds and enable the large region sidelining feature (an example invocation follows the option list below).
  • -maxMerge <n> maximum number of overlapping regions to merge
  • -sidelineBigOverlaps if more than maxMerge regions are overlapping, attempt to sideline the regions overlapping with the most other regions.
  • -maxOverlapsToSideline <n> if sidelining large overlapping regions, sideline at most n regions.
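For example, a possible invocation combining the safeguards above could look like the following (the threshold values 5 and 2 are purely illustrative; choose values that fit your cluster):
$ ./bin/hbase hbck -fixHdfsOverlaps -maxMerge 5 -sidelineBigOverlaps -maxOverlapsToSideline 2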
Since oftentimes you would just want to get the tables repaired, you can use this option to turn on all repair options:
  • -repair includes all the region consistency options and only the hole repairing table integrity options.
Finally, there are safeguards to limit repairs to only specific tables. For example, the following command would only attempt to repair tables TableFoo and TableBar.
$ ./bin/hbase hbck -repair TableFoo TableBar

B.4.1. Special cases: Meta is not properly assigned

There are a few special cases that hbck can handle as well. Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case there is a special -fixMetaOnly option that can try to fix meta assignments.
$ ./bin/hbase hbck -fixMetaOnly -fixAssignments

B.4.2. Special cases: HBase version file is missing

HBase’s data on the file system requires a version file in order to start. If this file is missing, you can use the -fixVersionFile option to fabricate a new HBase version file. This assumes that the version of hbck you are running is the appropriate version for the HBase cluster.

B.4.3. Special case: Root and META are corrupt.

The most drastic corruption scenario is the case where ROOT or META is corrupted and HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT and META regions and tables. This tool assumes that HBase is offline. It then marches through the existing HBase home directory and loads as much information from the region metadata files (.regioninfo files) as possible from the file system. If the region metadata has proper table integrity, it sidelines the original root and meta table directories and builds new ones with pointers to the region directories and their data.
$ ./bin/hbase org.apache.hadoop.hbase.util.OfflineMetaRepair
NOTE: This tool is not as clever as uberhbck, but it can be used to bootstrap repairs that uberhbck can complete. If the tool succeeds you should be able to start HBase and run online repairs if necessary.

Appendix C. Compression In HBase

C.1. CompressionTest Tool

HBase includes a tool for testing that your compression setup works. To run it, type ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest. This will print usage information for the tool.

C.2.  hbase.regionserver.codecs

If your compression installation is broken, the test will fail or the region server will fail to start. You can add the hbase.regionserver.codecs property to your hbase-site.xml, with its value set to the codecs you require. For example, if the value of hbase.regionserver.codecs is lzo,gz and LZO is missing or incorrectly installed, the RegionServer will fail to start and report the misconfiguration.
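
As an illustration, a minimal hbase-site.xml entry might look like the following sketch (the lzo,gz value simply repeats the example above; list whichever codecs your cluster must have available):

<!-- hbase-site.xml: refuse to start a RegionServer if these codecs are unavailable -->
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>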

Administrators should be careful when adding new machines to the cluster: the new machines may also need the particular compression codecs installed.

C.3.  LZO

Unfortunately, HBase is Apache-licensed while LZO is GPL-licensed, so HBase cannot ship with LZO; LZO must be installed separately before you start using HBase. See Using LZO Compression for how to use LZO with HBase.

A common problem is that users get LZO working when they first set up the cluster, but months later, when administrators add machines to the cluster, they forget about LZO. Since 0.90.0, the region server should fail to start in this situation, but it may not.

See Section C.2, “hbase.regionserver.codecs” for a feature that helps protect against a failed LZO install.

C.4.  GZIP

GZIP compresses better than LZO but is slower, and in some setups compression ratio is the priority. Java will use Java's built-in GZIP implementation unless the Hadoop native libraries are on the CLASSPATH; in that case it is better to use the native compressors. (If the native libraries are not available, you will see a lot of Got brand-new compressor messages in the logs; see Q: .)

C.5.  SNAPPY

If snappy is installed, HBase can make use of it (courtesy of hadoop-snappy [29]).

  1. Build and install snappy on all nodes of your cluster.

  2. Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:

    $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy

  3. Create a column family with snappy compression and verify it in the hbase shell:

    hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
    hbase> describe 't1'

    In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"



[29] See Alejandro's note up on the list on the difference between Snappy in Hadoop and Snappy in HBase.

C.6. Changing Compression Schemes

A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple, and can be done via an alter command. Because the compression scheme is encoded at the block level in StoreFiles, the table does not need to be re-created and the data does not need to be copied anywhere else. Just make sure the old codec is still available until you are sure that all of the old StoreFiles have been compacted.
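
A minimal hbase shell sketch, assuming a table 't1' with column family 'cf1' (both names are illustrative), GZ as the new codec, and a cluster where the table must be disabled before altering (i.e. online schema changes are not enabled):

hbase> disable 't1'
hbase> alter 't1', { NAME => 'cf1', COMPRESSION => 'GZ' }
hbase> enable 't1'
hbase> major_compact 't1'

The major compaction rewrites the existing StoreFiles with the new codec; as noted above, keep the old codec installed until that compaction has completed.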

Appendix D. YCSB: The Yahoo! Cloud Serving Benchmark and HBase

TODO: Describe how YCSB does a poor job of driving up cluster load.

TODO: Describe how to set up YCSB for HBase.

Ted Dunning redid YCSB: it is now mavenized and adds the ability to verify workloads. See Ted Dunning's YCSB.

Appendix E. HFile format version 2

E.1. Motivation

Note: this feature was introduced in HBase 0.92

We found it necessary to revise the HFile format after encountering high memory usage and slow startup times caused by large Bloom filters and block indexes in the region server. Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions. A region is not considered opened until all of its block index data is loaded. Large Bloom filters produce a different performance problem: the first get request that requires a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.

To speed up region server startup we break Bloom filters and block indexes into multiple blocks and write those blocks out as they fill up, which also reduces the HFile writer’s memory footprint. In the Bloom filter case, “filling up a block” means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block index case we accumulate an “index block” of the desired size. Bloom filter blocks and index blocks (we call these “inline blocks”) become interspersed with data blocks, and as a side effect we can no longer rely on the difference between block offsets to determine data block length, as it was done in version 1.

HFile is a low-level file format by design, and it should not deal with application-specific details such as Bloom filters, which are handled at StoreFile level. Therefore, we call Bloom filter blocks in an HFile "inline" blocks. We also supply HFile with an interface to write those inline blocks.

Another format modification aimed at reducing the region server startup time is to use a contiguous “load-on-open” section that has to be loaded in memory at the time an HFile is being opened. Currently, as an HFile opens, there are separate seek operations to read the trailer, data/meta indexes, and file info. To read the Bloom filter, there are two more seek operations for its “data” and “meta” portions. In version 2, we seek once to read the trailer and seek again to read everything else we need to open the file from a contiguous block.

E.2. HFile format version 1 overview

As we will be discussing the changes we are making to the HFile format, it is useful to give a short overview of the previous (HFile version 1) format. An HFile in the existing format is structured as follows: HFile Version 1 [30]

E.2.1.  Block index format in version 1

The block index in version 1 is very straightforward. For each entry, it contains:

  1. Offset (long)

  2. Uncompressed size (int)

  3. Key (a serialized byte array written using Bytes.writeByteArray)

    1. Key length as a variable-length integer (VInt)

    2. Key bytes

The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index. One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression. Therefore, the HFile reader has to infer this compressed size from the offset difference between blocks. We fix this limitation in version 2, where we store on-disk block size instead of uncompressed size, and get uncompressed size from the block header.
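
That inference can be illustrated with a small sketch (this is not the actual reader code; the array name is illustrative). Assuming consecutive blocks are laid out back to back, the on-disk (compressed) size of block i is simply the gap between its offset and the next recorded offset:

// Sketch: inferring a version 1 block's on-disk size from consecutive offsets.
// blockOffsets holds the offsets of all data blocks plus the offset of whatever
// immediately follows the last data block.
static long inferredOnDiskSize(long[] blockOffsets, int i) {
  return blockOffsets[i + 1] - blockOffsets[i];
}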



[30] Image courtesy of Lars George, hbase-architecture-101-storage.html.

E.3.  HBase file format with inline blocks (version 2)

E.3.1. Overview

The version of HBase introducing the above features reads both version 1 and 2 HFiles, but only writes version 2 HFiles. A version 2 HFile is structured as follows: HFile Version 2

E.3.2. Unified version 2 block format

In version 2, every block in the data section contains the following fields:

  1. 8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic records". Supported block types are:

    1. DATA – data blocks

    2. LEAF_INDEX – leaf-level index blocks in a multi-level block index

    3. BLOOM_CHUNK – Bloom filter chunks

    4. META – meta blocks (not used for Bloom filters in version 2 anymore)

    5. INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level block index

    6. ROOT_INDEX – root-level index blocks in a multi-level block index

    7. FILE_INFO – the “file info” block, a small key-value map of metadata

    8. BLOOM_META – a Bloom filter metadata block in the load-on-open section

    9. TRAILER – a fixed-size file trailer. As opposed to the above, this is not an HFile v2 block but a fixed-size (for each HFile version) data structure

    10. INDEX_V1 – this block type is only used for legacy HFile v1 blocks

  2. Compressed size of the block's data, not including the header (int).

    Can be used for skipping the current data block when scanning HFile data.

  3. Uncompressed size of the block's data, not including the header (int)

    This is equal to the compressed size if the compression algorithm is NONE

  4. File offset of the previous block of the same type (long)

    Can be used for seeking to the previous data/index block

  5. Compressed data (or uncompressed data if the compression algorithm is NONE).

The above format of blocks is used in the following HFile sections:

  1. Scanned block section. The section is named so because it contains all data blocks that need to be read when an HFile is scanned sequentially.  Also contains leaf block index and Bloom chunk blocks.

  2. Non-scanned block section. This section still contains unified-format v2 blocks but it does not have to be read when doing a sequential scan. This section contains “meta” blocks and intermediate-level index blocks.

We are supporting “meta” blocks in version 2 the same way they were supported in version 1, even though we do not store Bloom filter data in these blocks anymore.
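
As a concrete illustration of the per-block header fields listed in this section, here is a minimal Java sketch of walking one unified version 2 block header (this is not the actual HFile reader, which lives in org.apache.hadoop.hbase.io.hfile; the class and variable names are illustrative):

import java.io.DataInputStream;
import java.io.IOException;

public class BlockHeaderSketch {
  // Reads the header fields of a single unified v2 block from a stream
  // positioned at the start of that block.
  public static void readHeader(DataInputStream in) throws IOException {
    byte[] blockType = new byte[8];                    // 1. block type "magic" (DATA, LEAF_INDEX, ...)
    in.readFully(blockType);
    int onDiskSizeWithoutHeader = in.readInt();        // 2. compressed size, header not included
    int uncompressedSizeWithoutHeader = in.readInt();  // 3. uncompressed size, header not included
    long prevBlockOffset = in.readLong();              // 4. offset of the previous block of the same type
    // 5. the (possibly compressed) block data itself follows the header
    System.out.printf("type=%s onDisk=%d uncompressed=%d prev=%d%n",
        new String(blockType), onDiskSizeWithoutHeader,
        uncompressedSizeWithoutHeader, prevBlockOffset);
  }
}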

E.3.3.  Block index in version 2

There are three types of block indexes in HFile version 2, stored in two different formats (root and non-root):

  1. Data index — version 2 multi-level block index, consisting of:

    1. Version 2 root index, stored in the data block index section of the file

    2. Optionally, version 2 intermediate levels, stored in the non-root format in the data index section of the file. Intermediate levels can only be present if leaf level blocks are present

    3. Optionally, version 2 leaf levels, stored in the non-root format inline with data blocks

  2. Meta index — version 2 root index format only, stored in the meta index section of the file

  3. Bloom index — version 2 root index format only, stored in the “load-on-open” section as part of Bloom filter metadata.

E.3.4.  Root block index format in version 2

This format applies to:

  1. Root level of the version 2 data index

  2. Entire meta and Bloom indexes in version 2, which are always single-level.

A version 2 root index block is a sequence of entries of the following format, similar to entries of a version 1 block index, but storing on-disk size instead of uncompressed size.

  1. Offset (long)

    This offset may point to a data block or to a deeper-level index block.

  2. On-disk size (int)

  3. Key (a serialized byte array stored using Bytes.writeByteArray)

    1. Key length as a variable-length integer (VInt)

    2. Key bytes

A single-level version 2 block index consists of just a single root index block. To read a root index block of version 2, one needs to know the number of entries. For the data index and the meta index the number of entries is stored in the trailer, and for the Bloom index it is stored in the compound Bloom filter metadata.

For a multi-level block index we also store the following fields in the root index block in the load-on-open section of the HFile, in addition to the data structure described above:

  1. Middle leaf index block offset

  2. Middle leaf block on-disk size (meaning the leaf index block containing the reference to the “middle” data block of the file)

  3. The index of the mid-key (defined below) in the middle leaf-level block.

These additional fields are used to efficiently retrieve the mid-key of the HFile used in HFile splits, which we define as the first key of the block with a zero-based index of (n – 1) / 2, if the total number of blocks in the HFile is n. This definition is consistent with how the mid-key was determined in HFile version 1, and is reasonable in general, because blocks are likely to be the same size on average, but we don’t have any estimates on individual key/value pair sizes.
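
A tiny sketch of that arithmetic (purely illustrative):

// Zero-based index of the data block whose first key is the HFile's mid-key,
// where n is the total number of data blocks in the file.
static int midBlockIndex(int n) {
  return (n - 1) / 2;   // e.g. n = 9 data blocks -> mid-key is the first key of block 4
}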

When writing a version 2 HFile, the total number of data blocks pointed to by every leaf-level index block is kept track of. When we finish writing and the total number of leaf-level blocks is determined, it is clear which leaf-level block contains the mid-key, and the fields listed above are computed.  When reading the HFile and the mid-key is requested, we retrieve the middle leaf index block (potentially from the block cache) and get the mid-key value from the appropriate position inside that leaf block.

E.3.5.  Non-root block index format in version 2

This format applies to intermediate-level and leaf index blocks of a version 2 multi-level data block index. Every non-root index block is structured as follows.

  1. numEntries: the number of entries (int).

  2. entryOffsets: the “secondary index” of offsets of entries in the block, to facilitate a quick binary search on the key (numEntries + 1 int values). The last value is the total length of all entries in this index block. For example, in a non-root index block with entry sizes 60, 80, 50 the “secondary index” will contain the following int array: {0, 60, 140, 190}.

  3. Entries. Each entry contains:

    1. Offset of the block referenced by this entry in the file (long)

    2. On-disk size of the referenced block (int)

    3. Key. The length can be calculated from entryOffsets (see the sketch below).
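
Putting the entry layout together, the key length follows directly from the secondary index, since each entry starts with an 8-byte offset and a 4-byte on-disk size (a sketch; the method name is illustrative):

// Key length of entry i in a non-root index block, recovered from the
// "secondary index" of entry offsets described above.
static int keyLength(int[] entryOffsets, int i) {
  int entrySize = entryOffsets[i + 1] - entryOffsets[i];
  return entrySize - (8 + 4);   // subtract the long offset and the int on-disk size
}

For the earlier example ({0, 60, 140, 190}), entry 1 occupies 140 - 60 = 80 bytes, so its key is 80 - 12 = 68 bytes long.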

E.3.6.  Bloom filters in version 2

In contrast with version 1, in a version 2 HFile Bloom filter metadata is stored in the load-on-open section of the HFile for quick startup.

  1. A compound Bloom filter.

    1. Bloom filter version = 3 (int). There used to be a DynamicByteBloomFilter class that had the Bloom filter version number 2

    2. The total byte size of all compound Bloom filter chunks (long)

    3. Number of hash functions (int)

    4. Type of hash functions (int)

    5. The total key count inserted into the Bloom filter (long)

    6. The maximum total number of keys in the Bloom filter (long)

    7. The number of chunks (int)

    8. Comparator class used for Bloom filter keys, a UTF-8 encoded string stored using Bytes.writeByteArray

    9. Bloom block index in the version 2 root block index format

E.3.7. File Info format in versions 1 and 2

The file info block is a serialized HBaseMapWritable (essentially a map from byte arrays to byte arrays) with the following keys, among others. StoreFile-level logic adds more keys to this.

hfile.LASTKEY

The last key of the file (byte array)

hfile.AVG_KEY_LEN

The average key length in the file (int)

hfile.AVG_VALUE_LEN

The average value length in the file (int)

File info format did not change in version 2. However, we moved the file info to the final section of the file, which can be loaded as one block at the time the HFile is being opened. Also, we no longer store the comparator in the version 2 file info; instead, we store it in the fixed file trailer, because we need to know the comparator when parsing the load-on-open section of the HFile.

E.3.8.  Fixed file trailer format differences between versions 1 and 2

The following table shows common and different fields between fixed file trailers in versions 1 and 2. Note that the size of the trailer is different depending on the version, so it is “fixed” only within one version. However, the version is always stored as the last four-byte integer in the file.

Fields common to both versions:

  • File info offset (long)
  • Number of data index entries (int)
  • Number of meta index entries (int)
  • Total uncompressed bytes (long)
  • Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int)

Fields that differ between the versions:

  • Data index offset (long) in version 1 is replaced in version 2 by loadOnOpenOffset (long), the offset of the section that we need to load when opening the file.
  • metaIndexOffset (long) in version 1 (this field is not being used by the version 1 reader, so we removed it from version 2) is replaced in version 2 by uncompressedDataIndexSize (long), the total uncompressed size of the whole data block index, including root-level, intermediate-level, and leaf-level blocks.
  • numEntries is an int in version 1 and a long in version 2.
  • The trailing version field is Version: 1 (int) in version 1 and Version: 2 (int) in version 2.

Fields present only in version 2:

  • The number of levels in the data block index (int)
  • firstDataBlockOffset (long), the offset of the first data block. Used when scanning.
  • lastDataBlockEnd (long), the offset of the first byte after the last key/value data block. We don't need to go beyond this offset when scanning.

Appendix F. Other Information About HBase

F.1. HBase Videos

Introduction to HBase

Building Real Time Services at Facebook with HBase by Jonathan Gray (Hadoop World 2011).

HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon by JD Cryans (Hadoop World 2010).

F.2. HBase Presentations (Slides)

Advanced HBase Schema Design by Lars George (Hadoop World 2011).

Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011).

Getting The Most From Your HBase Install by Ryan Rawson, Jonathan Gray (Hadoop World 2009).

F.3. HBase Papers

BigTable by Google (2006).

HBase and HDFS Locality by Lars George (2010).

No Relation: The Mixed Blessings of Non-Relational Databases by Ian Varley (2009).

F.4. HBase Sites

Cloudera's HBase Blog has a lot of links to useful HBase information.

  • CAP Confusion is a relevant entry for background information on distributed storage systems.

HBase Wiki has a page with a number of presentations.

F.5. HBase Books

HBase: The Definitive Guide by Lars George.

F.6. Hadoop Books

Hadoop: The Definitive Guide by Tom White.

Appendix H. HBase and the Apache Software Foundation

HBase is a project in the Apache Software Foundation and as such there are responsibilities to the ASF to ensure a healthy project.

H.1. ASF Development Process

See the Apache Development Process page for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), for tips on contributing and getting involved, and for how open source works at the ASF.

H.2. ASF Board Reporting

Once a quarter, each project in the ASF portfolio submits a report to the ASF board. This is done by the HBase project lead and the committers. See ASF board reporting for more information.

Appendix I. Enabling Dapper-like Tracing in HBase

HBASE-6449 added support for tracing requests through HBase, using the open source tracing library HTrace. Setting up tracing is quite simple; however, it currently requires some very minor changes to your client code (it would not be very difficult to remove this requirement).

I.1. SpanReceivers

The tracing system works by collecting information in structs called ‘Spans’. It is up to you to choose how you want to receive this information by implementing the SpanReceiver interface, which defines one method:

public void receiveSpan(Span span);

This method serves as a callback whenever a span is completed. HTrace allows you to use as many SpanReceivers as you want so you can easily send trace information to multiple destinations.
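
For instance, a minimal receiver could simply print each completed span. The sketch below takes the interface exactly as described above; the import package is an assumption and differs between HTrace versions, so check the jars shipped with your HBase:

import org.cloudera.htrace.Span;          // assumed package; adjust to your HTrace version
import org.cloudera.htrace.SpanReceiver;  // assumed package; adjust to your HTrace version

// Illustrative receiver that dumps every completed span to stdout.
// Real receivers, such as HBaseLocalFileSpanReceiver, write structured output instead.
public class StdoutSpanReceiver implements SpanReceiver {
  public void receiveSpan(Span span) {  // callback invoked when a span is completed
    System.out.println(span);           // rely on Span's toString() for a one-line summary
  }
}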

Configure which SpanReceivers you’d like to use by setting the hbase-site.xml property hbase.trace.spanreceiver.classes to a comma-separated list of the fully-qualified class names of classes implementing SpanReceiver.

HBase includes an HBaseLocalFileSpanReceiver that writes all span information to local files in a JSON-based format. The HBaseLocalFileSpanReceiver looks in hbase-site.xml for a hbase.trace.spanreceiver.localfilespanreceiver.filename property whose value names the file to which nodes should write their span information.
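
A hedged hbase-site.xml sketch wiring up the included receiver (the fully-qualified class name and the output path are assumptions for illustration; substitute the class name from your HBase distribution and a path of your choosing):

<property>
  <name>hbase.trace.spanreceiver.classes</name>
  <!-- assumed package; verify against your HBase jars -->
  <value>org.apache.hadoop.hbase.trace.HBaseLocalFileSpanReceiver</value>
</property>
<property>
  <name>hbase.trace.spanreceiver.localfilespanreceiver.filename</name>
  <!-- illustrative output path -->
  <value>/tmp/htrace-spans.json</value>
</property>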

If you do not want to use the included HBaseLocalFileSpanReceiver, you are encouraged to write your own receiver (take a look at HBaseLocalFileSpanReceiver for an example). If you think others would benefit from your receiver, file a JIRA or send a pull request to HTrace.

I.2. Client Modifications

Currently, you must turn on tracing in your client code. To do this, you simply turn on tracing for requests you think are interesting, and turn it off when the request is done.

For example, if you wanted to trace all of your get operations, you change this:

HTable table = new HTable(...);
Get get = new Get(...);

into:

Span getSpan = Trace.startSpan("doing get", Sampler.ALWAYS);
try {
  HTable table = new HTable(...);
  Get get = new Get(...);
  ...
} finally {
  getSpan.stop();
}

If you wanted to trace half of your ‘get’ operations, you would pass in:

new ProbabilitySampler(0.5)

in lieu of Sampler.ALWAYS to Trace.startSpan(). See the HTrace README for more information on Samplers.

 

Index

C

Cells, Cells
Column Family, Column Family
Column Family Qualifier, Column Family
Compression, Compression In HBase
 

H

Hadoop, hadoop

L

LargeTests

M

MediumTests,
MSLAB, Long GC pauses

N

nproc, ulimit 和 nproc

S

SmallTests,

U

ulimit, ulimit 和 nproc

V

Versions, 版本

Z

ZooKeeper, ZooKeeper