HBase composite rowkeys and partial key scans

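The name "partial key scan" does not really convey how the feature behaves; "prefix key scan" would be a better name. The partial key is only useful when it is a leading prefix of the rowkey. A sub-key in the middle of the rowkey cannot be used this way.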

For example, suppose the rowkey has the form <key1>-<key2>-<key3>.

A partial scan by key2 or key3 alone is not possible.
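Why the prefix matters can be sketched with a plain sorted set standing in for the table. This is an in-memory model, not the HBase client API: the class and method names are made up for illustration, and it assumes ASCII rowkeys so that Java string order agrees with HBase's byte order. A prefix becomes a contiguous range [prefix, stop), while a lookup by the middle key has to examine every row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory model: a sorted set of rowkeys plays the role of a table.
final class PrefixScanSketch {

    // Prefix scan as a range scan over [prefix, stop), where stop is the prefix
    // with its last character incremented (assumes ASCII keys, no overflow).
    static List<String> prefixScan(NavigableSet<String> sortedKeys, String prefix) {
        if (prefix.isEmpty()) {
            return new ArrayList<>(sortedKeys);
        }
        int last = prefix.length() - 1;
        String stop = prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
        return new ArrayList<>(sortedKeys.subSet(prefix, true, stop, false));
    }

    // Lookup by the middle sub-key: matching rows are scattered across the
    // sorted order, so every rowkey has to be inspected.
    static List<String> scanByMiddleKey(NavigableSet<String> sortedKeys, String key2) {
        List<String> hits = new ArrayList<>();
        for (String rowkey : sortedKeys) {
            String[] parts = rowkey.split("-");
            if (parts.length == 3 && parts[1].equals(key2)) {
                hits.add(rowkey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>(
                Arrays.asList("a-x-1", "a-y-2", "b-x-3", "b-y-4", "c-x-5"));
        System.out.println(prefixScan(keys, "a-"));
        System.out.println(scanByMiddleKey(keys, "x"));
    }
}
```

The first call touches only the rows inside the range; the second walks the whole set, which is what a scan by key2 amounts to when key2 is not the leading part of the rowkey.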


There are several ways to work around this:

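1) Redundancy. Create a second table whose composite rowkey starts with the sub-key you want to query by, for example key2.

2) Exploit a sub-key that has few distinct values. For example, if key3 has only a few distinct values, put it at the start of the rowkey: <key3>-<key2>-<key1>. To query by key2, enumerate the key3 values, build a key3-key2 prefix for each one, and run one partial scan per prefix.

See http://stackoverflow.com/questions/12908378/hbase-searching-by-part-of-a-key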
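Option 2 can be sketched as follows, again with a sorted set standing in for the table rather than the real HBase client. The class name, the sample keys, and the "-" delimiter handling are illustrative assumptions; the rowkey layout is <key3>-<key2>-<key1> as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory model of "enumerate key3, then prefix-scan key3-key2".
final class EnumeratedPrefixScanSketch {

    static List<String> scanByKey2(NavigableSet<String> sortedKeys,
                                   List<String> key3Values, String key2) {
        List<String> hits = new ArrayList<>();
        for (String key3 : key3Values) {
            // The trailing "-" keeps key2 "x" from also matching key2 "xy".
            String prefix = key3 + "-" + key2 + "-";
            // '.' is the character right after '-', so [prefix, stop) contains
            // exactly the keys that start with the prefix (ASCII assumption).
            String stop = key3 + "-" + key2 + ".";
            hits.addAll(sortedKeys.subSet(prefix, true, stop, false));
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>(Arrays.asList(
                "eu-x-1", "eu-y-2", "us-x-3", "us-x-4", "us-xy-6", "us-y-5"));
        System.out.println(scanByKey2(keys, Arrays.asList("eu", "us"), "x"));
    }
}
```

The cost is one range scan per key3 value, which is why the approach only pays off when key3 has few distinct values.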

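3) Fuzzy row filter.

This lets you scan with a wildcard-style pattern on a sub-key in the middle of the rowkey. (The keys being matched must have a fixed length, though.)

It is still essentially a full scan, but scan performance improves because part of the data is skipped. How much it improves depends on how much data can be skipped: if the set of keys to be filtered is large and maps to many rows, almost nothing can be skipped, every row has to be matched one by one, and the filter brings little benefit.

See http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/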

Performance of the scan based on FuzzyRowFilter usually depends on the cardinality of the fuzzy part. E.g. in the example above, if users number is several hundreds to several thousand, the scan should be very fast: there will only be several hundreds or thousand “jumps” and huge amount of rows might be skipped. If the cardinality is high then scan can take a lot of time. The worst-case scenario is when you have N records and N users, i.e. one record per user. In this case there’s simply nothing to skip.
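The matching rule behind a fuzzy row filter can be modelled in a few lines. This is a simplified sketch, not the HBase FuzzyRowFilter class itself: it assumes fixed-width ASCII rowkeys of the form <4 chars>-<4 chars>, uses a '?' template as a hypothetical convenience, and follows the mask convention described in the blog post above (0 means the byte must match, 1 means any byte is accepted). It only models matching; the jumps over non-matching ranges that make the real filter fast are not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of fuzzy rowkey matching with a byte mask.
final class FuzzyMatchSketch {

    // mask[i] == 0: row[i] must equal pattern[i]; mask[i] == 1: wildcard position.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Builds the mask from a template in which '?' marks a wildcard position.
    static byte[] maskFor(String template) {
        byte[] mask = new byte[template.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (template.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    // Checks every rowkey against the template; returns the matching ones in order.
    static List<String> fuzzyScan(List<String> rowkeys, String template) {
        byte[] pattern = template.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = maskFor(template);
        List<String> hits = new ArrayList<>();
        for (String rowkey : rowkeys) {
            if (matches(rowkey.getBytes(StandardCharsets.US_ASCII), pattern, mask)) {
                hits.add(rowkey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("0001-0002", "0001-0003", "0002-0002", "0003-0001");
        System.out.println(fuzzyScan(rows, "????-0002"));
    }
}
```

The fixed-length requirement shows up directly here: the mask addresses byte positions, so key2 has to start at the same offset in every rowkey.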


http://grokbase.com/t/hbase/user/13agwrnmej/how-to-query-based-on-partial-row-key

You can use RowFilter in combination with RegexStringComparator.
See RegexStringComparator's javadoc:

  * This comparator is for use with {@link CompareFilter} implementations, such
  * as {@link RowFilter}, {@link QualifierFilter}, and {@link ValueFilter}, for

See also TestFilter#testRowFilter()
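What a regex-based row filter does to a scan can be approximated with java.util.regex. This is only a model of the idea, not the RowFilter or RegexStringComparator API: the class name and sample keys are invented, and the use of find() to decide a match is an assumption about the comparator's behaviour (the pattern below is anchored, so find() and a full match agree).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model: a regex applied to every rowkey of a scan.
final class RegexRowScanSketch {

    static List<String> regexScan(List<String> rowkeys, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        // Every rowkey is tested; nothing is skipped, so this is a full scan.
        for (String rowkey : rowkeys) {
            if (pattern.matcher(rowkey).find()) {
                hits.add(rowkey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a-x-1", "a-y-2", "b-x-3", "b-xy-4");
        // Match rows of the form <key1>-x-<key3>.
        System.out.println(regexScan(rows, "^[^-]+-x-[^-]+$"));
    }
}
```

Unlike the fuzzy filter, the regex does not need fixed-length sub-keys, but in this model it has no way to skip rows.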


You can also try Phoenix, which does this automatically for you.
(https://github.com/forcedotcom/phoenix)



Rowkey design notes:

1. See the discussion in "HBase: The Definitive Guide".


2. The official site also has some material.

(http://hbase.apache.org/book/schema.html, http://hbase.apache.org/book/schema.casestudies.html)

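1) Keep the number of column families as small as possible (<=2), ideally just one. Flushing and compaction are done per region, so when one family has to flush, all the other families are flushed as well.

2) Keep the rowkey, column family, and column key as short as possible. This follows from the HBase storage format: every cell is stored with its rowkey, column family, and column key in front of the value. (Compression should presumably help with this?)

Also, data types such as int and String take different amounts of storage; an int generally takes less space than the same number written out as a String.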
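The size difference is easy to check with plain Java. The sketch below uses ByteBuffer and UTF-8 rather than any HBase utility class; the class and method names are illustrative, and the 4-byte big-endian layout is an assumption about how an int would be serialized.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical comparison of two encodings of the same number.
final class EncodedSizeSketch {

    // Fixed-width binary encoding: always 4 bytes, big-endian.
    static byte[] encodeAsInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Decimal text encoding: one byte per digit (plus one for a minus sign).
    static byte[] encodeAsText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = 1234567890;
        System.out.println("int bytes:  " + encodeAsInt(value).length);
        System.out.println("text bytes: " + encodeAsText(value).length);
    }
}
```

For a ten-digit number the binary form is 4 bytes against 10 for the text form. For very small numbers the text form can be shorter, but its width varies, while the binary form stays fixed. Because the rowkey is repeated in every cell, the difference is multiplied by the number of cells.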




