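I. Why does the Memstore exist?
In HBase, each HRegionServer hosts multiple HRegions, and each HRegion contains multiple HStores; the Memstore is one component of an HStore. Under heavy write load, once a Memstore exceeds its configured threshold, its contents are flushed to an HFile. By default HBase stores its files on HDFS, but HDFS simply stores the raw data it is given and does nothing HBase-specific with it, such as sorting by rowkey or filtering versions. HBase is used precisely because it supports fast lookups, so rowkey order must be guaranteed. The Memstore layer was added to the design for two reasons: first, to speed up responses; second, to sort the data and discard garbage (for example, a column family that only needs the latest version does not have to store several versions) before it is flushed to disk. Because the data has already been optimized by the time it reaches the HFile, lookups are much faster.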
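To make the "sort and de-duplicate in memory, then flush" idea concrete, here is a minimal sketch in Java. The class SortedWriteBuffer and its methods put and flush are hypothetical names invented for this illustration, sizes are approximated by character counts, and only one version per rowkey is kept. It is not HBase's Memstore implementation.

```java
import java.util.TreeMap;

/** Hypothetical illustration only; this is not HBase's MemStore class. */
final class SortedWriteBuffer {
    // A sorted map keeps cells ordered by row key as they arrive.
    private final TreeMap<String, String> cells = new TreeMap<>();
    private final long flushThresholdChars;
    private long sizeChars = 0;

    SortedWriteBuffer(long flushThresholdChars) {
        this.flushThresholdChars = flushThresholdChars;
    }

    /** Stores the newest value for a row key; returns true once a flush is due. */
    synchronized boolean put(String rowKey, String value) {
        String previous = cells.put(rowKey, value); // an older version is overwritten
        if (previous == null) {
            sizeChars += rowKey.length() + value.length();
        } else {
            sizeChars += value.length() - previous.length();
        }
        return sizeChars >= flushThresholdChars;
    }

    /** Hands back the contents in row-key order and empties the buffer. */
    synchronized TreeMap<String, String> flush() {
        TreeMap<String, String> snapshot = new TreeMap<>(cells);
        cells.clear();
        sizeChars = 0;
        return snapshot;
    }

    public static void main(String[] args) {
        SortedWriteBuffer buffer = new SortedWriteBuffer(18);
        buffer.put("row3", "v1");
        buffer.put("row1", "v1");
        boolean due = buffer.put("row3", "v2-newer");
        System.out.println("flush due: " + due);
        System.out.println("flushed in order: " + buffer.flush());
    }
}
```

Whatever order the writes arrive in, the flushed snapshot is already sorted by rowkey and holds only the newest value, which is the property the on-disk file needs for fast lookups.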
HBase has several configuration options related to the Memstore. Before going through them, note that the smallest unit of a Memstore flush is an HRegion. The options are as follows:
hbase.hregion.memstore.flush.size
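This value is the memory threshold for the Memstore of a single HStore within a single HRegion on an HRegionServer. The default is 128M. When a single Memstore exceeds this size, the Memstores of its HRegion are flushed. (Because the HRegion is the smallest flush unit, if the Memstore of any one of its HStores exceeds the threshold, the Memstores of the whole HRegion are flushed. This flush does not block updates.) Note that as the data volume grows, the number of HRegions on a single HRegionServer grows too, and with it the total Memstore size on that HRegionServer.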
hbase.regionserver.global.memstore.size
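When the combined size of all Memstores of all HRegions on a single HRegionServer exceeds this setting, a flush is also forced, and in this case updates are blocked as well. (This is the situation to avoid most: blocking updates on the HRegionServer affects every HRegion it hosts.) By default, hbase.regionserver.global.memstore.size is 40% of the heap. Once the flush has been triggered, the block is released when the total Memstore size on the HRegionServer drops to the following value:
hbase.regionserver.global.memstore.size.lower.limit * hbase.regionserver.global.memstore.size * hbase_heapsize
(This is a neat detail. Because this flush blocks updates to the HRegionServer, it does not keep flushing indefinitely; it flushes only down to an acceptable lower bound and then stops blocking.)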
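The block-then-release behaviour described above can be sketched as a small state machine. This is only an illustration of the description in this article, not HBase's actual code; the class WaterMarkGate, its update method, and the sample numbers are all hypothetical.

```java
/** Hypothetical sketch of the high/low water mark behaviour described above. */
final class WaterMarkGate {
    private final long highMark; // heap * global memstore fraction
    private final long lowMark;  // highMark * lower-limit fraction
    private boolean blocking = false;

    WaterMarkGate(long heapBytes, double globalFraction, double lowerLimitFraction) {
        this.highMark = (long) (heapBytes * globalFraction);
        this.lowMark = (long) (this.highMark * lowerLimitFraction);
    }

    /**
     * Feeds in the current total memstore size.
     * Returns true while updates would have to wait.
     */
    boolean update(long globalMemstoreBytes) {
        if (globalMemstoreBytes >= highMark) {
            blocking = true;      // crossed the upper limit: start blocking
        } else if (globalMemstoreBytes < lowMark) {
            blocking = false;     // flushed down below the low mark: release
        }
        // Between the two marks the previous state is kept.
        return blocking;
    }

    public static void main(String[] args) {
        WaterMarkGate gate = new WaterMarkGate(1000, 0.4, 0.95);
        long[] samples = {350, 400, 390, 370, 390};
        for (long size : samples) {
            System.out.println(size + " -> blocking=" + gate.update(size));
        }
    }
}
```

The gap between the two marks is what keeps the server from flipping between blocked and unblocked on every small write.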
hbase.hregion.memstore.block.multiplier
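As noted above, an HRegion has N HStores, one for each column family of the table. This setting means that if the total size of all Memstores in an HRegion exceeds
hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
then a flush of that HRegion is also triggered, and, as the name of the setting suggests, updates to that HRegion are blocked in the meantime.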
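For example:
heap: 1G
hbase.regionserver.global.memstore.size = 1 * 1024M * 40% ≈ 410M
hbase.regionserver.global.memstore.size.lower.limit = 1 * 1024M * 40% * 0.95 ≈ 389M
hbase.hregion.memstore.flush.size = 128M
hbase.hregion.memstore.block.multiplier = 4
Now suppose a single HRegionServer hosts 4 HRegions, each of which has only one HStore:
HStore1: 100M of Memstore in use
HStore2: 110M of Memstore in use
HStore3: 110M of Memstore in use
HStore4: 100M of Memstore in use
No single HStore exceeds the default 128M, but the total (420M) exceeds the value of
hbase.regionserver.global.memstore.size (about 410M), so a flush is triggered anyway, and updates to this HRegionServer are blocked.
So the total number of HRegions on an HRegionServer and the number of HStores in each HRegion both have to be weighed when choosing the settings above. These settings are, in effect, the conditions that decide when a Memstore flush is triggered.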
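The arithmetic in the example can be checked with a few lines of Java. The class MemstoreLimits and its helper methods are hypothetical names for this sketch; the fractions 0.4 and 0.95 and the sizes are the ones used in the example above.

```java
/** Hypothetical helper that reproduces the arithmetic of the example above. */
final class MemstoreLimits {
    static final long MB = 1024L * 1024L;

    /** Upper limit: heap * global memstore fraction. */
    static long globalLimit(long heapBytes, double globalFraction) {
        return (long) (heapBytes * globalFraction);
    }

    /** Lower mark: upper limit * lower-limit fraction. */
    static long lowerMark(long globalLimitBytes, double lowerFraction) {
        return (long) (globalLimitBytes * lowerFraction);
    }

    /** Per-region blocking size: flush size * block multiplier. */
    static long regionBlockingSize(long flushSizeBytes, int multiplier) {
        return flushSizeBytes * multiplier;
    }

    /** Sum of the memstore sizes of all stores on the server. */
    static long total(long[] storeSizes) {
        long sum = 0;
        for (long s : storeSizes) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        long limit = globalLimit(1024 * MB, 0.4);
        long low = lowerMark(limit, 0.95);
        long[] stores = {100 * MB, 110 * MB, 110 * MB, 100 * MB};
        System.out.println("global limit (MB): " + limit / MB);
        System.out.println("lower mark (MB):   " + low / MB);
        System.out.println("total in use (MB): " + total(stores) / MB);
        System.out.println("global flush triggered: " + (total(stores) > limit));
    }
}
```

Integer division truncates, so the printed megabyte figures are rounded down rather than to the nearest value.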
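II. Source code analysis of the HRegionServer-level flush
When an HRegionServer-level flush is triggered, updates are blocked. Once it has been triggered, the work is still broken down into HRegion-level flushes, because each HRegion on the HRegionServer has to be flushed individually.
The HRegionServer class has a member variable dedicated to handling flushes: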
protected MemStoreFlusher cacheFlusher;
Following it into the source of the MemStoreFlusher class, these are some of the important field definitions:
  /** A delay queue holding the entries that are waiting to be flushed */
  private final BlockingQueue<FlushQueueEntry> flushQueue =
    new DelayQueue<FlushQueueEntry>();
  /** Map whose key is an HRegion waiting to be flushed and whose value is the entry object wrapping that HRegion */
  private final Map<Region, FlushRegionEntry> regionsInQueue =
    new HashMap<Region, FlushRegionEntry>();
  /** Flag recording that a wakeup of the flush thread is pending */
  private AtomicBoolean wakeupPending = new AtomicBoolean();
  /** How often the flush threads wake up */
  private final long threadWakeFrequency;
  // Holds a reference to the HRegionServer
  private final HRegionServer server;
  /** Read-write lock */
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  /** Monitor object used for wait and notify between threads */
  private final Object blockSignal = new Object();
  /** Global Memstore size limit */
  protected long globalMemStoreLimit;
  /** Low water mark, as a fraction of the global limit */
  protected float globalMemStoreLimitLowMarkPercent;
  /** Low water mark, as a size */
  protected long globalMemStoreLimitLowMark;
  /** Blocking wait time */
  private long blockingWaitTime;
  private final Counter updatesBlockedMsHighWater = new Counter();
  /** The threads that handle flushes */
  private final FlushHandler[] flushHandlers;
  private List<FlushRequestListener> flushRequestListeners = new ArrayList<FlushRequestListener>(1);
The constructor initializes them:
  public MemStoreFlusher(final Configuration conf,
      final HRegionServer server) {
    super();
    this.conf = conf;
    this.server = server;
    /** Thread wake frequency, 10s by default, mainly so that the threads handling HRegion flushes do not sleep indefinitely */
    this.threadWakeFrequency =
      conf.getLong(HConstants.THREAD_WAKE_FREQUENCY, 10 * 1000);
    /** Get the maximum heap size */
    long max = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
    /** Get globalMemStorePercent, the fraction of the heap the global Memstore may use; 0.4f by default */
    float globalMemStorePercent = HeapMemorySizeUtil.getGlobalMemStorePercent(conf, true);
    /** Compute the global Memstore size limit; by default 40% of the heap */
    this.globalMemStoreLimit = (long) (max * globalMemStorePercent);
    /** Get the lower-limit fraction of the global Memstore limit; 0.95f by default */
    this.globalMemStoreLimitLowMarkPercent =
      HeapMemorySizeUtil.getGlobalMemStoreLowerMark(conf, globalMemStorePercent);
    /** Compute the low water mark of the global Memstore; by default heap size * 0.4 * 0.95 */
    this.globalMemStoreLimitLowMark =
      (long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);
    /** Blocking wait time */
    this.blockingWaitTime = conf.getInt("hbase.hstore.blockingWaitTime",
      90000);
    /** Number of threads that process the HRegions queued for flushing; 2 by default */
    int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
    this.flushHandlers = new FlushHandler[handlerCount];
    LOG.info("globalMemStoreLimit="
      + TraditionalBinaryPrefix.long2String(this.globalMemStoreLimit, "", 1)
      + ", globalMemStoreLimitLowMark="
      + TraditionalBinaryPrefix.long2String(this.globalMemStoreLimitLowMark, "", 1)
      + ", maxHeap=" + TraditionalBinaryPrefix.long2String(max, "", 1));
  }
Adding an HRegion that needs flushing to the queue:
  @Override
  public void requestFlush(Region r, boolean forceFlushAllStores) {
    synchronized (regionsInQueue) {
      /** Only if this Region is not already queued */
      if (!regionsInQueue.containsKey(r)) {
        /** Build a FlushRegionEntry wrapping the Region. No delay is set here, so the entry can be dequeued and flushed as soon as it is enqueued */
        FlushRegionEntry fqe = new FlushRegionEntry(r, forceFlushAllStores);
        /** Put it into the map */
        this.regionsInQueue.put(r, fqe);
        /** Add it to the queue of pending flushes */
        this.flushQueue.add(fqe);
      }
    }
  }
Adding an HRegion to the queue with a delay:
  @Override
  public void requestDelayedFlush(Region r, long delay, boolean forceFlushAllStores) {
    synchronized (regionsInQueue) {
      if (!regionsInQueue.containsKey(r)) {
        // This entry has some delay
        FlushRegionEntry fqe = new FlushRegionEntry(r, forceFlushAllStores);
        /** Set the time at which the entry becomes due */
        fqe.requeue(delay);
        this.regionsInQueue.put(r, fqe);
        this.flushQueue.add(fqe);
      }
    }
  }
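The two enqueue methods rely on java.util.concurrent.DelayQueue: an entry can only be taken once its delay has expired, so an entry with no delay is available immediately while a delayed one stays hidden. The following is a minimal sketch of that mechanism together with the "skip if already queued" check. DelayedFlushEntry and FlushQueueSketch are hypothetical classes written for this illustration, and regions are represented by plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical queue entry; becomes available once its delay has expired. */
final class DelayedFlushEntry implements Delayed {
    final String regionName;
    private final long readyAtNanos;

    DelayedFlushEntry(String regionName, long delayMs) {
        this.regionName = regionName;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        if (other instanceof DelayedFlushEntry) {
            return Long.signum(readyAtNanos - ((DelayedFlushEntry) other).readyAtNanos);
        }
        return Long.signum(getDelay(TimeUnit.NANOSECONDS) - other.getDelay(TimeUnit.NANOSECONDS));
    }
}

/** Hypothetical sketch of a de-duplicating flush queue. */
final class FlushQueueSketch {
    private final DelayQueue<DelayedFlushEntry> queue = new DelayQueue<>();
    private final Map<String, DelayedFlushEntry> inQueue = new HashMap<>();

    /** Enqueues a region unless it is already waiting; returns true if it was added. */
    boolean request(String regionName, long delayMs) {
        synchronized (inQueue) {
            if (inQueue.containsKey(regionName)) {
                return false;
            }
            DelayedFlushEntry entry = new DelayedFlushEntry(regionName, delayMs);
            inQueue.put(regionName, entry);
            queue.add(entry);
            return true;
        }
    }

    /** Returns the next entry whose delay has expired, or null if none is due yet. */
    DelayedFlushEntry pollReady() {
        DelayedFlushEntry entry = queue.poll();
        if (entry != null) {
            synchronized (inQueue) {
                inQueue.remove(entry.regionName);
            }
        }
        return entry;
    }

    public static void main(String[] args) {
        FlushQueueSketch q = new FlushQueueSketch();
        q.request("region-delayed", 60_000);
        q.request("region-now", 0);
        DelayedFlushEntry first = q.pollReady();
        System.out.println("first ready: " + (first == null ? null : first.regionName));
        System.out.println("anything else ready: " + (q.pollReady() != null));
    }
}
```

The companion map is what prevents the same region from being queued twice, mirroring the regionsInQueue check in the two methods above.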
Executing the flush once an entry has come out of the queue:
  private boolean flushRegion(final FlushRegionEntry fqe) {
    Region region = fqe.region;
    /** If the region is not a meta region and it has too many HFiles, the flush is put off for now */
    if (!region.getRegionInfo().isMetaRegion() &&
        isTooManyStoreFiles(region)) {
      /** Too many files, but the blocking wait time has run out, so go ahead with the flush anyway */
      if (fqe.isMaximumWait(this.blockingWaitTime)) {
        LOG.info("Waited " + (EnvironmentEdgeManager.currentTime() - fqe.createTime) +
          "ms on a compaction to clean up 'too many store files'; waited " +
          "long enough... proceeding with flush of " +
          region.getRegionInfo().getRegionNameAsString());
      } else {
        // If this is first time we've been put off, then emit a log message.
        /** This is the first time the flush has been put off (we are still within the wait time). With too many HFiles under an HRegion a flush is slow, and an HRegionServer-level flush blocks updates, so this check is here to avoid blocking for a long time */
        if (fqe.getRequeueCount() <= 0) {
          // Note: We don't impose blockingStoreFiles constraint on meta regions
          LOG.warn("Region " + region.getRegionInfo().getRegionNameAsString() + " has too many " +
            "store files; delaying flush up to " + this.blockingWaitTime + "ms");
          /** Request a split of this HRegion; if no split is requested, request a compaction of its HFiles instead */
          if (!this.server.compactSplitThread.requestSplit(region)) {
            try {
              this.server.compactSplitThread.requestSystemCompaction(
                region, Thread.currentThread().getName());
            } catch (IOException e) {
              LOG.error("Cache flush failed for region " +
                Bytes.toStringBinary(region.getRegionInfo().getRegionName()),
                RemoteExceptionHandler.checkIOException(e));
            }
          }
        }
        // Put back on the queue. Have it come back out of the queue
        // after a delay of this.blockingWaitTime / 100 ms.
        /** Put the entry back on the queue with a delay */
        this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
        // Tell a lie, it's not flushed but it's ok
        return true;
      }
    }
    /** In every other case, perform the real flush */
    return flushRegion(region, false, fqe.isForceFlushAllStores());
  }
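The put-off logic in this method boils down to a small decision table, which the sketch below isolates. FlushPutOffSketch and its methods are hypothetical names; the sketch assumes that isMaximumWait means "the entry has been waiting longer than blockingWaitTime", which is how the log message in the code above reads.

```java
/** Hypothetical distillation of the "flush now or requeue" decision shown above. */
final class FlushPutOffSketch {
    enum Decision { FLUSH_NOW, REQUEUE }

    /**
     * Meta regions are never put off. Other regions with too many store files
     * are requeued until they have waited longer than blockingWaitTimeMs.
     */
    static Decision decide(boolean isMetaRegion, boolean tooManyStoreFiles,
                           long waitedMs, long blockingWaitTimeMs) {
        boolean waitedLongEnough = waitedMs > blockingWaitTimeMs; // assumed meaning of isMaximumWait
        if (!isMetaRegion && tooManyStoreFiles && !waitedLongEnough) {
            return Decision.REQUEUE;
        }
        return Decision.FLUSH_NOW;
    }

    /** Delay applied each time the entry is put back on the queue. */
    static long requeueDelayMs(long blockingWaitTimeMs) {
        return blockingWaitTimeMs / 100;
    }

    public static void main(String[] args) {
        long blockingWaitTimeMs = 90_000;
        System.out.println(decide(false, true, 1_000, blockingWaitTimeMs));
        System.out.println(decide(false, true, 90_001, blockingWaitTimeMs));
        System.out.println("requeue delay ms: " + requeueDelayMs(blockingWaitTimeMs));
    }
}
```

With the default of 90000 ms, an entry is rechecked every 900 ms, so a region whose compaction finishes early is flushed soon afterwards rather than after the full wait.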
The method that actually performs the flush:
  private boolean flushRegion(final Region region, final boolean emergencyFlush,
      boolean forceFlushAllStores) {
    long startTime = 0;
    synchronized (this.regionsInQueue) {
      /** First remove the region from regionsInQueue */
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      // Use the start time of the FlushRegionEntry if available
      if (fqe != null) {
        /** Use the entry's creation time as the start time of the flush */
        startTime = fqe.createTime;
      }
      /** For an emergency flush (such as the one issued by flushOneForGlobalPressure), the entry was not taken off via flushQueue.poll, so it has to be removed from flushQueue explicitly */
      if (fqe != null && emergencyFlush) {
        // Need to remove from region from delay queue. When NOT an
        // emergencyFlush, then item was removed via a flushQueue.poll.
        flushQueue.remove(fqe);
      }
    }
    if (startTime == 0) {
      // Avoid getting the system time unless we don't have a FlushRegionEntry;
      // shame we can't capture the time also spent in the above synchronized
      // block
      startTime = EnvironmentEdgeManager.currentTime();
    }
    /** Take the read lock, which holds off any thread that wants the write lock */
    lock.readLock().lock();
    try {
      /** Tell the flush request listeners what type of flush this is; the types are NORMAL, ABOVE_LOWER_MARK and ABOVE_HIGHER_MARK */
      notifyFlushRequest(region, emergencyFlush);
      /** Perform the flush */
      FlushResult flushResult = region.flush(forceFlushAllStores);
      /** Check whether the HFiles need compacting after the flush */
      boolean shouldCompact = flushResult.isCompactionNeeded();
      // We just want to check the size
      /** Check whether the HRegion needs to be split */
      boolean shouldSplit = ((HRegion)region).checkSplit() != null;
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region);
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(
          region, Thread.currentThread().getName());
      }
      if (flushResult.isFlushSucceeded()) {
        long endTime = EnvironmentEdgeManager.currentTime();
        server.metricsRegionServer.updateFlushTime(endTime - startTime);
      }
    } catch (DroppedSnapshotException ex) {
      // Cache flush can fail in a few places. If it fails in a critical
      // section, we get a DroppedSnapshotException and a replay of wal
      // is required. Currently the only way to do this is a restart of
      // the server. Abort because hdfs is probably bad (HBASE-644 is a case
      // where hdfs was bad but passed the hdfs check).
      server.abort("Replay of WAL required. Forcing server shutdown", ex);
      return false;
    } catch (IOException ex) {
      LOG.error("Cache flush failed" + (region != null ? (" for region " +
        Bytes.toStringBinary(region.getRegionInfo().getRegionName())) : ""),
        RemoteExceptionHandler.checkIOException(ex));
      if (!server.checkFileSystem()) {
        return false;
      }
    } finally {
      /** Once the flush is done, release the read lock and wake up any blocked threads */
      lock.readLock().unlock();
      wakeUpIfBlocking();
    }
    return true;
  }
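The methods above cover enqueueing and executing a flush. Next, when is a flush triggered? There are many triggers (any operation that crosses one of the thresholds described earlier, or a manual flush from the hbase shell). Here the focus is on HBase's built-in FlushHandler threads, which wake up periodically.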
  private class FlushHandler extends HasThread {
    private FlushHandler(String name) {
      super(name);
    }
    @Override
    public void run() {
      while (!server.isStopped()) {
        FlushQueueEntry fqe = null;
        try {
          wakeupPending.set(false); // allow someone to wake us up again
          /** Take one entry waiting to be flushed off the queue */
          fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
          /** The entry is either null or a WakeupFlushThread. WakeupFlushThread is a token placed in the queue. Each time it is encountered, the handler checks whether the Memstore limit has been exceeded and, if so, picks an HRegion to flush to bring the Memstore size down. Its second purpose is to wake up the flush thread so that the FlushHandler does not stay asleep */
          if (fqe == null || fqe instanceof WakeupFlushThread) {
            /** If the total Memstore size on this RS is above the low water mark */
            if (isAboveLowWaterMark()) {
              LOG.debug("Flush thread woke up because memory above low water="
                + TraditionalBinaryPrefix.long2String(globalMemStoreLimitLowMark, "", 1));
              /** Flush the Memstore of one HRegion to reduce the total Memstore size */
              if (!flushOneForGlobalPressure()) {
                // Wasn't able to flush any region, but we're above low water mark
                // This is unlikely to happen, but might happen when closing the
                // entire server - another thread is flushing regions. We'll just
                // sleep a little bit to avoid spinning, and then pretend that
                // we flushed one, so anyone blocked will check again
                Thread.sleep(1000);
                wakeUpIfBlocking();
              }
              // Enqueue another one of these tokens so we'll wake up again
              wakeupFlushThread();
            }
            continue;
          }
          FlushRegionEntry fre = (FlushRegionEntry) fqe;
          /** For an ordinary HRegion waiting to be flushed, run flushRegion */
          if (!flushRegion(fre)) {
            break;
          }
        } catch (InterruptedException ex) {
          continue;
        } catch (ConcurrentModificationException ex) {
          continue;
        } catch (Exception ex) {
          LOG.error("Cache flusher failed for entry " + fqe, ex);
          if (!server.checkFileSystem()) {
            break;
          }
        }
      }
      /** The loop has ended (the server is stopping or a flush failed fatally), so clear whatever is left in the queues */
      synchronized (regionsInQueue) {
        regionsInQueue.clear();
        flushQueue.clear();
      }
      // Signal anyone waiting, so they see the close flag
      /** Wake up the waiting threads */
      wakeUpIfBlocking();
      LOG.info(getName() + " exiting");
    }
  }
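One iteration of the handler loop can be sketched in isolation: a timed poll either returns a real entry, returns a wakeup token, or times out with null, and in the last two cases the handler only acts if memory is above the low water mark. HandlerLoopSketch, WAKEUP_TOKEN, runOnce and the returned strings are all hypothetical names for this illustration, and regions are plain strings.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a single iteration of a flush handler loop. */
final class HandlerLoopSketch {
    /** Token that carries no region; it only makes the handler re-check memory. */
    static final Object WAKEUP_TOKEN = new Object();

    /** Polls once and reports, as a string, what the handler would do. */
    static String runOnce(BlockingQueue<Object> queue, long timeoutMs,
                          BooleanSupplier aboveLowWaterMark) {
        Object item;
        try {
            // Wait at most timeoutMs; null means nothing arrived in time.
            item = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        if (item == null || item == WAKEUP_TOKEN) {
            // No concrete request: act only if memory pressure demands it.
            return aboveLowWaterMark.getAsBoolean() ? "flush-for-global-pressure" : "idle";
        }
        return "flush-region:" + item;
    }

    public static void main(String[] args) {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        System.out.println(runOnce(queue, 10, () -> false));
        queue.add(WAKEUP_TOKEN);
        System.out.println(runOnce(queue, 10, () -> true));
        queue.add("region-a");
        System.out.println(runOnce(queue, 10, () -> true));
    }
}
```

The timeout plays the role of threadWakeFrequency: even with an empty queue, the handler gets a regular chance to notice memory pressure.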
The key method to look at is flushOneForGlobalPressure:
  private boolean flushOneForGlobalPressure() {
    /** Get the HRegions on this RS sorted by Memstore size, largest first, as a map from size to region */
    SortedMap<Long, Region> regionsBySize = server.getCopyOfOnlineRegionsSortedBySize();
    /** Set of regions that have already been tried and must be skipped */
    Set<Region> excludedRegions = new HashSet<Region>();
    double secondaryMultiplier
      = ServerRegionReplicaUtil.getRegionReplicaStoreFileRefreshMultiplier(conf);
    boolean flushedOne = false;
    while (!flushedOne) {
      // Find the biggest region that doesn't have too many storefiles
      // (might be null!)
      /** Find the best candidate for flushing: the region with the largest Memstore among those that do not have too many HFiles */
      Region bestFlushableRegion = getBiggestMemstoreRegion(regionsBySize, excludedRegions, true);
      // Find the biggest region, total, even if it might have too many flushes.
      /** Find the HRegion with the largest Memstore, however many HFiles it has */
      Region bestAnyRegion = getBiggestMemstoreRegion(
        regionsBySize, excludedRegions, false);
      // Find the biggest region that is a secondary region
      /** Find the secondary (replica) HRegion with the largest Memstore */
      Region bestRegionReplica = getBiggestMemstoreOfRegionReplica(regionsBySize,
        excludedRegions);
      if (bestAnyRegion == null && bestRegionReplica == null) {
        LOG.error("Above memory mark but there are no flushable regions!");
        return false;
      }
      Region regionToFlush;
      /** If the Memstore of the overall largest HRegion is more than twice the Memstore of the best flushable HRegion (large Memstore, not too many HFiles) */
      if (bestFlushableRegion != null &&
          bestAnyRegion.getMemstoreSize() > 2 * bestFlushableRegion.getMemstoreSize()) {
        // Even if it's not supposed to be flushed, pick a region if it's more than twice
        // as big as the best flushable one - otherwise when we're under pressure we make
        // lots of little flushes and cause lots of compactions, etc, which just makes
        // life worse!
        if (LOG.isDebugEnabled()) {
          LOG.debug("Under global heap pressure: " + "Region "
            + bestAnyRegion.getRegionInfo().getRegionNameAsString()
            + " has too many " + "store files, but is "
            + TraditionalBinaryPrefix.long2String(bestAnyRegion.getMemstoreSize(), "", 1)
            + " vs best flushable region's "
            + TraditionalBinaryPrefix.long2String(bestFlushableRegion.getMemstoreSize(), "", 1)
            + ". Choosing the bigger.");
        }
        /** Choose the HRegion with the largest Memstore, even though it has too many HFiles */
        regionToFlush = bestAnyRegion;
      } else {
        if (bestFlushableRegion == null) {
          regionToFlush = bestAnyRegion;
        } else {
          regionToFlush = bestFlushableRegion;
        }
      }
      Preconditions.checkState(
        (regionToFlush != null && regionToFlush.getMemstoreSize() > 0) ||
        (bestRegionReplica != null && bestRegionReplica.getMemstoreSize() > 0));
      /** If no region was chosen for flushing, or the largest secondary replica's Memstore is more than secondaryMultiplier times the Memstore of the chosen region, refresh the replica's store files to reclaim memory instead of flushing */
      if (regionToFlush == null ||
          (bestRegionReplica != null &&
          ServerRegionReplicaUtil.isRegionReplicaStoreFileRefreshEnabled(conf) &&
          (bestRegionReplica.getMemstoreSize()
              > secondaryMultiplier * regionToFlush.getMemstoreSize()))) {
        LOG.info("Refreshing storefiles of region " + bestRegionReplica +
          " due to global heap pressure. memstore size=" + StringUtils.humanReadableInt(
            server.getRegionServerAccounting().getGlobalMemstoreSize()));
        flushedOne = refreshStoreFilesAndReclaimMemory(bestRegionReplica);
        if (!flushedOne) {
          LOG.info("Excluding secondary region " + bestRegionReplica +
            " - trying to find a different region to refresh files.");
          excludedRegions.add(bestRegionReplica);
        }
      } else {
        LOG.info("Flush of region " + regionToFlush + " due to global heap pressure. "
          + "Total Memstore size="
          + humanReadableInt(server.getRegionServerAccounting().getGlobalMemstoreSize())
          + ", Region memstore size="
          + humanReadableInt(regionToFlush.getMemstoreSize()));
        /** Force a flush of the Memstores of every HStore in this region */
        flushedOne = flushRegion(regionToFlush, true, true);
        if (!flushedOne) {
          LOG.info("Excluding unflushable region " + regionToFlush +
            " - trying to find a different region to flush.");
          excludedRegions.add(regionToFlush);
        }
      }
    }
    return true;
  }
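The core of the selection rule (prefer the largest region that does not have too many store files, unless the overall largest region is more than twice as big) can be sketched on its own. RegionPickSketch and RegionLite are hypothetical types for this illustration; region replicas and the excluded-regions retry loop are deliberately left out.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "which region to flush" rule; replicas are ignored. */
final class RegionPickSketch {
    /** Minimal stand-in for a region: a name, a memstore size, and a store-file flag. */
    static final class RegionLite {
        final String name;
        final long memstoreBytes;
        final boolean tooManyStoreFiles;

        RegionLite(String name, long memstoreBytes, boolean tooManyStoreFiles) {
            this.name = name;
            this.memstoreBytes = memstoreBytes;
            this.tooManyStoreFiles = tooManyStoreFiles;
        }
    }

    /** Returns the region to flush, or null if no region holds any memstore data. */
    static RegionLite pick(List<RegionLite> regions) {
        RegionLite bestAny = null;       // largest memstore overall
        RegionLite bestFlushable = null; // largest memstore without too many store files
        for (RegionLite r : regions) {
            if (r.memstoreBytes <= 0) {
                continue;
            }
            if (bestAny == null || r.memstoreBytes > bestAny.memstoreBytes) {
                bestAny = r;
            }
            if (!r.tooManyStoreFiles
                    && (bestFlushable == null || r.memstoreBytes > bestFlushable.memstoreBytes)) {
                bestFlushable = r;
            }
        }
        if (bestFlushable == null) {
            return bestAny;
        }
        // Prefer the overall largest only when it is more than twice as big.
        if (bestAny.memstoreBytes > 2 * bestFlushable.memstoreBytes) {
            return bestAny;
        }
        return bestFlushable;
    }

    public static void main(String[] args) {
        List<RegionLite> regions = Arrays.asList(
            new RegionLite("a", 300, true),
            new RegionLite("b", 100, false));
        System.out.println("picked: " + pick(regions).name);
    }
}
```

The factor of two trades compaction pressure against memory relief: flushing many small Memstores frees little memory and creates many small files, so a much larger region is worth flushing even when it already has too many store files.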
That completes the walk-through of how an HRegionServer-level Memstore flush is triggered. If anything here is wrong, corrections are welcome.