HBase: The Definitive Guide (translation), Chapter 1: Introduction

Chapter 1. Introduction
Before we start looking into all the moving parts of HBase, let us
pause to think about why there was a need to come up with yet
another storage architecture. Relational database management
systems (RDBMS) have been around since the early 1970s, and have
helped countless companies and organizations to implement their
solution to given problems. And they are equally helpful today. There
are many use-cases for which the relational model makes perfect
sense. Yet there also seem to be specific problems that do not fit this
model very well. [4]

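Chapter 1: Introduction to HBase

Before we examine the moving parts of HBase, let us pause to consider why yet another storage architecture was needed.

Relational database management systems (RDBMS) have been around since the early 1970s and have helped countless companies and organizations implement solutions to their problems.

They remain just as useful today. The relational model fits many use cases perfectly, yet some specific problems do not suit it well.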

The Dawn of Big Data
We live in an era in which we are all connected over the Internet and
expect to find results instantaneously, whether the question concerns
the best turkey recipe or what to buy mom for her birthday. We also
expect the results to be useful and tailored to our needs.
Because of this, companies have become focused on delivering more
targeted information, such as recommendations or online ads, and
their ability to do so directly influences their success as a business.
Systems like Hadoop [5] now enable them to gather and process
petabytes of data, and the need to collect even more data continues
to increase with, for example, the development of new machine
learning algorithms.

Where previously companies had the liberty to ignore certain data
sources because there was no cost-effective way to store all that
information, they now are likely to lose out to the competition. There
is an increasing need to store and analyze every data point they
generate. The results then feed directly back into their e-commerce
platforms and may generate even more data.

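The Dawn of Big Data

We live in an era in which everyone is connected over the Internet and expects to find answers instantly,

whether the question is about the best turkey recipe or what to buy mom for her birthday.

We also expect those answers to be useful and tailored to our needs.

As a result, companies have focused on delivering more targeted information, such as recommendations or online ads, and their ability to do so directly affects their success as a business.

Systems like Hadoop now let them collect and process petabytes of data, and the need to collect even more keeps growing, driven for example by the development of new machine learning algorithms.

Companies were once free to ignore certain data sources because there was no cost-effective way to store all that information; now they are likely to lose out to competitors if they do.

The need to store and analyze every data point they generate keeps growing. The results feed back into their e-commerce platforms and may generate even more data.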

In the past, the only option for retaining collected data was to
prune it, for example to retain only the last N days. While this is a viable
approach in the short term, it forgoes the opportunities that having all the
data, which can span months or years, offers: you can build
mathematical models that span the entire time range, or amend an
algorithm to perform better and rerun it with all the previous data.

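In the past, the only way to keep the collected data manageable was to prune it, for example by retaining only the last N days. This works in the short term, but it gives up the opportunities that months or years of complete data offer.

With the full history, you can build mathematical models that span the entire time range, or improve an algorithm and rerun it against all the previous data.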

Dr. Ralph Kimball, for example, states [6] that
"Data assets are [a] major component of the balance sheet, replacing
traditional physical assets of the 20th century"
and that there is a
"Widespread recognition of the value of data even beyond traditional
enterprise boundaries"
Google and Amazon are prominent examples of companies that
realized the value of data and started developing solutions to fit their needs.

In a series of technical publications Google, for instance,
described a scalable storage and processing system, based on
commodity hardware. These ideas were then implemented outside of
Google as part of the open-source Hadoop project: HDFS and
MapReduce.

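Dr. Ralph Kimball, for example, states [6] that
"Data assets are [a] major component of the balance sheet, replacing traditional physical assets of the 20th century"
and that there is a
"Widespread recognition of the value of data even beyond traditional enterprise boundaries"
Google and Amazon are prominent examples of companies
that recognized the value of data and began developing solutions to fit their needs.

In a series of technical papers, Google described a scalable storage and processing system built on commodity hardware. These ideas were then implemented outside Google as part of the open-source Hadoop project: HDFS and MapReduce.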


Hadoop excels at storing data of arbitrary, semi- or even unstructured
formats, since it lets you decide how to interpret the data at analysis
time, allowing you to change the way you classify the data at any time:
once you have updated the algorithms you simply run the analysis
again.
Hadoop also complements existing database systems of almost any
kind. It offers a limitless pool into which one can sink data and still pull
out what is needed when the time is right. It is optimized for large file
storage and batch oriented, streaming access. This makes analysis
easy and fast, but users also need access to the final data, not in
batch mode but using random access - this is akin to a full table scan
versus using indexes in a database system.

We are used to querying databases when it comes to random access
for structured data. RDBMSs are the most prominent, but there are
also quite a few specialized variations and implementations, like
object-oriented databases. Most RDBMSs strive to implement Codd's
12 rules [7], which force them to comply with very rigid requirements.
The architecture used underneath is well researched and has not
changed significantly in quite some time. The recent advent of
different approaches, like column-oriented or massively parallel
processing (MPP) databases, has shown that we can rethink the
technology to fit specific workloads, but most solutions still implement
all or the majority of Codd's 12 rules in an attempt to not break with
tradition.

Column-Oriented Databases
Column-oriented databases save their data grouped by columns.
Subsequent column values are stored contiguously on disk. This
differs from the usual row-oriented approach of traditional databases,
which store entire rows contiguously - see Figure 1.1,
“Column-oriented storage layouts differs from the row-oriented ones”
for a visualization of the different physical layouts.
The reason to store values on a per column basis instead is based on
the assumption that for specific queries not all of them are needed.

Especially in analytical databases this is often the case and therefore
they are good candidates for this different storage schema.
Reduced IO is one of the primary reasons for this new layout but it
offers additional advantages playing into the same category: since
the values of one column are often very similar in nature or even vary
only slightly between logical rows they are often much better suited
for compression than the heterogeneous values of a row-oriented
record structure: most compression algorithms only look at a finite
window.
Specialized algorithms, for example delta and/or prefix compression,
selected based on the type of the column (i.e. on the data stored) can
yield huge improvements in compression ratios. Better ratios result in
more efficient bandwidth usage in return.
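
As a rough illustration of the reduced-IO argument above, the following sketch lays out the same small table row by row and column by column, then counts how many stored values a single-column aggregate has to pass over in each layout. The table, its column names, and the helper functions are hypothetical and exist only for this example; counting values is a simplified stand-in for disk reads, not a model of any real storage engine.

```python
# Hypothetical three-row table; names and values are made up for illustration.
rows = [
    {"id": 1, "country": "DE", "clicks": 10},
    {"id": 2, "country": "DE", "clicks": 12},
    {"id": 3, "country": "US", "clicks": 11},
]
columns = ["id", "country", "clicks"]


def row_layout(rows, columns):
    """Serialize row by row: all values of one row sit next to each other."""
    return [row[c] for row in rows for c in columns]


def column_layout(rows, columns):
    """Serialize column by column: all values of one column sit together."""
    return {c: [row[c] for row in rows] for c in columns}


def sum_clicks_row_layout(flat, columns):
    """Sum the clicks column from the row layout.

    The wanted values are interleaved with the others, so a sequential
    read passes over the whole buffer (simplified cost model).
    """
    offset = columns.index("clicks")
    width = len(columns)
    total = sum(flat[i] for i in range(offset, len(flat), width))
    return total, len(flat)


def sum_clicks_column_layout(cols):
    """Sum the clicks column from the column layout; only that column is read."""
    values = cols["clicks"]
    return sum(values), len(values)


flat = row_layout(rows, columns)
cols = column_layout(rows, columns)
print(sum_clicks_row_layout(flat, columns))  # (total, values passed over)
print(sum_clicks_column_layout(cols))
```

Both functions compute the same total; the difference is only in how much stored data lies between the values the query actually needs.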

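Hadoop excels at storing data in arbitrary, semi-structured, or even unstructured formats, because it lets you decide how to interpret the data at analysis time. You can change how you classify the data at any point: once you have updated the algorithms, you simply run the analysis again.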
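
The idea of interpreting data only at analysis time can be sketched in a few lines. The example below keeps raw text records untouched and applies two different classification functions to them; the record format, the sample lines, and all function names are assumptions made up for this illustration and have nothing to do with Hadoop's actual APIs.

```python
# Raw records are stored as-is; this line format is a made-up example.
raw_events = [
    "2011-01-03 alice view /recipes/turkey",
    "2011-01-03 bob buy /gifts/flowers",
    "2011-01-04 alice buy /recipes/turkey",
]


def classify_by_action(line):
    """First interpretation: group events by the action field."""
    _date, _user, action, _path = line.split(" ")
    return action


def classify_by_section(line):
    """Later interpretation: group the same events by site section."""
    _date, _user, _action, path = line.split(" ")
    return path.split("/")[1]


def analyze(lines, classify):
    """Count records per label; the classifier is chosen at analysis time."""
    counts = {}
    for line in lines:
        label = classify(line)
        counts[label] = counts.get(label, 0) + 1
    return counts


print(analyze(raw_events, classify_by_action))
print(analyze(raw_events, classify_by_section))
```

Because the stored records never change, switching from one classification to the other only requires running the analysis again with a different function.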

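Hadoop also complements almost every kind of existing database system. It offers a limitless pool into which you can sink data and still pull out what you need when the time is right. It is optimized for large-file storage and batch-oriented, streaming access. This makes analysis easy and fast, but users also need access to the final data through random access rather than in batch mode: the difference is akin to a full table scan versus using indexes in a database system.

We are used to querying databases when we need random access to structured data. RDBMSs are the most prominent, but there are also specialized variations and implementations, such as object-oriented databases. Most RDBMSs strive to implement Codd's 12 rules, which force them to comply with very rigid requirements. The underlying architecture is well researched and has not changed significantly in a long time. The recent arrival of different approaches, such as column-oriented and massively parallel processing (MPP) databases, shows that the technology can be rethought to fit specific workloads, but most solutions still implement all or most of Codd's 12 rules so as not to break with tradition.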

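Column-Oriented Databases

Column-oriented databases store their data grouped by column, with subsequent values of a column stored contiguously on disk. This differs from the usual row-oriented approach of traditional databases; Figure 1.1 visualizes the difference between the two layouts. Storing values per column rests on the assumption that specific queries do not need all of the values in a row.

This is often the case in analytical databases, which makes them good candidates for this storage schema. Reduced I/O is one of the main reasons for the layout, and it brings a related advantage: the values of one column are often very similar, or vary only slightly between logical rows, so they compress much better than the heterogeneous values of a row-oriented record structure.

Specialized algorithms such as delta or prefix compression, selected according to the column's data type, can improve compression ratios considerably. Better ratios in turn mean more efficient use of bandwidth.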
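
To make the two compression techniques named above concrete, here is a minimal sketch of delta encoding for a numeric column and prefix encoding for a sorted string column. The function names and sample values are hypothetical; real columnar stores use more elaborate, binary encodings, so treat this only as an illustration of the principle.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference
    from its predecessor. Similar neighbours yield small numbers."""
    deltas = list(values[:1])
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    running = 0
    for d in deltas:
        running += d
        values.append(running)
    return values


def prefix_encode(keys):
    """Store each key as (length of prefix shared with the previous key,
    remaining suffix). Works best on sorted keys."""
    encoded = []
    prev = ""
    for key in keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild each key from the previous key's prefix plus the suffix."""
    keys = []
    prev = ""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


timestamps = [1700000000, 1700000003, 1700000004, 1700000010]
print(delta_encode(timestamps))   # [1700000000, 3, 1, 6]

user_keys = ["user0001", "user0002", "user0010"]
print(prefix_encode(user_keys))   # [(0, 'user0001'), (7, '2'), (6, '10')]
```

In both cases the encoded form replaces long, nearly identical values with short differences, which is exactly the property that a column of similar values provides and a row of mixed values does not.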

Note though that HBase is not a column-oriented database in the
typical RDBMS sense, but utilizes an on-disk column storage format.
This is also where the majority of similarities end, because although
HBase stores data on disk in a column-oriented format, it is distinctly
different from traditional columnar databases: whereas columnar
databases excel at providing real-time analytical access to data, HBase
excels at providing key-based access to a specific cell of data, or a
sequential range of cells.

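Note, though, that HBase is not a column-oriented database in the typical RDBMS sense, although it uses an on-disk column storage format. That is also where most of the similarities end: HBase stores data on disk in a column-oriented format, but it is distinctly different from traditional columnar databases. Columnar databases excel at real-time analytical access to data, whereas HBase excels at key-based access to a specific cell of data or a sequential range of cells.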
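
The two access patterns mentioned here, looking up one cell by key and reading a sequential range of cells, can be sketched with a toy in-memory store that keeps its keys sorted. The class name `SortedCellStore`, its methods, and the sample keys are all hypothetical; this is not the HBase client API, only an assumption-laden model of key-ordered storage built on Python's standard `bisect` module.

```python
import bisect


class SortedCellStore:
    """Toy store mapping (row key, column) -> value, kept sorted by key.

    Hypothetical illustration only; it does not mirror HBase's real API.
    """

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = []  # values aligned with self._keys

    def put(self, row, column, value):
        key = (row, column)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite existing cell
        else:
            self._keys.insert(i, key)        # insert at the sorted position
            self._values.insert(i, value)

    def get(self, row, column):
        """Key-based access to one cell via binary search."""
        key = (row, column)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row in key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            yield self._keys[i], self._values[i]
            i += 1


store = SortedCellStore()
store.put("row3", "cf:a", "3")
store.put("row1", "cf:a", "1")
store.put("row2", "cf:a", "2")
print(store.get("row2", "cf:a"))
print(list(store.scan("row2", "row4")))
```

Because the keys are sorted, a single lookup is a binary search and a range read is one search followed by a sequential walk, with no need to touch cells outside the requested range.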



