HBase: The Definitive Guide (translation): Chapter 1, Introduction

Chapter 1. Introduction
Before we start looking into all the moving parts of HBase, let us
pause to think about why there was a need to come up with yet
another storage architecture. Relational database management
systems (RDBMS) have been around since the early 1970s, and have
helped countless companies and organizations to implement their
solution to given problems. And they are equally helpful today. There
are many use-cases for which the relational model makes perfect
sense. Yet there also seem to be specific problems that do not fit this
model very well. [4]

Chapter 1: Introduction to HBase




The Dawn of Big Data
We live in an era in which we are all connected over the Internet and
expect to find results instantaneously, whether the question concerns
the best turkey recipe or what to buy mom for her birthday. We also
expect the results to be useful and tailored to our needs.
Because of this, companies have become focused on delivering more
targeted information, such as recommendations or online ads, and
their ability to do so directly influences their success as a business.
Systems like Hadoop [5] now enable them to gather and process
petabytes of data, and the need to collect even more data continues
to increase with, for example, the development of new machine
learning algorithms.

Where previously companies had the liberty to ignore certain data
sources because there was no cost-effective way to store all that
information, they now are likely to lose out to the competition. There
is an increasing need to store and analyze every data point they
generate. The results then feed directly back into their e-commerce
platforms and may generate even more data.









In the past, the only way to cope with all the collected data was to
prune it, for example by retaining only the last N days. While this is a
viable approach in the short term, it forgoes the opportunities that
having all the data, which can span months or years, offers: you can build
mathematical models that span the entire time range, or amend an
algorithm to perform better and rerun it with all the previous data.



Dr. Ralph Kimball, for example, states [6] that
"Data assets are [a] major component of the balance sheet, replacing
traditional physical assets of the 20th century"
and that there is a
"Widespread recognition of the value of data even beyond traditional
enterprise boundaries"
Google and Amazon are prominent examples of companies that
realized the value of data and started developing solutions to fit their needs.

In a series of technical publications Google, for instance,
described a scalable storage and processing system, based on
commodity hardware. These ideas were then implemented outside of
Google as part of the open-source Hadoop project: HDFS and MapReduce.


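In a series of technical papers, Google described a system for storing and processing data on ordinary, inexpensive hardware. These ideas were implemented outside of Google by the open-source Hadoop project (HDFS and MapReduce).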

Hadoop excels at storing data of arbitrary, semi- or even unstructured
formats, since it lets you decide how to interpret the data at analysis
time, allowing you to change the way you classify the data at any time:
once you have updated the algorithms you simply run the analysis again.

Hadoop also complements existing database systems of almost any
kind. It offers a limitless pool into which one can sink data and still pull
out what is needed when the time is right. It is optimized for large file
storage and batch-oriented, streaming access. This makes analysis
easy and fast, but users also need access to the final data, not in
batch mode but using random access. The difference is akin to a full table
scan versus using indexes in a database system.
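
The contrast between a full scan and an indexed lookup can be sketched in a few lines of Python. This is only an illustration of the access patterns, not of Hadoop or any database engine; the record fields `user_id` and `clicks` are made up for the example.

```python
# Hypothetical records; the field names are illustrative only.
records = [{"user_id": i, "clicks": i % 7} for i in range(100_000)]


def find_by_scan(rows, user_id):
    """Batch-style access: walk the records in order until one matches."""
    for row in rows:
        if row["user_id"] == user_id:
            return row
    return None


# Random access needs an extra structure, built once up front:
# a mapping from the lookup key to the record's position.
index = {row["user_id"]: pos for pos, row in enumerate(records)}


def find_by_index(rows, idx, user_id):
    """Indexed access: jump straight to the record via the index."""
    pos = idx.get(user_id)
    return rows[pos] if pos is not None else None


if __name__ == "__main__":
    print(find_by_scan(records, 99_999))
    print(find_by_index(records, index, 99_999))
```

Both functions return the same record; the scan may have to touch every record to find it, while the indexed lookup touches one, at the cost of building and maintaining the index.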

We are used to querying databases when it comes to random access
for structured data. RDBMSs are the most prominent, but there are
also quite a few specialized variations and implementations, like
object-oriented databases. Most RDBMSs strive to implement Codd's
12 rules [7], which force them to comply with very rigid requirements.
The architecture used underneath is well researched and has not
changed significantly in quite some time. The recent advent of
different approaches, like column-oriented or massively parallel
processing (MPP) databases, has shown that we can rethink the
technology to fit specific workloads, but most solutions still implement
all or the majority of Codd's 12 rules in an attempt to not break with

Column-Oriented Databases
Column-oriented databases save their data grouped by columns.
Subsequent column values are stored contiguously on disk. This
differs from the usual row-oriented approach of traditional databases,
which store entire rows contiguously; see Figure 1.1,
“Column-oriented storage layouts differs from the row-oriented ones”
for a visualization of the different physical layouts.
The reason to store values on a per-column basis instead is based on
the assumption that for specific queries not all of the values are needed.
Especially in analytical databases this is often the case, and therefore
they are good candidates for this different storage schema.
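
To make the two layouts concrete, here is a small Python sketch that flattens the same table both ways and then reads a single column from each. The table contents and the helper names are invented for illustration; real storage engines work on disk blocks, not Python lists.

```python
# A tiny hypothetical table: (id, name, age).
rows = [
    (1, "alice", 34),
    (2, "bob", 45),
    (3, "carol", 29),
]

# Row-oriented: all values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: all values of one column sit next to each other.
column_layout = [value for column in zip(*rows) for value in column]


def column_from_row_layout(flat, n_cols, col):
    """Pick every n_cols-th value; the reads are spread over the whole buffer."""
    return flat[col::n_cols]


def column_from_column_layout(flat, n_rows, col):
    """Read one contiguous run; the rest of the buffer is never touched."""
    return flat[col * n_rows:(col + 1) * n_rows]


if __name__ == "__main__":
    print(row_layout)
    print(column_layout)
    print(column_from_row_layout(row_layout, n_cols=3, col=2))
    print(column_from_column_layout(column_layout, n_rows=3, col=2))
```

A query that only needs the ages reads one contiguous slice in the column layout, whereas in the row layout the wanted values are interleaved with names and ids that the query does not need.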
Reduced I/O is one of the primary reasons for this new layout, but it
offers additional advantages playing into the same category: since
the values of one column are often very similar in nature, or even vary
only slightly between logical rows, they are often much better suited
for compression than the heterogeneous values of a row-oriented
record structure; most compression algorithms only look at a finite
window of data.
Specialized algorithms, for example delta and/or prefix compression,
selected based on the type of the column (i.e. on the data stored), can
yield huge improvements in compression ratios. Better ratios in turn
result in more efficient bandwidth usage.
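
As a rough illustration of the two techniques named above, the following Python sketch implements a simple delta encoding for numeric columns and a simple prefix encoding for sorted string columns. It is a minimal sketch of the general ideas, assuming in-memory lists; it does not reproduce the encoders of any particular database.

```python
import os


def delta_encode(values):
    """Keep the first value, then store each value as a difference to its predecessor."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out = []
    total = 0
    for i, delta in enumerate(deltas):
        total = delta if i == 0 else total + delta
        out.append(total)
    return out


def prefix_encode(strings):
    """Store (length of prefix shared with the previous string, remaining suffix)."""
    out = []
    prev = ""
    for s in strings:
        shared = len(os.path.commonprefix([prev, s]))
        out.append((shared, s[shared:]))
        prev = s
    return out


def prefix_decode(pairs):
    """Rebuild each string from the previous one plus the stored suffix."""
    out = []
    prev = ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out


if __name__ == "__main__":
    timestamps = [1000, 1003, 1004, 1010]
    keys = ["user-0001", "user-0002", "user-0017"]
    print(delta_encode(timestamps))
    print(prefix_encode(keys))
```

Both encodings pay off only when neighbouring values resemble each other, which is exactly the situation inside a single column: small deltas need fewer bits than full values, and short suffixes need fewer bytes than full strings.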


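Hadoop also complements almost every other kind of database system. It offers a limitless pool into which you can store data and from which you can retrieve it when the time is right. Hadoop is suited to large file storage and batch-oriented, streaming access, which makes data analysis simple and fast. Users also need access to the final data, though, not in batch mode but through random access: the difference is like that between a full table scan and using an index in a database system.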



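Column-oriented databases store data grouped by column, keeping the values of each column contiguous on disk. This differs from traditional row-oriented databases; Figure 1.1 shows the difference between column-oriented and row-oriented layouts. Column-based storage rests on the assumption that specific queries do not need all of the values.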



Note though that HBase is not a column-oriented database in the
typical RDBMS sense, but utilizes an on-disk column storage format.
This is also where the majority of similarities end, because although
HBase stores data on disk in a column-oriented format, it is distinctly
different from traditional columnar databases: whereas columnar
databases excel at providing real-time analytical access to data, HBase
excels at providing key-based access to a specific cell of data, or a
sequential range of cells.
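
The two access patterns mentioned here, fetching one cell by key and reading a sequential range of cells, can be sketched with a toy in-memory store that keeps its keys sorted. This is not the HBase client API; the class `SortedCellStore`, its methods, and the row and column names are all hypothetical and exist only to show the idea.

```python
import bisect


class SortedCellStore:
    """Toy stand-in for a store whose cells are ordered by (row, column) key."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, row, column):
        """Key-based access to one specific cell."""
        return self._values.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


if __name__ == "__main__":
    store = SortedCellStore()
    store.put("row1", "cf:a", "1")
    store.put("row3", "cf:a", "3")
    store.put("row2", "cf:a", "2")
    store.put("row2", "cf:b", "x")
    print(store.get("row2", "cf:b"))
    print(list(store.scan("row2", "row3")))
```

Because the keys are kept in sorted order, a range read is two binary searches followed by a contiguous slice, which is why key lookups and sequential ranges are cheap while ad hoc analytical queries over arbitrary columns are not.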

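Note, though, that HBase is not a column-oriented database in the typical RDBMS sense, but it does use an on-disk column storage format. This is also where most of the similarities end: although HBase stores data by column, it differs from traditional columnar databases. Those excel at real-time analytical access to data, whereas HBase excels at key-based retrieval of a specific cell or a range of cells.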
