What Is New About NewSQL?

Sources: https://softwareengineeringdaily.com/2019/02/24/what-is-new-about-newsql/

https://cloud.tencent.com/developer/article/1445846

 

By Gokhan Simsek

 Article Sunday, February 24 2019

 

Most programmers are familiar with SQL and with relational database management systems (RDBMSs) like MySQL or PostgreSQL. The basic principles of such architectures have been around for decades. In the 2000s came NoSQL solutions, like MongoDB or Cassandra, developed for distributed, scalable data needs.

But, for the past few years, there has been a new kid on the block: NewSQL.

NewSQL is a new approach to relational databases that wants to combine transactional ACID (atomicity, consistency, isolation, durability) guarantees of good ol’ RDBMSs and the horizontal scalability of NoSQL. It sounds like a perfect solution, the best of both worlds. What took it so long to arrive?

 

Databases were born out of a need to separate code from data in the mid-1960s. These first databases were designed with several considerations:

  1. The number of users querying the database is limited.
  2. The types of queries are unlimited – the developer can use any query they want.
  3. Hardware is quite expensive.

In those days of developers entering interactive queries to a terminal, as the only users with access to the database, these considerations were relevant and valuable. Correctness and consistency were the two important metrics, rather than today’s metrics of performance and availability. Vertical scaling was the solution to growing data needs, and downtime needed for the data to be moved in case of database migration or recovery was bearable.

Fast-forward a couple of decades, and the requirements placed on databases in the Internet and cloud era are very different. The scale of data is enormous, and commodity hardware is much cheaper than it was in the 20th century.

As the scale of data grew and real-time interactions through the Internet became widespread, the basic needs from databases split into two main categories: OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing).

OLAP databases are commonly known as data warehouses. They store a historical footprint for statistical analysis purposes in business intelligence operations. OLAP databases are thus focused on read-only workloads with ad-hoc queries for batch processing. The number of users querying the database is considerably low, as usually, only the employees of a company have access to the historical information.

OLTP databases correspond to highly concurrent, transactional data processing, characterized by short-lived, pre-defined queries issued by real-time users. Searching for and buying items on an e-commerce website are basic examples of transactional processing. While these users access a smaller subset of the data than OLAP users do, the number of users is considerably higher and the queries can include both read and write operations. The important considerations in OLTP databases are therefore high availability, concurrency, and performance.

For most websites, at any given time, there are hundreds or thousands of users effectively querying the database concurrently. With this scale in mind, the system needs to be highly available, as every minute of downtime can cost the bigger companies thousands or even millions of dollars.

On websites, the queries made by the users are pre-defined; the users do not have access to the terminal of the database to execute any query that they’d like. The queries are buried in the application logic. This allows for optimizations towards high performance.

In the new database ecosystem, where scalability is an important metric and high availability is essential for making profits, NoSQL databases were offered as a solution for achieving easier scalability and better performance, opting for an AP design from the CAP theorem. However, this meant giving up the strong consistency and transactional ACID properties offered by RDBMSs in favor of eventual consistency in most NoSQL designs.

 

NoSQL databases use a different model than the relational, such as key-value, document, wide-column, or graph. With these models, NoSQL databases are not normalized, and are inherently schemaless by design. Most NoSQL databases support auto-sharding, allowing for easy horizontal scaling without developer intervention.
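The idea behind auto-sharding can be illustrated with a small hash-partitioning sketch. This is a minimal illustration, not how any particular NoSQL database does it; the function names `shard_for` and `distribute` are hypothetical, and real systems typically use more elaborate schemes so that adding shards moves less data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would own them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    placement = distribute([f"user:{i}" for i in range(10)], num_shards=3)
    for shard, keys in placement.items():
        print(shard, keys)
```

Because the shard is derived from the key alone, the application never has to decide where a record lives, which is what "without developer intervention" amounts to in this sketch.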

NoSQL can be useful for applications such as social media, where eventual consistency is acceptable – users do not notice if they see a non-consistent view of the database, and since the data involves status updates, tweets, etc. strong consistency is not essential. However, NoSQL databases are not easy to use for systems where consistency is critical, such as e-commerce platforms.

NewSQL systems are born out of the desire to combine the scalability and high availability of NoSQL with the relational model, transaction support, and SQL of traditional RDBMSs. The era of one-size-fits-all solutions is ending, and specialized databases for particular workloads such as OLTP are on the rise. Most NewSQL databases are born out of a complete redesign focused heavily on OLTP or hybrid workloads.

Traditional RDBMS architecture was not designed with a distributed system in mind. Rather, when the need arose, support for distributed designs was built as an afterthought on top of the original design. Because of their normalized structure, as opposed to the aggregated form of NoSQL, RDBMSs had to introduce complicated concepts to scale out while preserving their consistency requirements. Manual sharding and master-slave architectures were developed to allow horizontal scaling.

However, an RDBMS loses much of its performance when scaling out: joins become more costly because data must move between nodes for aggregation, and maintenance overhead becomes time-consuming. Complex systems and products were developed to preserve performance, but even today traditional RDBMSs are not regarded as inherently scalable.

NewSQL databases are built for the cloud era, with a distributed architecture in mind from the start.

 

What are the different characteristics observed in NewSQL solutions?

Consistency:

Most NewSQL databases favor consistency over availability (CP in CAP terms), offering strong consistency at the cost of some availability. They achieve this consistency with consensus protocols such as Paxos or Raft, applied at either the global system level or the local partition level. Some solutions, such as MemSQL, also let users tune the trade-off between consistency and availability, allowing different configurations for different use cases.
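Both Paxos and Raft rest on majority agreement: a write counts as committed only once more than half of the replicas have accepted it. The sketch below shows only that arithmetic, not either protocol; the function names are hypothetical. It also suggests why availability suffers: a minority side of a network partition can never reach the quorum.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def is_committed(acks: int, n_replicas: int) -> bool:
    """A write is committed once a majority of replicas acknowledged it."""
    return acks >= majority(n_replicas)


def tolerated_failures(n_replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return n_replicas - majority(n_replicas)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, "replicas: quorum", majority(n), "tolerates", tolerated_failures(n))
```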

Main Memory:

Traditional RDBMSs rely on secondary storage, or disk, as the medium for storing data, most commonly SSDs or HDDs. OLTP workloads do not require as much data, since historical data can be archived in data warehouses and only the more current information is needed, so a couple of NewSQL solutions use main memory (RAM) as storage. Memory access is significantly faster than disk access: almost 100 times faster than SSD and 10,000 times faster than HDD.

In-memory solutions offer the added performance boosts of eliminating or simplifying heavy concurrency systems and especially buffer managers.

Since all the data (or most of it) is already in the main memory, buffer managers become obsolete. As for concurrency, different solutions exist in different implementations, e.g. serialization.

What about persistence? RAM is, by nature, volatile: when power is lost, data that needs to persist can be lost. In-memory databases mitigate this in different ways, usually by combining infrequent backups to disk, logging to preserve state and enable recovery, or non-volatile RAM for critical data.
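The combination of infrequent backups and logging can be sketched as follows. This is a toy model under stated assumptions: the class `InMemoryStore` is hypothetical, and the Python list and dict named `log` and `snapshot` merely stand in for files that a real system would write to disk.

```python
class InMemoryStore:
    """Toy in-memory key-value store with snapshot-plus-log recovery."""

    def __init__(self):
        self.data = {}       # volatile state (lost on power failure)
        self.snapshot = {}   # stand-in for an infrequent backup on disk
        self.log = []        # stand-in for an append-only log on disk

    def put(self, key, value):
        self.log.append((key, value))  # record the change first
        self.data[key] = value         # then apply it in memory

    def take_snapshot(self):
        self.snapshot = dict(self.data)
        self.log.clear()               # entries before the snapshot are no longer needed

    def recover(self):
        """Rebuild memory from the last snapshot, then replay the log."""
        self.data = dict(self.snapshot)
        for key, value in self.log:
            self.data[key] = value


if __name__ == "__main__":
    store = InMemoryStore()
    store.put("a", 1)
    store.take_snapshot()
    store.put("b", 2)
    store.data = {}      # simulate losing main memory
    store.recover()
    print(store.data)
```

The snapshot bounds how much log has to be replayed, while the log covers everything written since the last snapshot.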

The two main examples of in-memory NewSQL solutions are VoltDB and MemSQL.

 

VoltDB

VoltDB is an in-memory, ACID-compliant relational database. VoltDB’s architecture is based on H-Store, an in-memory database designed for OLTP workloads by Michael Stonebraker et al.

VoltDB is focused on fast data and is built to serve the specific applications where large streams of data must be processed quickly, such as trading applications, online gaming, IoT sensors, and more. Fitting with the OLTP principles, VoltDB is designed from scratch to be performant.

With the conscious decision of having only stored procedures and moving them closer to the data, VoltDB can execute serialized transactions. The procedures are broken up into atomic transactions, and these transactions, in turn, are serialized and performed from a queue. This serialized transaction scheme gets rid of the overhead for managing concurrency, improving performance. While VoltDB also supports ad-hoc queries, these stored procedures are the ones that benefit from performance optimizations. This fits well with the OLTP workloads, as the end-user cannot execute ad-hoc queries.
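The serialized scheme described above can be pictured as a single worker draining a queue of procedure calls against one partition of data. The sketch below is an illustration of that idea only, not VoltDB's API; `Partition`, `deposit`, and `withdraw` are hypothetical names.

```python
from collections import deque


class Partition:
    """Toy partition: one queue, one executor, no locks."""

    def __init__(self):
        self.data = {}
        self.queue = deque()

    def submit(self, procedure, *args):
        self.queue.append((procedure, args))

    def run(self):
        """Execute queued procedures strictly one at a time, in order."""
        results = []
        while self.queue:
            procedure, args = self.queue.popleft()
            results.append(procedure(self.data, *args))
        return results


def deposit(data, account, amount):
    data[account] = data.get(account, 0) + amount
    return data[account]


def withdraw(data, account, amount):
    if data.get(account, 0) < amount:
        return None  # rejected: insufficient funds, state left unchanged
    data[account] -= amount
    return data[account]


if __name__ == "__main__":
    p = Partition()
    p.submit(deposit, "alice", 100)
    p.submit(withdraw, "alice", 30)
    p.submit(withdraw, "alice", 100)
    print(p.run())
```

Because only one procedure touches the partition at a time, no two transactions can interleave, so there is nothing for a lock manager to arbitrate.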

For in-memory databases, an important question, and one of the requirements for ACID principles is durability. VoltDB achieves durability through various techniques, including snapshots, command logging, K-safety, and database replication. With these approaches, VoltDB ensures redundancy and allows for durable data.

If you want more information on VoltDB and its architecture, you can check our past shows with John Hugg and with Ryan Betts.

 

HTAP

As I pointed out before, most NewSQL databases are designed from scratch. With the possibilities such an endeavor brings, some projects wanted to bring a unified database, where transactional and analytical workloads can be handled. The term Hybrid Transactional/Analytical Processing, or HTAP, was coined by Gartner. HTAP capabilities in a database enable advanced real-time analytics and can lead to real-time business decisions and intelligent transactional processing. While VoltDB also offers HTAP capabilities, it focuses more on transactional workloads. Other notable HTAP databases include TiDB and Google’s Spanner.

 

TiDB

An open-source solution from China, TiDB is a strongly consistent, distributed, scalable, MySQL-compatible HTAP database. TiDB has a layered architecture: the TiDB server sits on top as a stateless computing layer. The underlying storage layer is TiKV, a transactional key-value database inspired by Google’s Spanner.

 

The TiDB layer listens for SQL queries, parses them, and creates an execution plan. The query is then split into parts where appropriate and sent to the corresponding TiKV stores. Since the TiDB layer is stateless, it is easy to scale.

TiKV is the underlying storage layer, a key-value database using RocksDB for physical storage. TiKV organizes data by regions: these regions are stored and replicated. To achieve the durability and high availability with this replication scheme, TiKV utilizes the Raft consensus algorithm for strong consistency. The distributed nature of TiKV allows for distributed queries.
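One way to picture data organized by regions is as contiguous key ranges, each owned by a store, with a stateless router deciding which stores a lookup or scan must touch. The sketch below assumes range-based regions for illustration and is not TiKV's implementation; `RegionMap`, the split keys, and the store names are all hypothetical.

```python
import bisect


class RegionMap:
    """Toy routing table: sorted split keys divide the key space into regions."""

    def __init__(self, split_keys, stores):
        # With k split keys there are k + 1 regions, one store per region here.
        assert len(stores) == len(split_keys) + 1
        self.split_keys = list(split_keys)
        self.stores = list(stores)

    def store_for(self, key):
        """Return the store whose region contains this key."""
        return self.stores[bisect.bisect_right(self.split_keys, key)]

    def stores_for_range(self, start, end):
        """Return every store a scan over [start, end] has to contact."""
        lo = bisect.bisect_right(self.split_keys, start)
        hi = bisect.bisect_right(self.split_keys, end)
        return self.stores[lo:hi + 1]


if __name__ == "__main__":
    regions = RegionMap(["g", "p"], ["store-1", "store-2", "store-3"])
    print(regions.store_for("apple"))
    print(regions.stores_for_range("e", "h"))
```

A point lookup goes to a single store, while a range scan is split across every store whose region overlaps the range, which is the "split into parts" step a compute layer would perform.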

What enables TiDB to be powerful in both OLTP and OLAP situations is its decoupled architecture: the computation layer is separate from the storage layer. While TiDB can handle both OLTP and simple OLAP workloads, TiSpark is an OLAP solution that runs Spark SQL directly on TiKV and can be added easily to the TiDB/TiKV architecture. TiDB on its own, through its cost optimizer and distributed executor, can handle 80% of ad-hoc OLAP queries.

TiSpark is optimized for complex OLAP queries. Just like TiDB, TiSpark is a stateless compute layer that communicates with TiKV. However, it is designed to handle complex OLAP queries and communicates using Spark SQL.

So, deploying both TiDB and TiSpark results in eliminating ETL costs and allowing for a unified solution for both analytical and transactional needs.

Check out our recent episode on TiDB with Kevin Xu for more information about TiDB and its architecture; our episode on RocksDB with Dhruba Borthakur and Igor Canadi, for more information about the physical data store RocksDB that powers TiKV and TiDB, and our article on Chinese open source projects for more information about TiKV.

 

Cosmos DB

Azure Cosmos DB from Microsoft is a highly flexible solution, and through numerous tuneable features that can be tweaked to fit various use cases, it can be considered as a NewSQL database.

Cosmos DB is a globally distributed, multi-model database service. As a multi-model service, it supports key-value, column-family, document, and graph databases as the underlying storage models. The data can be exposed through both SQL and NoSQL APIs.

With global distribution, Cosmos DB holds replicas of the data in several data centers around the world, ensuring reliability and high availability. The developer can create replicas and horizontally scale their data with a few simple API calls.

Cosmos DB is designed to alleviate the costs of database management. The developers don’t need to deal with index or schema management, as Cosmos DB handles indexing automatically to ensure performance.

Through several consistency levels, Cosmos DB lets developers decide which trade-offs they want to make, with appropriate SLAs. Instead of only the two extremes of strong consistency and eventual consistency, there are five well-defined consistency levels along the spectrum. Each consistency level comes with a separate SLA, ensuring certain levels of availability and performance.
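The five levels Cosmos DB documents are strong, bounded staleness, session, consistent prefix, and eventual. A small sketch can make the "spectrum" idea concrete by ordering them from weakest to strongest. The `Consistency` enum and `satisfies` helper below are hypothetical illustrations and are not part of any Cosmos DB SDK.

```python
from enum import IntEnum


class Consistency(IntEnum):
    """Consistency levels ordered from weakest (0) to strongest (4)."""

    EVENTUAL = 0
    CONSISTENT_PREFIX = 1
    SESSION = 2
    BOUNDED_STALENESS = 3
    STRONG = 4


def satisfies(offered: Consistency, required: Consistency) -> bool:
    """True if the offered level is at least as strong as the required one."""
    return offered >= required


if __name__ == "__main__":
    print(satisfies(Consistency.SESSION, Consistency.EVENTUAL))
    print(satisfies(Consistency.EVENTUAL, Consistency.STRONG))
```

Moving up the ordering buys stronger guarantees, generally at the price of latency or availability; moving down does the opposite.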

 

Being the product of a tech and cloud giant, Cosmos DB is simple for developers to use, and gives comprehensive guarantees for performance, availability, and consistency.

 

Augmenting RDBMS

NewSQL can also come in the form of augmenting existing RDBMSs to give them the ability to scale out. Rather than a completely redesigned database, these solutions are implemented on top of an already battle-tested SQL database to enhance its capabilities. This idea is useful for large enterprises that have an established system and are not willing to migrate to a new database solution.

 

Citus

A successful example that builds upon PostgreSQL is Citus.

Citus Data, recently acquired by Microsoft, develops and maintains Citus: an open-source PostgreSQL extension that allows for a distributed PostgreSQL by transparently distributing tables and queries to support horizontal scaling.

In a cluster managed by Citus, the tables are distributed: they are horizontally partitioned across different worker nodes and appear as normal SQL tables. The coordinator, which keeps table metadata to oversee the worker PostgreSQL nodes, handles query processing and parallelizes queries across the appropriate table partitions.
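The coordinator's role can be sketched as a routing table plus a scatter-gather step. This is a simplified model in Python rather than Citus itself; the `Coordinator` class, its methods, and the worker names are hypothetical, and in-process lists stand in for the shard tables on worker nodes.

```python
import zlib


class Coordinator:
    """Toy coordinator: keeps shard metadata and fans queries out to shards."""

    def __init__(self, workers, num_shards, dist_column):
        self.dist_column = dist_column
        self.num_shards = num_shards
        # Metadata: which worker holds which shard (round-robin placement).
        self.placement = {s: workers[s % len(workers)] for s in range(num_shards)}
        # Stand-in for the shard tables that would live on the workers.
        self.shards = {s: [] for s in range(num_shards)}

    def _shard_of(self, value):
        # Stable hash of the distribution column value.
        return zlib.crc32(str(value).encode("utf-8")) % self.num_shards

    def insert(self, row):
        self.shards[self._shard_of(row[self.dist_column])].append(row)

    def lookup(self, value):
        # A filter on the distribution column touches exactly one shard.
        shard = self.shards[self._shard_of(value)]
        return [r for r in shard if r[self.dist_column] == value]

    def count(self, predicate):
        # Scatter: each shard computes a partial count. Gather: sum them.
        partials = [sum(1 for r in rows if predicate(r)) for rows in self.shards.values()]
        return sum(partials)


if __name__ == "__main__":
    c = Coordinator(["worker-1", "worker-2"], num_shards=4, dist_column="user_id")
    for i in range(10):
        c.insert({"user_id": i, "amount": i * 10})
    print(c.lookup(3), c.count(lambda r: r["amount"] >= 50))
```

The caller sees one logical table; whether a query hits one shard or all of them is decided by the coordinator from its metadata.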

 

By adding features such as query routing, distributed tables and distributed transactions, and stored procedures, Citus takes care of the numerous low-level details to present a horizontally scalable, performant PostgreSQL.

Check out our episodes on Scaling PostgreSQL with Ozgun Erdogan and Postgres Sharding with Marco Slot for more information about Citus.

 

Vitess

While Citus builds upon PostgreSQL, Vitess is built to enhance MySQL, and make it fit to the current requirements of the cloud age.

Vitess was first built at YouTube in 2011 for its scaling needs. With a growing user base and growing data, horizontal scaling and sharding became necessary, and Vitess was created to handle this scaling transparently. It has been open-sourced and is now hosted under the CNCF. Having received this stamp of approval as a cloud-native technology, Vitess provides several improvements to MySQL.

The first improvement is the introduction of various sharding schemes. Users can create their own sharding schemes, and Vitess is responsible for organizing the shards and the data accordingly. Vitess allows for automatic sharding without requiring manual application code, and enables live (re)sharding with minimal read-only downtime.

Sharding is done through Vindexes and keyspaces. A Primary Vindex is similar to a primary index in the indexing schemes of databases. Users can specify the attribute they want as the Primary Vindex and how many shards the data is split into based on this Vindex. After the database is sharded, queries based on keyspaces are directed to the appropriate shards.
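The mapping from a column value to a shard can be sketched in two steps: derive an ID from the value, then find the shard whose range covers that ID. The code below is an illustration under assumptions, not Vitess code: the one-byte ID, the SHA-256 hash, and the two range-named shards in `SHARDS` are all chosen here for the example.

```python
import hashlib

# Hypothetical two-shard layout: each shard owns a half-open range of one-byte IDs.
SHARDS = [
    ("-80", 0x00, 0x80),   # IDs 0x00 .. 0x7f
    ("80-", 0x80, 0x100),  # IDs 0x80 .. 0xff
]


def keyspace_id(value) -> int:
    """Derive a one-byte ID from the sharding column value (illustrative hash)."""
    return hashlib.sha256(str(value).encode("utf-8")).digest()[0]


def shard_for(value) -> str:
    """Return the name of the shard whose range contains the value's ID."""
    kid = keyspace_id(value)
    for name, low, high in SHARDS:
        if low <= kid < high:
            return name
    raise ValueError(f"no shard covers id {kid:#x}")


if __name__ == "__main__":
    for user_id in (1, 2, 3, 4):
        print(user_id, "->", shard_for(user_id))
```

Splitting a shard in this model only means replacing one range with two narrower ones; the values and their IDs stay the same, which is what makes resharding tractable.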

Vitess’s architecture provides load balancing and query routing through vtgates. Since these gates are stateless layers, they can easily be scaled up and down. In turn, the vtgates route queries to vttablets, which are proxies into the shards and return the aggregated results to the vtgates.

 

Vitess retains all its benefits when deployed on a cluster orchestration tool like Kubernetes. Since the vtgates act as stateless proxies, they are suitable for deployment on a container cluster. A lockserver such as etcd acts as the metadata store and handles administrative work such as schema definitions.

Implemented in Go, Vitess can handle thousands of connections using Go’s concurrency support.

Listen to our episode on Vitess with Sugu Sougoumarane for deeper discussions on Vitess’ history, architecture, and use cases.

The NewSQL ecosystem is constantly growing and evolving. While it is almost impossible to make a general definition or come up with general characteristics that can encapsulate all NewSQL databases, the distinctive database designs that come out as a result under the umbrella of NewSQL add to the range of options that developers can choose from for specific use cases. One-size-fits-all architectures are not desirable anymore, and NewSQL is the movement towards innovation and specialized database designs.

Gokhan Simsek

Eindhoven, The Netherlands

Gokhan is a computer science graduate, currently pursuing an MSc degree in Data Science at Eindhoven University of Technology. He’s interested in big data, NLP, and machine learning.

 

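Introduction: NewSQL is a new kind of relational database that aims to combine the ACID transaction guarantees of RDBMSs with the horizontal scalability of NoSQL. By introducing NewSQL databases such as VoltDB, TiDB, Cosmos DB, Citus, and Vitess, this article lays out the considerations that set NewSQL apart.

Author: Gokhan Simsek

Translator: 蓋磊

Source: AI 前線 (ID: ai-front)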

 

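Most developers are no strangers to SQL and to relational database management systems (RDBMSs) such as MySQL and PostgreSQL. The basic architectural principles of RDBMSs have been developing for decades. NoSQL solutions such as MongoDB and Cassandra were proposed in the early 2000s to meet the need for distributed, scalable data.

In the past few years, however, a new direction called NewSQL has emerged.

NewSQL is a new kind of relational database that aims to combine the ACID transaction guarantees (atomicity, consistency, isolation, and durability) provided by RDBMSs with the horizontal scalability provided by NoSQL. NewSQL sounds as if it takes the strengths of both directions, like a perfect solution. So why has it only arrived now?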

 

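Databases grew out of the need, in the 1960s, to separate code from data. Their initial design rested on the following considerations:

  1. The number of users querying the database is limited.
  2. Query types are unrestricted: developers can issue any kind of query they need.
  3. Hardware is expensive.

At the time, developers entered interactive queries through a terminal. Since developers were the only users with access to the database, these considerations were meaningful and valuable. Correctness and consistency were the two metrics users cared about most, whereas today performance and availability matter more. Vertical scaling was the answer to growing data needs, and the downtime required to move data during a database migration or recovery was tolerable.

Fast-forward a few decades to today's Internet and cloud era, and the demands on databases are very different. The scale of data is massive, and commodity hardware is far cheaper than it was in the last century.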

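As data volumes grew and real-time interaction over the Internet became ubiquitous, users' basic needs from databases fell into two main categories: OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing).

OLAP databases are commonly called data warehouses. They store historical records for statistics and analysis in business intelligence. OLAP databases focus on read-only workloads, including ad-hoc queries for batch processing. They have relatively few querying users, since usually only a company's employees can access the historical records.

OLTP databases serve highly concurrent transactional data processing, characterized by real-time users submitting short, predefined queries. A simple example of transaction processing is an ordinary user searching for and buying items on an e-commerce website. Although OLTP users access a much smaller subset of the data than OLAP users do, there are far more of them, and their queries can include both reads and writes. The main concerns for OLTP databases are high availability, concurrency, and performance.

On most websites, hundreds or thousands of users may be effectively querying the database concurrently at any moment. At that scale the system must be highly available, because every minute of downtime can cost a company thousands or even millions of dollars.

Queries submitted by users on a website are predefined, because users cannot access the database terminal and run arbitrary queries. The queries live in the application logic, which makes it possible to optimize for high performance.

Scalability is an important metric in this new database ecosystem, and high availability is critical to a company's profits. NoSQL databases offered a solution with easier scalability and better performance by opting for an AP design, the A (availability) and P (partition tolerance) of the CAP theorem. This meant, however, that many NoSQL designs settle for eventual consistency and give up the strong consistency and transactional ACID properties that RDBMSs provide.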

 

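NoSQL databases use models other than the relational one, such as key-value, document, wide-column, and graph models. NoSQL databases built on these models are not normalized and are schemaless by design. Most NoSQL databases support automatic sharding, so horizontal scaling is easy and needs no developer intervention.

NoSQL suits applications where eventual consistency is acceptable, such as social media. Users do not notice whether they are seeing an inconsistent view of the database, and since the data consists of status updates, tweets, and the like, strong consistency is not essential. NoSQL databases are, however, a poor fit for systems where consistency is critical, such as e-commerce platforms.

NewSQL systems were proposed precisely to combine the features of NoSQL and RDBMSs: NoSQL contributes scalability and high availability, while traditional RDBMSs contribute the relational model, ACID transaction support, and SQL. Users no longer look for a one-size-fits-all solution and are turning to databases specialized for particular workloads such as OLTP. Most NewSQL databases are complete redesigns, focused either on OLTP or on hybrid OLTP/OLAP workloads.

Traditional RDBMS architecture was not designed with distributed systems in mind; support for distribution was added on top of the original design only after the need arose. Because RDBMSs use a normalized schema rather than the aggregated form used by NoSQL, they had to introduce complicated concepts to scale out while preserving their consistency requirements. Manual sharding and master-slave architectures were developed to support horizontal scaling in RDBMSs.

RDBMSs, however, give up a great deal of performance when they scale out. Joins have to move data between nodes for aggregation, which makes them more expensive, and maintenance becomes more time-consuming. Complex systems and products were developed to preserve RDBMS performance, but even today traditional RDBMSs are not regarded as inherently scalable.

NewSQL databases were born for the cloud era, so they were designed around a distributed architecture from the start.

So what distinctive characteristics do NewSQL solutions offer?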

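01 Consistency

NewSQL favors consistency over availability, that is, the C and P of CAP. Many NewSQL databases sacrifice some availability to provide strong consistency. To reach distributed consistency, they use the Paxos or Raft consensus protocols at the global system level or the local partition level. Some solutions, such as MemSQL, also let users tune the trade-off between consistency and availability, supporting different configurations for different use cases.

02 In-Memory Databases

Traditional RDBMSs rely on secondary storage (disk) as the medium for storing data, most commonly SSDs or HDDs. OLTP workloads can archive historical data in data warehouses, so they need only the most recent data rather than large volumes of it. Some NewSQL solutions therefore use main memory (RAM) as the storage medium. Memory access is much faster than disk access: roughly 100 times faster than SSD and 10,000 times faster than HDD.

In-memory solutions deliver a further performance boost because they eliminate or simplify buffer management and heavyweight concurrency systems. Since all (or most) of the data is held in memory, buffer management is unnecessary. As for concurrency, different implementations take different approaches, such as serialization.

What about durability? RAM is volatile by nature, and data that needs to persist is lost when power is lost. In-memory databases address this in several ways, commonly by combining infrequent backups to disk, logging that preserves state for recoverability, and non-volatile RAM for critical data.

Two important examples of in-memory databases are VoltDB and MemSQL.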

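  • VoltDB

VoltDB is an ACID-compliant in-memory relational database. Its architecture is based on H-Store, an in-memory database designed for OLTP workloads by Michael Stonebraker et al.

VoltDB focuses on fast data and is meant to serve specific applications that must process large streams of data quickly, such as trading applications, online gaming, and IoT sensors. In keeping with OLTP principles, VoltDB was designed from scratch for high performance.

VoltDB deliberately builds on stored procedures and moves them closer to the data, which lets it execute serialized transactions. Procedures are broken up into atomic transactions, which are then serialized and executed one after another from a queue. This serialized transaction scheme removes the overhead of managing concurrency and so improves performance. VoltDB also supports ad-hoc queries, but it is the stored procedures that benefit from the performance optimizations. This fits OLTP workloads well, because end users cannot run ad-hoc queries.

Durability, one of the ACID requirements, is an important question for in-memory databases. VoltDB achieves durability through several techniques, including snapshots, command logging, K-safety, and database replication. These techniques give VoltDB redundancy and therefore durable data.

To learn more about VoltDB and its architecture, see our earlier podcast interviews with John Hugg and Ryan Betts.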

 

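03 HTAP

As mentioned earlier, many NewSQL databases are complete redesigns. Given that freedom, some projects set out to build a unified database that handles both transactional and analytical workloads. The term HTAP (Hybrid Transactional/Analytical Processing) was coined by Gartner. Databases with HTAP capabilities enable advanced real-time analytics, which in turn support real-time business decisions and intelligent transaction processing.

VoltDB also offers HTAP capabilities, though it focuses more on transactional workloads. Other notable HTAP databases include TiDB and Google's Spanner.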

 

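1. TiDB

TiDB is an open-source solution from China: a MySQL-compatible HTAP database that is strongly consistent, distributed, and scalable. TiDB has a layered architecture in which the TiDB server sits on top as a stateless computing layer. The underlying storage layer is TiKV, a transactional key-value database whose design was inspired by Google's Spanner.

The TiDB layer listens for SQL queries, parses them, and creates an execution plan. The query is then split into subqueries as needed and sent to the corresponding TiKV stores. Because the TiDB layer is stateless, it is easy to scale.

The TiKV layer is the underlying storage layer, a key-value database that uses RocksDB for physical storage. TiKV organizes data by regions, and each region is stored and replicated. To achieve durability and high availability with this replication scheme, TiKV uses the Raft consensus algorithm to provide strong consistency. TiKV's distributed nature supports distributed queries.

This decoupling of the computation layer from the storage layer is what lets TiDB offer strong support for both OLTP and OLAP. TiDB itself handles OLTP and simple OLAP workloads, while TiSpark, an OLAP solution that runs Spark SQL directly on TiKV, can easily be added to the TiDB/TiKV architecture. With its cost optimizer and distributed executor, TiDB on its own can handle 80% of ad-hoc OLAP queries.

TiSpark is optimized for complex OLAP queries. Like the TiDB layer, TiSpark is a stateless computing layer that communicates with TiKV, but it is designed to handle complex OLAP queries through Spark SQL. Deploying TiDB and TiSpark together therefore eliminates ETL costs and gives a unified solution for both analytical and transactional needs.

For more on TiDB and its architecture, see our recent interview with Kevin Xu about TiDB. For more on RocksDB, the physical data store behind TiKV and TiDB, see our interview with Dhruba Borthakur and Igor Canadi. For more on TiKV, see our article on Chinese open source projects.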

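2. Cosmos DB

Microsoft's Azure Cosmos DB offers numerous tunable features, which makes it a highly flexible solution that can be adjusted to fit many kinds of use cases. We consider Cosmos DB a NewSQL database as well.

Cosmos DB is a globally distributed, multi-model database service. As a multi-model service, it supports key-value, column-family, document, and graph databases as underlying storage models, and it exposes data through both SQL and NoSQL APIs.

For global distribution, Cosmos DB keeps replicas of the data in multiple data centers around the world, which ensures reliability and high availability. Developers can create replicas and scale their data horizontally with a few simple API calls.

Cosmos DB is designed to reduce the cost of database management. Developers do not need to worry about index or schema management, because Cosmos DB maintains indexes automatically to ensure performance.

Cosmos DB offers several consistency levels, letting developers choose their trade-offs along with the appropriate SLAs. Rather than only the two extremes of strong consistency and eventual consistency, Cosmos DB provides five well-defined consistency levels along the spectrum. Each level comes with its own SLA that guarantees a certain level of availability and performance.

As a product from a technology and cloud giant like Microsoft, Cosmos DB is easy for developers to use and provides comprehensive guarantees for performance, availability, and consistency.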

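04 Augmenting RDBMS

NewSQL can also take the form of augmenting an existing RDBMS with the ability to scale out, without completely redesigning the database. Such solutions are implemented on top of battle-tested SQL databases and extend their capabilities. The idea is useful for large enterprises that have a well-established system and are unwilling to migrate to a new database solution.

1. Citus

A good example is Citus, which builds on PostgreSQL.

Citus is developed and maintained by Citus Data, which was recently acquired by Microsoft. It is an open-source PostgreSQL extension that supports horizontal scaling by transparently distributing tables and queries, making PostgreSQL distributed.

In a Citus cluster, tables are distributed: they are horizontally partitioned across different worker nodes, yet appear to the user as regular tables. A coordinator that maintains table metadata oversees the PostgreSQL worker nodes, processes queries, and parallelizes them across the appropriate table partitions.

Citus adds query routing, distributed tables, distributed transactions, and stored procedures to PostgreSQL and takes care of numerous low-level details, yielding a horizontally scalable, high-performance PostgreSQL.

For more on Citus, see our interview with Ozgun Erdogan on scaling PostgreSQL and our interview with Marco Slot on Postgres sharding.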

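2. Vitess

Whereas Citus builds on PostgreSQL, Vitess is designed to enhance MySQL and make it fit the requirements of the cloud era.

Vitess was first built at YouTube in 2011 to meet its own scaling needs. As users and data grew, horizontal scaling and sharding became necessary, and Vitess was created to handle that scaling transparently. Vitess has since been open-sourced and is hosted by the CNCF. Recognized as a cloud-native technology, it provides several improvements to MySQL.

The first improvement is the introduction of multiple sharding schemes. Users can create their own sharding schemes, and Vitess organizes the shards and data accordingly. Vitess also supports automatic sharding without hand-written application code, and live resharding with minimal read-only downtime.

Sharding is implemented through Vindexes and keyspaces. A Primary Vindex is similar to the primary index in a database's indexing scheme. Users can specify the attribute to use as the Primary Vindex and the number of shards the data is split into based on it. Once the database is sharded, keyspace-based queries are directed to the appropriate shards.

Vitess's architecture uses vtgates for load balancing and query routing. A vtgate is a stateless layer that can easily be scaled up and down. The vtgates route queries to vttablets, which act as proxies for the shards and return the aggregated results to the vtgates.

When deployed on a cluster orchestration tool such as Kubernetes, Vitess retains these benefits. Because vtgates are stateless proxies, they are well suited to deployment on a container cluster. In that setup Vitess uses a lockserver such as etcd as the metadata store, which handles administrative work such as schema definitions.

Vitess is implemented in Go. Thanks to Go's strong concurrency support, it can handle thousands of connections.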

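05 Conclusion

The NewSQL ecosystem keeps growing and evolving. We cannot give a general definition that describes every NewSQL database, or propose a set of common characteristics. But the many database designs that have emerged under the NewSQL umbrella give developers a range of options for different use cases. People no longer expect a single architecture to fit every use case, and NewSQL is driving innovation and specialized database design.

About the author: Gokhan is a computer science graduate currently studying data science at Eindhoven University of Technology. His interests include big data, NLP, and machine learning.

Read the original English article: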

What Is New About NewSQL?

https://softwareengineeringdaily.com/2019/02/24/what-is-new-about-newsql/

 
