Impala Concepts and Architecture (Part 3): How Impala Fits Into the Hadoop Ecosystem (Translated from English)

https://impala.apache.org/docs/build/html/topics/impala_hadoop.html#intro_metastore

Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it can fit in flexible ways into your ETL and ELT pipelines.

Parent topic: Impala Concepts and Architecture

How Impala Works with Hive

A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries.

In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.

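For example, a table defined through the Hive shell becomes queryable from impala-shell once Impala picks up the new definition from the shared metastore. This is a minimal sketch; the table name and columns are illustrative:

    -- In the Hive shell: define a table in the shared metastore.
    CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP);

    -- In impala-shell: load the new definition, then inspect it.
    INVALIDATE METADATA web_logs;
    DESCRIBE web_logs;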

The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or SequenceFile file formats, you load the data using Hive.

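A hedged sketch of this division of labor (table and column names are illustrative, and events_text stands in for any existing text-format source table): load Avro data through Hive, then read it through Impala:

    -- In the Hive shell: create an Avro table and load it.
    CREATE TABLE events_avro (event_id BIGINT, payload STRING)
      STORED AS AVRO;
    INSERT INTO TABLE events_avro
      SELECT event_id, payload FROM events_text;  -- hypothetical source table

    -- In impala-shell: Impala can read this format with SELECT,
    -- even though it cannot write it with INSERT.
    INVALIDATE METADATA events_avro;
    SELECT COUNT(*) FROM events_avro;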

The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and higher, use the Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.

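A minimal example of the newer workflow, with an illustrative table name:

    -- In impala-shell: gather table and column statistics in one statement.
    COMPUTE STATS web_logs;

    -- Confirm what the optimizer will see.
    SHOW TABLE STATS web_logs;
    SHOW COLUMN STATS web_logs;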

Overview of Impala Metadata and the Metastore

As discussed in How Impala Works with Hive, Impala maintains information about table definitions in a central database known as the metastore. Impala also tracks other metadata for the low-level characteristics of data files:

  • The physical locations of blocks within HDFS.

For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to reuse for future queries against the same table.

If the table definition or the data in the table is updated, all other Impala daemons in the cluster must receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the catalogd daemon, for all DDL and DML statements issued through Impala. See The Impala Catalog Service for details.

For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the REFRESH statement (when new data files are added to existing tables) or the INVALIDATE METADATA statement (for entirely new tables, or after dropping a table, performing an HDFS rebalance operation, or deleting data files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables have been changed outside of Impala, you can issue REFRESH table_name for each affected table to only retrieve the latest metadata for those tables.

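A sketch of both statements (table names are illustrative):

    -- After Hive or another process adds data files to an existing table:
    REFRESH sales_data;

    -- After creating or dropping tables outside Impala, rebalancing HDFS,
    -- or deleting data files:
    INVALIDATE METADATA new_table;   -- one known table
    INVALIDATE METADATA;             -- or all tables tracked by the metastore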

How Impala Uses HDFS

Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using familiar HDFS file formats and compression codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of file name. New data is added in files with names controlled by Impala.

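For instance, files already in HDFS can be moved into a table's directory with LOAD DATA, or copied in manually and picked up with REFRESH. The path and table name below are illustrative:

    -- In impala-shell: move existing HDFS files under the table's directory.
    LOAD DATA INPATH '/staging/logs/2014-01-01' INTO TABLE web_logs;

    -- If files were added outside Impala (e.g. with hdfs dfs -put),
    -- make Impala rescan the directory:
    REFRESH web_logs;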

How Impala Uses HBase

HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large (often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in HBase, you can query the contents of the HBase tables through Impala, and even perform join queries including both Impala and HBase tables. See Using Impala to Query HBase Tables for details.
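A sketch of such a mapping, assuming an existing HBase table named users with an info column family (all names are illustrative):

    -- In the Hive shell: map the HBase table into the metastore.
    CREATE EXTERNAL TABLE hbase_users (row_key STRING, name STRING, age INT)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = ":key,info:name,info:age")
    TBLPROPERTIES ("hbase.table.name" = "users");

    -- In impala-shell: query it, or join it with an HDFS-backed table
    -- (clicks is a hypothetical Impala table keyed by user).
    INVALIDATE METADATA hbase_users;
    SELECT u.name, c.url
    FROM hbase_users u JOIN clicks c ON u.row_key = c.user_id;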
