Using a single Hive warehouse for all EMR (Hadoop) clusters

As EMR/Hadoop clusters are transient, tracking databases and tables across clusters can be difficult. Instead of keeping a separate warehouse directory on each cluster, you can use a single permanent Hive warehouse shared by all EMR clusters. S3 is a good choice because it is persistent storage with a robust architecture that provides redundancy and read-after-write consistency.

For each cluster:

This can be configured with the hive.metastore.warehouse.dir property in hive-site.xml.

 

 


<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3n://bucket/hive_warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

 

You may need to update this setting on all nodes.
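If the setting has to be applied to several nodes, editing each hive-site.xml by hand is error-prone. The following is a minimal Python sketch of how the property could be set or updated programmatically. The helper name `set_hive_property` is hypothetical (it is not part of Hive or EMR), and the sketch assumes the file has the usual `<configuration>` root element containing `<property>` entries.

```python
import xml.etree.ElementTree as ET


def set_hive_property(xml_text, name, value, description=None):
    """Return hive-site.xml text with property `name` set to `value`.

    Hypothetical helper: assumes a <configuration> root with <property> children.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: overwrite (or create) its <value>.
            val = prop.find("value")
            if val is None:
                val = ET.SubElement(prop, "value")
            val.text = value
            break
    else:
        # Property not present: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        if description:
            ET.SubElement(prop, "description").text = description
    return ET.tostring(root, encoding="unicode")
```

Calling `set_hive_property(text, "hive.metastore.warehouse.dir", "s3n://bucket/hive_warehouse")` on the contents of an existing hive-site.xml returns the updated XML as a string. Note that this simple approach does not preserve XML comments or the XML declaration from the original file, so keep a backup before overwriting anything.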

For a single Hive session:

This can be configured inside the session with a command like set hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse;

Alternatively, start the Hive CLI with the setting on the command line: hive -hiveconf hive.metastore.warehouse.dir=s3n://bucket/hive_warehouse
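When Hive sessions are launched from a script, the per-session override can be assembled in one place. Below is a small Python sketch that only builds the command line and does not run Hive. The function names `hive_cli_args` and `as_shell` are hypothetical, and the sketch assumes the `hive` executable is on the PATH and that the optional query is passed with the CLI's `-e` option.

```python
import shlex

WAREHOUSE_KEY = "hive.metastore.warehouse.dir"


def hive_cli_args(warehouse_dir, query=None):
    """Build the argument list for a Hive CLI session using `warehouse_dir`."""
    args = ["hive", "-hiveconf", f"{WAREHOUSE_KEY}={warehouse_dir}"]
    if query is not None:
        # -e runs the given query string and then exits.
        args += ["-e", query]
    return args


def as_shell(args):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)
```

The list returned by `hive_cli_args("s3n://bucket/hive_warehouse", "SHOW DATABASES;")` can be passed directly to `subprocess.run`, which avoids shell quoting problems. `as_shell` is only needed when the command has to be logged or pasted into a terminal.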

Note that with the above configuration, all databases and tables created at the default location will be stored on S3 under paths like s3n://bucket/hive_warehouse/myHiveDatabase.db/
