Improving Hive Performance with S3/ADLS/WASB

Tune the following parameters to improve Hive performance when working with S3, ADLS or WASB.

Table 7.1. Improving General Performance

| Parameter | Recommended Setting |
| --- | --- |
| yarn.scheduler.capacity.node-locality-delay | Set this to "0". |
| hive.warehouse.subdir.inherit.perms | Set this to "false" to reduce the number of file permission checks. |
| hive.metastore.pre.event.listeners | Set this to an empty value to reduce the number of directory permission checks. |

You can set these parameters in hive-site.xml.
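As a rough illustration, the Hive-side settings from Table 7.1 could be rendered as hive-site.xml property entries along the following lines. The helper name `render_properties` is hypothetical. The YARN scheduler property is left out on the assumption that it is managed with the rest of the scheduler configuration.

```python
import xml.etree.ElementTree as ET

# Hive-side settings from Table 7.1, with the values the table recommends.
TABLE_7_1 = {
    "hive.warehouse.subdir.inherit.perms": "false",
    "hive.metastore.pre.event.listeners": "",  # empty value
}


def render_properties(settings):
    """Return a <configuration> XML string in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print only; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_properties(TABLE_7_1))
```

The printed `<property>` elements are meant to be merged into an existing hive-site.xml rather than used to replace the file.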

Table 7.2. Accelerating ORC Reads in Hive

| Parameter | Recommended Setting |
| --- | --- |
| hive.orc.compute.splits.num.threads | If you use the ORC format and want to improve split computation time, set this parameter to match the number of available processors. By default, it is set to 10. This parameter controls the number of parallel threads involved in computing splits. For Parquet, split computation is still single-threaded, so it can take longer with Parquet on S3/ADLS/WASB. |
| hive.orc.splits.include.file.footer | If you use the ORC format with the ETL file split strategy, set this parameter to "true" to use the existing file footer information in the split payload. |

You can set these parameters with the --hiveconf option in the Hive CLI or with the set command in Beeline.
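One way to follow the "match the number of available processors" advice is to derive the thread count when the job is launched. The sketch below builds `--hiveconf` arguments for the Hive CLI. The helper name `orc_hiveconf_args` is hypothetical, and the sketch assumes the launcher runs on the host whose processors perform the split computation.

```python
import os


def orc_hiveconf_args(cpu_count=None):
    """Build --hiveconf arguments for the two ORC settings in Table 7.2."""
    # os.cpu_count() can return None; fall back to Hive's default of 10.
    threads = cpu_count or os.cpu_count() or 10
    settings = {
        "hive.orc.compute.splits.num.threads": str(threads),
        # Only relevant when the ETL file split strategy is in use.
        "hive.orc.splits.include.file.footer": "true",
    }
    args = []
    for key, value in settings.items():
        args += ["--hiveconf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the arguments to append to a `hive` command line.
    print(" ".join(orc_hiveconf_args()))
```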

Table 7.3. Accelerating ETL Jobs

| Parameter | Recommended Setting |
| --- | --- |
| hive.metastore.fshandler.threads | Query launches can be slightly slower when no statistics are available or when hive.stats.fetch.partition.stats=false. In such cases, Hive ends up checking the size of every file that it tries to access. Tuning hive.metastore.fshandler.threads helps reduce the overall time taken for this metastore operation. |
| fs.trash.interval | DROP TABLE can be slow in object stores such as S3 because the action involves moving files to trash (a copy + delete). To remedy this, you can set fs.trash.interval=0 to skip trash completely. |

You can set these parameters with the --hiveconf option in the Hive CLI or with the set command in Beeline.
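For a Beeline session, the Table 7.3 settings can be written as `set` statements. The sketch below only formats the statements. The value 30 for hive.metastore.fshandler.threads is a placeholder assumption, not a recommendation from the table, and `beeline_set_statements` is a hypothetical helper.

```python
def beeline_set_statements(settings):
    """Format each setting as a `set key=value;` statement for Beeline."""
    return [f"set {key}={value};" for key, value in settings.items()]


ETL_SETTINGS = {
    # Placeholder thread count (assumption); tune it for your metastore.
    "hive.metastore.fshandler.threads": 30,
    # 0 disables trash, so dropped files are deleted instead of moved.
    "fs.trash.interval": 0,
}

if __name__ == "__main__":
    print("\n".join(beeline_set_statements(ETL_SETTINGS)))
```

With trash disabled, files removed by a dropped table cannot be restored from the trash directory, so this setting suits data that can be regenerated.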

Accelerating Inserts in Hive

When inserting data, Hive moves data from a temporary folder to the final location. This move operation is actually a copy + delete action, which is expensive in object stores such as S3; the more data is written to the object store, the more expensive the operation becomes.

To accelerate the process, you can tune hive.mv.files.thread, which sets the number of threads used to move files, depending on the size of your dataset. The default is 15. You can set it in hive-site.xml.
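A minimal sketch of setting hive.mv.files.thread in an existing hive-site.xml document, overriding the entry if it is already present. The function name `set_property` and the value 30 are illustrative assumptions; the text above only states that the default is 15.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml, name, value):
    """Set or add one <property> in a Hadoop-style configuration string."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_elem = prop.find("value")
            if value_elem is None:
                value_elem = ET.SubElement(prop, "value")
            value_elem.text = value
            break
    else:
        # No existing entry: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    existing = (
        "<configuration><property><name>hive.mv.files.thread</name>"
        "<value>15</value></property></configuration>"
    )
    # 30 is an illustrative value only.
    print(set_property(existing, "hive.mv.files.thread", "30"))
```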
