Improving Hive Performance with S3/ADLS/WASB

2020-02-22 19:58

Tune the following parameters to improve Hive performance when working with S3, ADLS or WASB.

Table 7.1. Improving General Performance

Parameter	Recommended Setting
`yarn.scheduler.capacity.node-locality-delay`	Set this to "0".
`hive.warehouse.subdir.inherit.perms`	Set this to "false" to reduce the number of file permission checks.
`hive.metastore.pre.event.listeners`	Set this to an empty value to reduce the number of directory permission checks.

You can set these parameters in hive-site.xml.
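As a rough illustration of how the Table 7.1 entries might look in hive-site.xml, the sketch below renders them as `<property>` elements. The helper name `render_hive_site` is hypothetical, and the output is meant to be merged into an existing file, not to replace it. The YARN property is a Capacity Scheduler setting, so some distributions may expect it in the scheduler's own configuration file; treat its placement here as following the text above.

```python
from xml.sax.saxutils import escape

# Values recommended in Table 7.1; the empty string clears the listener list.
GENERAL_SETTINGS = {
    "yarn.scheduler.capacity.node-locality-delay": "0",
    "hive.warehouse.subdir.inherit.perms": "false",
    "hive.metastore.pre.event.listeners": "",
}


def render_hive_site(settings):
    """Return a <configuration> block with one <property> per setting."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_hive_site(GENERAL_SETTINGS))
```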

Table 7.2. Accelerating ORC Reads in Hive

Parameter	Recommended Setting
`hive.orc.compute.splits.num.threads`	If you use the ORC format and want to improve split computation time, set this parameter to match the number of available processors. It controls the number of parallel threads used to compute splits and defaults to 10. For Parquet, split computation is still single-threaded, so it can take longer with Parquet on S3/ADLS/WASB.
`hive.orc.splits.include.file.footer`	If you use the ORC format with the ETL file split strategy, set this parameter to "true" to reuse existing file footer information in the split payload.

You can set these parameters with the `--hiveconf` option in the Hive CLI or with the `set` command in Beeline.
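The following sketch shows one way the Table 7.2 settings could be turned into `--hiveconf name=value` arguments for the Hive CLI. The function names are hypothetical, and the code assumes it runs on the host that computes the splits, so that the local processor count is the relevant one; pass an explicit number if that is not the case.

```python
import os


def orc_read_settings(num_processors=None):
    """Return the Table 7.2 settings as a name -> value mapping."""
    # Assumption: the local CPU count matches the host that computes splits.
    # Fall back to 10, Hive's default, if the count cannot be determined.
    threads = num_processors or os.cpu_count() or 10
    return {
        "hive.orc.compute.splits.num.threads": str(threads),
        "hive.orc.splits.include.file.footer": "true",
    }


def as_hiveconf_args(settings):
    """Flatten settings into ['--hiveconf', 'name=value', ...]."""
    args = []
    for name, value in settings.items():
        args += ["--hiveconf", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Print a command line that could be pasted into a shell.
    print(" ".join(["hive"] + as_hiveconf_args(orc_read_settings())))
```

Building the argument list programmatically keeps the thread count tied to the machine instead of hard-coding it in a script.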

Table 7.3. Accelerating ETL Jobs

Parameter	Recommended Setting
`hive.metastore.fshandler.threads`	Query launches can be slightly slower when no stats are available or when `hive.stats.fetch.partition.stats=false`. In such cases, Hive checks the size of every file that it tries to access. Increasing `hive.metastore.fshandler.threads` helps reduce the overall time taken by this metastore operation.
`fs.trash.interval`	`DROP TABLE` can be slow in object stores such as S3 because it moves files to trash, which is a copy followed by a delete. To avoid this, set `fs.trash.interval=0` to skip trash completely.

You can set these parameters with the `--hiveconf` option in the Hive CLI or with the `set` command in Beeline.
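For a Beeline session, the Table 7.3 settings could be emitted as `set` statements along these lines. This is only a sketch: the thread count of 30 is a placeholder chosen for illustration, since no specific value is recommended above, and the helper name is hypothetical.

```python
# Placeholder value: no thread count is recommended above, so tune this
# against your own metastore before relying on it.
FSHANDLER_THREADS = 30

ETL_SETTINGS = {
    "hive.metastore.fshandler.threads": str(FSHANDLER_THREADS),
    # 0 skips trash, so dropping a table avoids the copy + delete.
    "fs.trash.interval": "0",
}


def as_set_statements(settings):
    """Render each setting as a 'set name=value;' line for Beeline."""
    return [f"set {name}={value};" for name, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(as_set_statements(ETL_SETTINGS)))
```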

Accelerating Inserts in Hive

When inserting data, Hive moves it from a temporary folder to the final location. In object stores such as S3, this move is actually a copy followed by a delete, which is expensive; the more data is written to the object store, the more expensive the operation becomes.

To accelerate the process, tune `hive.mv.files.thread` (default: 15) according to the size of your dataset. You can set it in hive-site.xml.
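To make the cost model concrete, the sketch below emulates a copy-then-delete "move" on local files and spreads the work across a thread pool. It is not Hive's implementation; the function names are hypothetical, and the `threads` argument only plays a role analogous to `hive.mv.files.thread`.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_then_delete(src, dst):
    """Emulate an object-store move: copy the bytes, then delete the source."""
    shutil.copyfile(src, dst)
    os.remove(src)
    return dst


def move_files(src_dir, dst_dir, threads=15):
    """Move every regular file in src_dir to dst_dir using worker threads."""
    os.makedirs(dst_dir, exist_ok=True)
    names = sorted(
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    )
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(
                copy_then_delete,
                os.path.join(src_dir, n),
                os.path.join(dst_dir, n),
            )
            for n in names
        ]
        # result() re-raises any error from a worker, so failures are not lost.
        return [f.result() for f in futures]
```

Because each move is dominated by waiting on I/O, running several at once can shorten the total time, which is the effect a larger thread count is meant to achieve for big inserts.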