Problem: a Hive query generated a very large number of map tasks, wasting excessive CPU, and the tuning parameters that had been set did not take effect.
Problem analysis:
The number of map tasks in Hive is determined by the configured input split size. Hive wraps Hadoop's InputFormat interface, which describes the format of the input data; which implementation is used is selected by the hive.input.format parameter. Two implementations are mainly used:
1:HiveInputFormat
2:CombineHiveInputFormat
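The practical difference between the two can be sketched with some back-of-the-envelope arithmetic (a simplified model, not Hive code: HiveInputFormat produces roughly one split per block per file, while CombineHiveInputFormat packs files into splits up to a maximum combined size; the file counts and sizes below are made up for illustration):

```java
import java.util.Arrays;

public class SplitCountSketch {
    // HiveInputFormat (simplified): roughly one split per block of each file
    static long hiveInputFormatSplits(long[] fileSizes, long blockSize) {
        long n = 0;
        for (long s : fileSizes) {
            n += Math.max(1, (s + blockSize - 1) / blockSize); // ceil(s / blockSize)
        }
        return n;
    }

    // CombineHiveInputFormat (simplified): pack total input into splits of
    // at most maxSplitSize bytes
    static long combineSplits(long[] fileSizes, long maxSplitSize) {
        long total = Arrays.stream(fileSizes).sum();
        return Math.max(1, (total + maxSplitSize - 1) / maxSplitSize);
    }

    public static void main(String[] args) {
        long[] smallFiles = new long[1000];
        Arrays.fill(smallFiles, 10L << 20);            // 1000 files of 10 MiB
        long blockSize = 128L << 20;                   // 128 MiB HDFS block
        System.out.println(hiveInputFormatSplits(smallFiles, blockSize)); // 1000 maps
        System.out.println(combineSplits(smallFiles, 2147483648L));       // 5 maps
    }
}
```

With many small files, per-file splitting launches one map per file, while combining caps the map count at roughly totalInput / maxSplitSize; this is the effect exploited by the fix below.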
For CombineHiveInputFormat, split computation goes through the flow below (the key branches, highlighted in red in the original, are called out in the comments):
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      // Load CombineFileInputFormatShim, which extends
      // org.apache.hadoop.mapred.lib.CombineFileInputFormat
      CombineFileInputFormatShim combine =
          ShimLoader.getHadoopShims().getCombineFileInputFormat();
      if (combine == null) {
        // No shim available: fall back to the HiveInputFormat algorithm (same below)
        return super.getSplits(job, numSplits);
      }
      Path[] paths = combine.getInputPathsShim(job);
      for (Path path : paths) {
        // Non-native (storage-handler) tables are split with HiveInputFormat
        if ((tableDesc != null) && tableDesc.isNonNative()) {
          return super.getSplits(job, numSplits);
        }
        Class inputFormatClass = part.getInputFileFormatClass();
        String inputFormatClassName = inputFormatClass.getName();
        InputFormat inputFormat = getInputFormatFromCache(inputFormatClass, job);
        if (this.mrwork != null && !this.mrwork.getHadoopSupportsSplittable()) {
          if (inputFormat instanceof TextInputFormat) {
            // When hive.hadoop.supports.splittable.combineinputformat
            // (MAPREDUCE-1597) is NOT enabled, compressed TextInputFormat input
            // falls back to the HiveInputFormat split algorithm
            if ((new CompressionCodecFactory(job)).getCodec(path) != null) {
              return super.getSplits(job, numSplits);
            }
          }
        }
        // Symlink input is handled the same way
        if (inputFormat instanceof SymlinkTextInputFormat) {
          return super.getSplits(job, numSplits);
        }
        CombineFilter f = null;
        boolean done = false;
        Path filterPath = path;
        // Controlled by hive.mapper.cannot.span.multiple.partitions, default false.
        // If true, one pool is created per partition (that branch is omitted here);
        // otherwise one pool is created per (aliases, input format) combination,
        // so that splits of the same table and file format can be combined.
        if (!mrwork.isMapperCannotSpanPartns()) {
          opList = HiveFileFormatUtils.doGetWorksFromPath(
              pathToAliases, aliasToWork, filterPath);
          f = poolMap.get(new CombinePathInputFormat(opList, inputFormatClassName));
        }
        if (!done) {
          if (f == null) {
            f = new CombineFilter(filterPath);
            combine.createPool(job, f);
          } else {
            f.addPath(filterPath);
          }
        }
      }
      if (!mrwork.isMapperCannotSpanPartns()) {
        // Only here is the combine split algorithm invoked; the shim extends the
        // newer org.apache.hadoop.mapred.lib.CombineFileInputFormat
        iss = Arrays.asList(combine.getSplits(job, 1));
      }
      // Special handling for sampled queries
      if (mrwork.getNameToSplitSample() != null
          && !mrwork.getNameToSplitSample().isEmpty()) {
        iss = sampleSplits(iss);
      }
      // Wrap the results and return
      for (InputSplitShim is : iss) {
        CombineHiveInputSplit csplit = new CombineHiveInputSplit(job, is);
        result.add(csplit);
      }
      return result.toArray(new CombineHiveInputSplit[result.size()]);
    }
In this case the table's input format was TextInputFormat and its files were stored compressed; because hive.hadoop.supports.splittable.combineinputformat was not set, the combine configuration did not take effect and split computation fell back to HiveInputFormat:

    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      // Scan every partition
      for (Path dir : dirs) {
        PartitionDesc part = getPartitionDescFromPath(pathToPartitionInfo, dir);
        // Get the partition's input format
        Class inputFormatClass = part.getInputFileFormatClass();
        InputFormat inputFormat = getInputFormatFromCache(inputFormatClass, job);
        // Compute splits with that format's own split algorithm.
        // Note: this InputFormat is the old-API org.apache.hadoop.mapred one, not
        // org.apache.hadoop.mapreduce; a new-API format cannot be used here and
        // fails at query time with "Input format must implement InputFormat".
        // The two APIs also compute the input split size differently: the new API
        // uses Math.max(minSize, Math.min(maxSize, blockSize)) while the old API
        // uses Math.max(minSize, Math.min(goalSize, blockSize)).
        InputSplit[] iss = inputFormat.getSplits(newjob, numSplits / dirs.length);
        for (InputSplit is : iss) {
          // Wrap the results and return
          result.add(new HiveInputSplit(is, inputFormatClass.getName()));
        }
      }
      return result.toArray(new HiveInputSplit[result.size()]);
    }
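The old-API versus new-API split-size formulas mentioned in the comment above can be compared directly (a minimal sketch of just the two formulas, with illustrative sizes; in the old API, goalSize is totalSize divided by the numSplits hint):

```java
public class SplitSizeFormulas {
    // Old API (org.apache.hadoop.mapred.FileInputFormat):
    // goalSize = totalSize / requested numSplits
    static long oldApiSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // New API (org.apache.hadoop.mapreduce.lib.input.FileInputFormat):
    // maxSize comes from mapreduce.input.fileinputformat.split.maxsize
    static long newApiSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;            // 128 MiB HDFS block
        long totalSize = 10L << 30;             // assume 10 GiB of input
        int numSplits = 200;                    // hint passed to getSplits
        long goalSize = totalSize / numSplits;  // ~51 MiB
        // Old API: the numSplits hint can shrink splits below the block size
        System.out.println(oldApiSplitSize(goalSize, 1L, blockSize));
        // New API: the hint is ignored; only minSize/maxSize bound the block size
        System.out.println(newApiSplitSize(1L, Long.MAX_VALUE, blockSize));
    }
}
```

The key consequence: under the old API the requested number of splits feeds back into the split size via goalSize, whereas the new API is driven purely by the min/max size parameters.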
Solution: enable the hive.hadoop.supports.splittable.combineinputformat parameter. All related parameters:

    set hive.merge.mapfiles=true;
    set hive.hadoop.supports.splittable.combineinputformat=true;
    set hive.merge.size.per.task=2147483648;
    set hive.merge.smallfiles.avgsize=2147483648;
    set mapreduce.input.fileinputformat.split.maxsize=2147483648;
    set mapred.max.split.size=2147483648;
    set mapred.min.split.size.per.node=2147483648;
    set mapred.min.split.size.per.rack=2147483648;
    set hive.exec.reducers.bytes.per.reducer=2147483648;
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Verification: the number of map tasks dropped from 11236 to 2363, confirming that the parameter changes took effect.
Note: when applying these settings, keep an eye on the per-map input size; if a single map task is fed too much data it will run slowly. The values above should be adjusted to match the actual data volume.
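One way to pick these values is to work backwards from a target map count (a hypothetical rule-of-thumb helper, not a Hive API; the 500 GiB input size and 2000-map target are assumptions for illustration):

```java
public class SplitSizeTuning {
    // Hypothetical helper: choose a combine split size that yields roughly
    // the desired number of map tasks for a given total input size
    static long splitSizeFor(long totalInputBytes, long desiredMaps) {
        return Math.max(1L, totalInputBytes / desiredMaps);
    }

    public static void main(String[] args) {
        long total = 500L << 30; // assume 500 GiB of total input
        // Targeting ~2000 maps gives ~256 MiB per map, well under the
        // 2 GiB (2147483648-byte) ceiling used in the solution above
        System.out.println(splitSizeFor(total, 2000L)); // 268435456
    }
}
```

The resulting value would go into mapred.max.split.size / mapreduce.input.fileinputformat.split.maxsize; if it comes out larger than a few GiB per map, raise the target map count instead of accepting slow stragglers.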