An Analysis of the Hive ANALYZE Command

About the Hive ANALYZE command
1. Command usage:
Statistics for tables and partitions:
ANALYZE TABLE tablename
[PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS [noscan];
Statistics for columns:
ANALYZE TABLE tablename
[PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS FOR COLUMNS column_name1, column_name2, ...;
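When the table is partitioned, the partition must be specified in the command; otherwise the command fails.
Aliases for columns and tables are not supported.
2. Output of the analyze command on a partitioned table: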
Partition default.test{dt=a} stats: [num_files: 1, num_rows: 0, total_size: 41, raw_data_size: 0]
Table default.test stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 41, raw_data_size: 0]
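3. Execution flow, as seen in the source code
Execution steps of the command:
Once the command has been parsed into an AST, control enters the ColumnStatsSemanticAnalyzer class.
1) Check the command type (for example no scan, partial scan, and so on).
2) Rewrite the query. For example, take the following query: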
analyze table pokes compute statistics for columns foo,bar;
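From the AST, the analyzer extracts the table name, the column names, the number and types of the columns, and the partition names and their count, and then generates a new query: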
select compute_stats(foo , 16 ) , compute_stats(bar , 16 ) from pokes
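3) Generate a new AST.
4) Return to Driver to finish analyzing the query and generate the query plan. This analysis uses the new AST together with the original ctx.
5) Generate a column-statistics task that replaces the fetch task and writes the statistics to the metastore. The column-statistics task is generated by the genColumnStatsTask method of the MapReduceCompiler class, and each task has a corresponding work. The core code is as follows:
[Figure: core code of genColumnStatsTask]
For the query in step 2, the generated rootTasks are as follows:
[Figure: rootTasks generated for the query in step 2]
The MapRedTask in the figure above runs one aggregation through an RS (ReduceSink) operator.
6) Generate the plan. For the query in step 2, the following plan is generated: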
{"queryId":"zhangyun_20140403102424_07e3332f-12b9-4c54-b30f-f5fc912bb032","queryType":null,"queryAttributes":{"queryString":"analyze table pokes compute statistics for columns foo,bar"},"queryCounters":"null","stageGraph":{"nodeType":"STAGE","roots":"null","adjacencyList":"]"},"stageList":[{"stageId":"Stage-0","stageType":"MAPRED","stageAttributes":"null","stageCounters":"}","taskList":[{"taskId":"Stage-0_MAP","taskType":"MAP","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"TS_0","children":["SEL_1"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_1","children":["GBY_2"],"adjacencyType":"CONJUNCTIVE"},{"node":"GBY_2","children":["RS_3"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"TS_0","operatorType":"TABLESCAN","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"},{"operatorId":"SEL_1","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"},{"operatorId":"GBY_2","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"},{"operatorId":"RS_3","operatorType":"REDUCESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"}],"done":"false","started":"false"},{"taskId":"Stage-0_REDUCE","taskType":"REDUCE","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"GBY_4","children":["SEL_5"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_5","children":["FS_6"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"GBY_4","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"},{"operatorId":"SEL_5","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"},{"operatorId":"FS_6","operatorType":"FILESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"false"}],"done":"false","started":"false"}],"done":"false","started":"false"},{"stageId":"Stage-1","stageType":"COLUMNSTATS","stageAttributes":"null","stageCounters":"}","taskList":[{"taskId":"Stage-1_OTHER","taskType":"OTHER","taskAttributes":"null","taskCounters":"null","operatorGraph":"null","operatorList":"]","done":"false","started":"false"}],"done":"false","started":"false"}],"done":"false","started":"false"}
7) Output schema of the column statistics:
Schema(
fieldSchemas:[FieldSchema(name:_c0,type:struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,numdistinctvalues:bigint>,comment:null),
FieldSchema(name:_c1,type:struct<columntype:string,maxlength:bigint,avglength:double,countnulls:bigint,numdistinctvalues:bigint>, comment:null)],
properties:null)
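The query rewrite in step 2 can be sketched in a few lines of Java. This is only an illustration of the string transformation, not Hive's ColumnStatsSemanticAnalyzer: the class name ColumnStatsRewriteSketch and the method rewrite are hypothetical, and the constant 16 is simply copied from the rewritten query shown in step 2.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; it is not part of Hive.
class ColumnStatsRewriteSketch {
    // Second argument of compute_stats, copied from the rewritten query in step 2.
    static final int SECOND_ARG = 16;

    // Builds "select compute_stats(col , 16 ) , ... from table" for the given columns.
    static String rewrite(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String projections = columns.stream()
                .map(c -> "compute_stats(" + c + " , " + SECOND_ARG + " )")
                .collect(Collectors.joining(" , "));
        return "select " + projections + " from " + table;
    }

    public static void main(String[] args) {
        // Mirrors: analyze table pokes compute statistics for columns foo,bar;
        System.out.println(rewrite("pokes", Arrays.asList("foo", "bar")));
    }
}
```

Each compute_stats call yields one struct per column, which is why the output schema in step 7 has one field (_c0, _c1) per analyzed column.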
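Below are some summarized notes:
1: Column statistics
Statistics such as the number of rows in a table and the histogram of a particular column are useful in several ways. One use is query optimization: statistics are fed to the optimizer's cost functions, and the optimizer compares different plans and picks the better one.
Statistics can sometimes answer a user's query directly, so the user gets a result quickly by querying the stored statistics instead of launching a long-running execution plan.
Note: the above comes from the wiki, but feeding statistics to the optimizer is not implemented yet.
2: Scope
Hive currently supports statistics at the table and partition level but not statistics on the data in columns, and table and partition statistics alone are not enough for the cost-model-based plan selection described in 1.
Statistics must first support tables and partitions; these statistics are stored in the MetaStore.
For a partition, for example: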
• Number of Rows
• Number of files
• Size in Bytes.
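For a table, the statistics also include the number of partitions (Number of Partitions).
For partition-level statistics, the top N values of a column (Top K Statistics) can also be implemented.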

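3: Implementation
For a newly created table: when a job creates the table through a MapReduce job, each mapper gathers statistics for the rows it copies, and at the end of the job the statistics are aggregated and stored in the MetaStore.
For an existing table, the corresponding statistics are gathered during the table scan operation and then stored.
A database is needed to hold the temporary statistics: MySQL or HBase.
There are two interfaces:
IStatsPublisher
IStatsAggregator
4. The interfaces in detail: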
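public interface IStatsPublisher {
    // Prepares the publisher, for example by connecting to the temporary store.
    public boolean init(Configuration hconf);

    // Publishes one statistic (key and value) for the given row ID.
    public boolean publishStat(String rowID, String key, String value);

    // Releases the resources held by the publisher.
    public boolean terminate();
}

public interface IStatsAggregator {
    // Prepares the aggregator, for example by connecting to the temporary store.
    public boolean init(Configuration hconf);

    // Returns the aggregated value of the statistic named key for the given row ID.
    public String aggregateStats(String rowID, String key);

    // Releases the resources held by the aggregator.
    public boolean terminate();
}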
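To make the publish/aggregate split concrete, here is a minimal in-memory sketch. It rests on several assumptions: the interfaces are simplified stand-ins (SketchStatsPublisher and SketchStatsAggregator are hypothetical names) that take a plain Map instead of Hadoop's Configuration, values are assumed to be numeric, and aggregation is assumed to sum every published row ID that starts with the requested prefix. A real implementation would use a database such as MySQL or HBase as the temporary store.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the two interfaces (assumption: Map replaces Configuration).
interface SketchStatsPublisher {
    boolean init(Map<String, String> conf);
    boolean publishStat(String rowID, String key, String value);
    boolean terminate();
}

interface SketchStatsAggregator {
    boolean init(Map<String, String> conf);
    String aggregateStats(String rowID, String key);
    boolean terminate();
}

// Hypothetical in-memory store playing both roles; not part of Hive.
class InMemoryStatsStore implements SketchStatsPublisher, SketchStatsAggregator {
    // rowID -> (statistic name -> value)
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    public boolean init(Map<String, String> conf) {
        return true; // nothing to connect to in this sketch
    }

    // Stores one numeric statistic for a row ID; returns false if the value is not a number.
    public boolean publishStat(String rowID, String key, String value) {
        try {
            long parsed = Long.parseLong(value);
            rows.computeIfAbsent(rowID, id -> new HashMap<>()).put(key, parsed);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Sums the statistic over all row IDs that start with the given prefix (assumed semantics).
    public String aggregateStats(String rowIDPrefix, String key) {
        long total = 0;
        for (Map.Entry<String, Map<String, Long>> row : rows.entrySet()) {
            if (row.getKey().startsWith(rowIDPrefix)) {
                total += row.getValue().getOrDefault(key, 0L);
            }
        }
        return Long.toString(total);
    }

    public boolean terminate() {
        return true; // keep the data so the aggregator can still read it
    }

    public static void main(String[] args) {
        InMemoryStatsStore store = new InMemoryStatsStore();
        store.init(new HashMap<>());
        // Two mappers publish row counts for the same partition.
        store.publishStat("default.test/dt=a/mapper0", "num_rows", "120");
        store.publishStat("default.test/dt=a/mapper1", "num_rows", "80");
        System.out.println(store.aggregateStats("default.test/dt=a/", "num_rows"));
    }
}
```

The point of the split is that each mapper only writes its own partial counts, and a single aggregation step at the end of the job combines them before the result is written to the MetaStore.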
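Scenarios:
1: Newly created tables
For tables and partitions produced by INSERT OVERWRITE, statistics are computed automatically. A configuration setting controls whether this is enabled: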
set hive.stats.autogather=false;
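With this setting, no statistics are generated.
Users can plug in their own statistics implementation to choose where the temporary statistics are stored: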
hive.stats.dbclass=hbase
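stores them in HBase.
The default is jdbc:derby.
When the temporary statistics are stored through JDBC (Derby or MySQL), users can specify the corresponding connection settings: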
set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";
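A query may fail to collect statistics accurately.
Setting hive.stats.reliable makes the query fail when statistics cannot be collected reliably; the default is false.
2: Existing tables
For existing tables, use ANALYZE to gather statistics and write them to the MetaStore: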
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];
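If the user runs this command without specifying any partition, statistics are gathered for the table and all of its partitions; if a partition is specified, statistics are gathered only for that partition.
When statistics are computed across partitions, the partition columns still need to be listed.
If noscan is specified, the command does not scan the files, so it runs faster, but the statistics are then limited to the following items: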
• Number of files
• Physical size in bytes
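The two noscan items can be obtained from file metadata alone, which is why no data scan is needed. The following sketch shows the idea on a local directory using java.nio.file; it is an analogy under stated assumptions, not Hive code (Hive would read this information from the table's storage location, and NoScanStatsSketch, fileStats, and demo are hypothetical names).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogy of the noscan statistics; not part of Hive.
class NoScanStatsSketch {
    // Returns {number of regular files, total size in bytes} under dir,
    // using only file metadata and never opening the files.
    static long[] fileStats(Path dir) {
        long numFiles = 0;
        long totalSize = 0;
        try (Stream<Path> paths = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)) {
                    numFiles++;
                    totalSize += Files.size(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {numFiles, totalSize};
    }

    // Creates a temporary directory holding one 41-byte file and measures it.
    static long[] demo() {
        try {
            Path dir = Files.createTempDirectory("noscan-sketch");
            Path file = dir.resolve("part-00000");
            Files.write(file, new byte[41]);
            long[] stats = fileStats(dir);
            Files.delete(file);
            Files.delete(dir);
            return stats;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] stats = demo();
        System.out.println("num_files: " + stats[0] + ", total_size: " + stats[1]);
    }
}
```

Row counts, by contrast, require reading the data, which is why they are missing from the noscan result.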

Example:
Suppose a table has four partitions:
• Partition1: (ds='2013-03-08', hr=11)
• Partition2: (ds='2013-03-08', hr=12)
• Partition3: (ds='2013-03-09', hr=11)
• Partition4: (ds='2013-03-09', hr=12)

Running the following command:
ANALYZE TABLE Table1 PARTITION(ds='2013-03-09', hr=11) COMPUTE STATISTICS;
gathers statistics for partition 3 only.
Running
ANALYZE TABLE Table1 PARTITION(ds='2013-03-09', hr) COMPUTE STATISTICS;
gathers statistics for both partitions 3 and 4.

Running
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;
gathers statistics for all partitions.
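The selection rule in the three commands above can be sketched as a small matcher: a partition column given with a value must match exactly, while a column listed without a value matches anything. This is an illustration only, with hypothetical names (PartitionSpecSketch, select, part); a null value in the spec is this sketch's way of representing a column listed without a value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical matcher for illustration; it is not Hive's partition pruning code.
class PartitionSpecSketch {
    // Builds a (ds, hr) map; a null value means "column listed without a value".
    static Map<String, String> part(String ds, String hr) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("ds", ds);
        m.put("hr", hr);
        return m;
    }

    // The four partitions from the example above.
    static List<Map<String, String>> examplePartitions() {
        List<Map<String, String>> all = new ArrayList<>();
        all.add(part("2013-03-08", "11"));
        all.add(part("2013-03-08", "12"));
        all.add(part("2013-03-09", "11"));
        all.add(part("2013-03-09", "12"));
        return all;
    }

    // Keeps the partitions whose values agree with every non-null value in spec.
    static List<Map<String, String>> select(List<Map<String, String>> partitions,
                                            Map<String, String> spec) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (Map<String, String> partition : partitions) {
            boolean matches = true;
            for (Map.Entry<String, String> wanted : spec.entrySet()) {
                String value = wanted.getValue();
                if (value != null && !value.equals(partition.get(wanted.getKey()))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                selected.add(partition);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = examplePartitions();
        System.out.println(select(all, part("2013-03-09", "11"))); // PARTITION(ds='2013-03-09', hr=11)
        System.out.println(select(all, part("2013-03-09", null))); // PARTITION(ds='2013-03-09', hr)
        System.out.println(select(all, part(null, null)));         // PARTITION(ds, hr)
    }
}
```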

For a non-partitioned table, use:
ANALYZE TABLE Table1 COMPUTE STATISTICS;

To view the statistics:
DESCRIBE EXTENDED TABLE1;
The output looks like:
... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....
Example:
DESCRIBE EXTENDED TABLE1 PARTITION(ds='2013-03-09', hr=11);
 
