First of all, Hive's data lives on HDFS, so measuring the size of a Hive table amounts to measuring the size of its files.
The Hive admin UI shows the overall HDFS usage, but how do you see the space each individual table occupies?
After some searching I found no ready-made Hive command for this, so I decided to roll my own.
Python-based implementation
The code below is very inefficient (I didn't know the hadoop commands well, so I took a detour), but it does work: for a hundred-odd tables totalling 20 TB, 41 threads took 2 hours to finish.
The core of it is a single command: `hadoop fs -ls /path`
```shell
# hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - root supergroup 0 2020-05-11 19:46 /user/hive/warehouse/test.db
```
Then descend level by level: database, then table, then files (or partition directories and then their files).
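Each `hadoop fs -ls` output line has eight whitespace-separated columns: permissions, replication, owner, group, size, date, time, path. The walk below relies on exactly this layout; here is a minimal parsing sketch over captured sample output (the paths are just placeholders, not a live call):

```python
# Parse sample `hadoop fs -ls` output (captured text, not a live cluster).
sample = """Found 2 items
drwxr-xr-x   - root supergroup          0 2020-05-11 19:46 /user/hive/warehouse/test.db
-rwxr-xr-x   2 root supergroup        704 2020-05-11 19:50 /user/hive/warehouse/test.db/t1/data.csv"""

for line in sample.splitlines():
    if line.startswith('d'):           # directory entry: column 8 is the path
        print('dir :', line.split()[7])
    elif line.startswith('-'):         # file entry: column 5 is the size in bytes
        print('file:', line.split()[7], line.split()[4])
```

The `Found N items` header starts with neither `d` nor `-`, so it is skipped naturally; the real script below filters on the same two prefixes.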
```python
import subprocess
import threading
import logging
import time


def get_logger(name):
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler = logging.FileHandler(filename=name)
    logger = logging.getLogger('jimo_' + name)
    logger.addHandler(handler)
    return logger


log = get_logger('info.log')
data_writer = get_logger('result.txt')


def list_dir_path(path):
    """List the sub-directories under an HDFS path."""
    byte_str = subprocess.check_output(['hadoop fs -ls ' + path], shell=True)
    return [line.split()[7]
            for line in [ss.decode('utf-8') for ss in byte_str.splitlines()]
            if line.startswith('d')]


def list_dir_files(path):
    """List all entries (files and directories) under an HDFS path."""
    byte_str = subprocess.check_output(['hadoop fs -ls ' + path], shell=True)
    return [line
            for line in [ss.decode('utf-8') for ss in byte_str.splitlines()]
            if line.startswith('-') or line.startswith('d')]


def get_one_file_size(line):
    # Column 5 of an `hadoop fs -ls` line is the file size in bytes.
    return int(line.split()[4])


class TableSizeStat(threading.Thread):
    def __init__(self, name, db_name, table_path_list):
        threading.Thread.__init__(self)
        self.name = name
        self.db_name = db_name
        self.table_path_list = table_path_list

    def run(self):
        log.info('thread {} started at {}'.format(self.name, time.time()))
        self.count_size()
        log.info('thread {} finished at {}'.format(self.name, time.time()))

    def count_size(self):
        for table in self.table_path_list:
            lines = list_dir_files(table)
            # If an entry is a directory, descend one more level (large tables
            # are partitioned, and each partition is a directory); otherwise
            # add the file's size directly.
            table_size = 0
            for line in lines:
                if line.startswith('d'):
                    files = list_dir_files(line.split()[7])
                    for file in files:
                        table_size += get_one_file_size(file)
                elif line.startswith('-'):
                    table_size += get_one_file_size(line)
            log.info('db {} table {} size: {} MB'.format(
                self.db_name, table, table_size / 1024 / 1024))
            data_writer.info('{},{},MB'.format(table, table_size / 1024 / 1024))


def cal():
    db_path = list_dir_path('/user/hive/warehouse')
    table_paths = []
    for p in db_path:
        t_p = list_dir_path(p)
        table_paths.extend(t_p)
    log.info('{0} tables in total'.format(len(table_paths)))
    run_thread(table_paths)


def run_thread(table_paths):
    print(table_paths)
    max_thread = 16
    i = 0
    step = int(len(table_paths) / max_thread) + 1
    ts = []
    for j in range(min(len(table_paths), max_thread)):
        frag = table_paths[i:i + step]
        if len(frag) == 0:
            continue
        log.info('building thread {}: {}'.format(j, frag))
        t = TableSizeStat('thread' + str(j), '', frag)
        t.start()
        ts.append(t)
        i += step
    log.info('waiting for all threads to finish')
    for t in ts:
        t.join()


if __name__ == '__main__':
    cal()
    log.info('done: {}'.format(time.time()))
```
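The manual slicing and joining in `run_thread` can also be expressed with `concurrent.futures`, which distributes the work and waits for completion itself. This is just an alternative sketch, not the original script: `get_table_size` here is a stub standing in for the real `hadoop fs -ls` walk.

```python
from concurrent.futures import ThreadPoolExecutor


def get_table_size(table_path):
    # Stub for illustration only: the real version would shell out to
    # `hadoop fs -ls` and sum the file sizes under table_path.
    return len(table_path) * 100


table_paths = ['/user/hive/warehouse/test.db/t{}'.format(i) for i in range(5)]

with ThreadPoolExecutor(max_workers=16) as pool:
    # map() runs the calls across worker threads, preserves input order,
    # and the with-block joins every worker on exit.
    for table, size in zip(table_paths, pool.map(get_table_size, table_paths)):
        print('{},{} B'.format(table, size))
```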
Shell-based implementation
Later I found a much simpler command that directly reports the total space used by everything under a directory:
```shell
hadoop fs -count /path

# hadoop fs -count /user/hive/warehouse/test.db
2 1 704 /user/hive/warehouse/test.db
# hadoop fs -ls /user/hive/warehouse/test.db
Found 1 items
drwxr-xr-x - root supergroup 0 2020-05-11 19:50 /user/hive/warehouse/test.db/t1
# hadoop fs -ls /user/hive/warehouse/test.db/t1
Found 1 items
-rwxr-xr-x 2 root supergroup 704 2020-05-11 19:50 /user/hive/warehouse/test.db/t1/data.csv
```
- The first value, 2, is the directory count under `test.db` (the directory itself plus `t1`).
- The second value, 1, is the file count.
- The third value, 704, is the total size in bytes of those files; the replication factor is not counted.
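In other words, a captured `hadoop fs -count` line splits cleanly into those three counters plus the path; a minimal parsing sketch over sample text (not a live call):

```python
# Column layout of `hadoop fs -count`: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
line = '           2            1                704 /user/hive/warehouse/test.db'
dir_count, file_count, size, path = line.split()
print(path, 'uses', int(size), 'bytes in', int(file_count), 'file(s)')
# → /user/hive/warehouse/test.db uses 704 bytes in 1 file(s)
```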
So I rewrote it as a shell script:
```shell
# The "Found N items" header has no 8th column, so awk emits an empty
# field there that the for loops simply skip.
dbs=$(hadoop fs -ls /user/hive/warehouse | awk '{print $8}')
for db in $dbs
do
    echo "Scanning database: $db"
    tables=$(hadoop fs -ls "$db" | awk '{print $8}')
    for table in $tables
    do
        echo "Scanning table: $table"
        # -h prints a human-readable size; column 3 is CONTENT_SIZE.
        size=$(hadoop fs -count -h "$table" | awk '{print $3}')
        echo "Table $table uses: $size"
        echo "$table,$size" >> result.csv
    done
done
```
At roughly one command per second, the whole run finishes in about 2 minutes.