First of all, Hive's data lives on HDFS, so measuring the size of a Hive table amounts to measuring the size of its files.
The Hive admin UI shows the overall HDFS usage, but how do you see the space each individual table occupies?
After some searching I found no ready-made Hive command for this, so I decided to roll my own.
Python-based implementation
The code below is very inefficient (I didn't know the hadoop commands well, so I took a detour), but it does work: for a hundred-odd tables totalling 20 TB, 41 threads took 2 hours to finish.
The core of it is a single command: `hadoop fs -ls /path`
```shell
# hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - root supergroup 0 2020-05-11 19:46 /user/hive/warehouse/test.db
```
Then descend level by level: database, then table, then files (or partition directories and then their files).
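Each `hadoop fs -ls` output line has eight whitespace-separated columns: permissions, replication, owner, group, size, date, time, path. The walk below relies on exactly this layout; here is a minimal parsing sketch over captured sample output (the paths are just placeholders, not a live call):

```python
# Parse sample `hadoop fs -ls` output (captured text, not a live cluster).
sample = """Found 2 items
drwxr-xr-x   - root supergroup          0 2020-05-11 19:46 /user/hive/warehouse/test.db
-rwxr-xr-x   2 root supergroup        704 2020-05-11 19:50 /user/hive/warehouse/test.db/t1/data.csv"""

for line in sample.splitlines():
    if line.startswith('d'):           # directory entry: column 8 is the path
        print('dir :', line.split()[7])
    elif line.startswith('-'):         # file entry: column 5 is the size in bytes
        print('file:', line.split()[7], line.split()[4])
```

The `Found N items` header starts with neither `d` nor `-`, so it is skipped naturally; the real script below filters on the same two prefixes.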
```python
import subprocess
import threading
import logging
import time


def get_logger(name):
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler = logging.FileHandler(filename=name)
    logger = logging.getLogger('jimo_' + name)
    logger.addHandler(handler)
    return logger


log = get_logger('info.log')
data_writer = get_logger('result.txt')


def list_dir_path(path):
    """List the sub-directories under an HDFS path."""
    byte_str = subprocess.check_output(['hadoop fs -ls ' + path], shell=True)
    return [line.split()[7]
            for line in [ss.decode('utf-8') for ss in byte_str.splitlines()]
            if line.startswith('d')]


def list_dir_files(path):
    """List all entries (files and directories) under an HDFS path."""
    byte_str = subprocess.check_output(['hadoop fs -ls ' + path], shell=True)
    return [line
            for line in [ss.decode('utf-8') for ss in byte_str.splitlines()]
            if line.startswith('-') or line.startswith('d')]


def get_one_file_size(line):
    # Column 5 of an `hadoop fs -ls` line is the file size in bytes.
    return int(line.split()[4])


class TableSizeStat(threading.Thread):
    def __init__(self, name, db_name, table_path_list):
        threading.Thread.__init__(self)
        self.name = name
        self.db_name = db_name
        self.table_path_list = table_path_list

    def run(self):
        log.info('thread {} started at {}'.format(self.name, time.time()))
        self.count_size()
        log.info('thread {} finished at {}'.format(self.name, time.time()))

    def count_size(self):
        for table in self.table_path_list:
            lines = list_dir_files(table)
            # If an entry is a directory, descend one more level (large tables
            # are partitioned, and each partition is a directory); otherwise
            # add the file's size directly.
            table_size = 0
            for line in lines:
                if line.startswith('d'):
                    files = list_dir_files(line.split()[7])
                    for file in files:
                        table_size += get_one_file_size(file)
                elif line.startswith('-'):
                    table_size += get_one_file_size(line)
            log.info('db {} table {} size: {} MB'.format(
                self.db_name, table, table_size / 1024 / 1024))
            data_writer.info('{},{},MB'.format(table, table_size / 1024 / 1024))


def cal():
    db_path = list_dir_path('/user/hive/warehouse')
    table_paths = []
    for p in db_path:
        t_p = list_dir_path(p)
        table_paths.extend(t_p)
    log.info('{0} tables in total'.format(len(table_paths)))
    run_thread(table_paths)


def run_thread(table_paths):
    print(table_paths)
    max_thread = 16
    i = 0
    step = int(len(table_paths) / max_thread) + 1
    ts = []
    for j in range(min(len(table_paths), max_thread)):
        frag = table_paths[i:i + step]
        if len(frag) == 0:
            continue
        log.info('building thread {}: {}'.format(j, frag))
        t = TableSizeStat('thread' + str(j), '', frag)
        t.start()
        ts.append(t)
        i += step
    log.info('waiting for all threads to finish')
    for t in ts:
        t.join()


if __name__ == '__main__':
    cal()
    log.info('done: {}'.format(time.time()))
```
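The manual slicing and joining in `run_thread` can also be expressed with `concurrent.futures`, which distributes the work and waits for completion itself. This is just an alternative sketch, not the original script: `get_table_size` here is a stub standing in for the real `hadoop fs -ls` walk.

```python
from concurrent.futures import ThreadPoolExecutor


def get_table_size(table_path):
    # Stub for illustration only: the real version would shell out to
    # `hadoop fs -ls` and sum the file sizes under table_path.
    return len(table_path) * 100


table_paths = ['/user/hive/warehouse/test.db/t{}'.format(i) for i in range(5)]

with ThreadPoolExecutor(max_workers=16) as pool:
    # map() runs the calls across worker threads, preserves input order,
    # and the with-block joins every worker on exit.
    for table, size in zip(table_paths, pool.map(get_table_size, table_paths)):
        print('{},{} B'.format(table, size))
```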
Shell-based implementation
Later I found a much simpler command that directly reports the total space used by everything under a directory:
```shell
hadoop fs -count /path

# hadoop fs -count /user/hive/warehouse/test.db
2 1 704 /user/hive/warehouse/test.db
# hadoop fs -ls /user/hive/warehouse/test.db
Found 1 items
drwxr-xr-x - root supergroup 0 2020-05-11 19:50 /user/hive/warehouse/test.db/t1
# hadoop fs -ls /user/hive/warehouse/test.db/t1
Found 1 items
-rwxr-xr-x 2 root supergroup 704 2020-05-11 19:50 /user/hive/warehouse/test.db/t1/data.csv
```
- The first value, 2, is the directory count under `test.db` (the directory itself plus `t1`).
- The second value, 1, is the file count.
- The third value, 704, is the total size in bytes of those files; the replication factor is not counted.
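In other words, a captured `hadoop fs -count` line splits cleanly into those three counters plus the path; a minimal parsing sketch over sample text (not a live call):

```python
# Column layout of `hadoop fs -count`: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
line = '           2            1                704 /user/hive/warehouse/test.db'
dir_count, file_count, size, path = line.split()
print(path, 'uses', int(size), 'bytes in', int(file_count), 'file(s)')
# → /user/hive/warehouse/test.db uses 704 bytes in 1 file(s)
```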
So I rewrote it as a shell script:
```shell
# The "Found N items" header has no 8th column, so awk emits an empty
# field there that the for loops simply skip.
dbs=$(hadoop fs -ls /user/hive/warehouse | awk '{print $8}')
for db in $dbs
do
    echo "Scanning database: $db"
    tables=$(hadoop fs -ls "$db" | awk '{print $8}')
    for table in $tables
    do
        echo "Scanning table: $table"
        # -h prints a human-readable size; column 3 is CONTENT_SIZE.
        size=$(hadoop fs -count -h "$table" | awk '{print $3}')
        echo "Table $table uses: $size"
        echo "$table,$size" >> result.csv
    done
done
```
At roughly one command per second, the whole run finishes in about 2 minutes.