數據治理中 PyODPS 的正確使用方式

原創

2020-06-28 21:51

數據治理中 PyODPS 的正確使用方式

概述：表飽和度（字段是否爲空）、字段閾值（數值類字段取值是否超出有效邊界）是評估數據質量的關鍵指標，由於是單表內字段級別的校驗和統計，並且幾乎涉及所有表，範圍大、邏輯簡單、重複性強，結合 Python 開發效率高的特點，很多數據工程師會使用 PyODPS 進行相關功能的開發。本文基於 PyODPS 分別使用 3 種方式實現了“飽和度統計”功能，展示了它們的執行效率，並分析了原因。

結論：1. 除非數據量極少，否則要避免把數據拉取到本地處理；2. 執行 SQL 的方式效率最高，並且直觀，如果只是飽和度的場景，推薦這種方式，但是受 SQL 語法的限制，不夠靈活；3. DataFrame SDK 的方式效率雖然略低，但使用方式非常靈活，並且可以把常用處理邏輯封裝成函數，代碼複用率更高；

測試環境： Mac Book Pro | 4C/16G/512G

1. 通過 open_reader 和分區級別併發實現

測試表數據量	分區數	運行時間
332萬	5	22分40秒

分析: 通過 open_reader 把表內數據拉取到本地進行檢驗和統計，雖然代碼中使用了多線程，但是並沒有“真正”的併發：1、沒有利用 ODPS 的計算能力，而是使用了本地的計算能力；2、Python 的 GIL（全局解釋器鎖）使線程之間在串行執行。如果數據量極少，這種方式的優勢是節省了創建 ODPS 實例的時間和資源開銷。

2. 通過 execute_sql 全表掃實現

測試表數據量	分區數	運行時間
6萬	77	21秒
1600萬	23	18秒
3.1億	14	50秒

分析：與在 DataWorks 上面執行 SQL 情況相同，只要能把 SQL 拼出來，就能實現想要的功能。但是，如果所在 Project 限制了全表掃，則需要 set odps.sql.allow.fullscan=true; 操作。缺點是，檢驗邏輯在 SQL 中，是靠拼字符串拼出來的，代碼很難複用。

3. 通過 DataFrame 實現

測試表數據量	分區數	運行時間
6萬	77	53秒
1600萬	23	39秒
3.1億	14	5分鐘56秒

分析：使用了 DataFrame 的 MapReduce API 與聚合函數 sum，DataFrame 會提交 1 個實例，把 DateFrame 操作轉換爲 UDF SQL，通過 Mapper 把原表數據轉換爲 int 類型的計數，然後進行 sum 操作。DataFrame 比 SQL 執行效率低的原因，在於自定義 Python 函數與內置函數的性能差異，例如，作者曾使用自己編寫的 Reducer 代替 sum ，效率會大大降低。

代碼

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

from odps import ODPS
from odps.models.resource import *
import datetime
from odps import options
from datetime import datetime, timedelta
from time import sleep
from odps.df import DataFrame
from odps.df import output
from queue import Queue
import threading

o = ODPS('xxx', 'xxx', 'xxx', endpoint='xxx')

options.interactive = True  # 用到 print 需要打開
options.verbose = True      # 輸出進度信息和錯誤信息


def mapper(row):
    ret = [1]
    for i in range(len(row)):
        ret.append(1 if row[i] is None or str(row[i]).isspace() or len(str(row[i])) <= 0 else 0)
    yield tuple(ret)

def data_frame(table_name):
    findnull = DataFrame(o.get_table(table_name))
    col_num = len(findnull.dtypes)
    output_types = ['int' for i in range(col_num + 1)]
    output_names = [findnull.dtypes.columns[i].name for i in range(col_num)]
    output_names.insert(0, 'total_cnt')
    table = findnull.map_reduce(mapper, mapper_output_names=output_names, mapper_output_types=output_types)
    print(table.sum())

def check_data_by_execute_sql(table_name):
    ta = o.get_table(table_name)
    data_count = {}
    table_count = 0
    sql_str = 'select \n'
    for col in ta.schema.columns:
        col_name = col.name
        sql_str += "sum(case when (%s is null) or (%s = '') or (trim(%s) = '') then 1 else 0 end) as %s_yx,\n" % (col_name, col_name, col_name, col_name)
    sql_str += "count(1) as total_cnt \nfrom %s " %(table_name)
    print(sql_str)
    with o.execute_sql(sql_str).open_reader() as rec:
        for r in rec:
            for col in ta.schema.columns:
                print("%s\t\t%d" % (col.name, r.get_by_name(col.name + '_yx')))
            print("%s\t\t%d" % ('total_cnt', r.get_by_name('total_cnt')))

def get_last_day():
    today = datetime.today()
    last_day = today + timedelta(days=-1)
    return last_day.strftime('%Y%m%d')

count_queue = Queue()
threads = []

def check_data_by_open_reader(table_name, pt):
    ta = o.get_table(table_name)
    data_count = {}
    print(table_name + "\t:\t" + str(pt) + " STARTED")
    rec = ta.open_reader(partition=str(pt))
    table_count = rec.count
    for r in rec:
        for col in ta.schema:
            col_value = r.get_by_name(col.name)
            if col.name not in data_count:
                data_count[col.name] = 0
            if col_value == None or str(col_value).isspace() or len(str(col_value)) <= 0:
                data_count[col.name] += 1
    count_queue.put((data_count, table_count))
    print(table_name + "\t:\t" + str(pt) + " DONE")

# 假設 dt 爲分區字段
def check_data(table_name):
    table_tocheck = o.get_table(table_name)
    for pt in table_tocheck.iterate_partitions("dt='" + get_last_day() + "'"):  
        t = threading.Thread(target=check_data_by_open_reader, args=(table_name, pt))
        t.setDaemon(True)
        t.start()
        threads.append(t)

    print("線程數共:" + str(len(threads)))

    while True:
        thread_num = len(threading.enumerate()) - 1
        print("線程數量是%d" % thread_num)
        if thread_num <= 0:
            break
        sleep(10)

    total_cnt = 0
    total_data_cnt = {}
    while not count_queue.empty():
        pt_data = count_queue.get()
        data_count = pt_data[0]
        total_cnt += pt_data[1]
        for col_name in data_count.keys():
            if col_name not in total_data_cnt:
                total_data_cnt[col_name] = 0
            total_data_cnt[col_name] += data_count[col_name]

    print(total_cnt, total_data_cnt)

if __name__ == '__main__':
    table_name = 'xxxx' 

    if len(sys.argv) == 2:
        if sys.argv[1] not in ('1', '2', '3'):
            print("ARG ERROR: %s 1|2|3" % sys.argv[0])
            exit()
        print(datetime.now().strftime('%Y-%m-%d %H:%M:%S  BEGIN with ' + table_name))
        if sys.argv[1] == '1':
            check_data(table_name)
        elif sys.argv[1] == '2':
            check_data_by_execute_sql(table_name)
        elif sys.argv[1] == '3':
            data_frame(table_name)
        print(datetime.now().strftime('%Y-%m-%d %H:%M:%S  DONE with ' + table_name))
    else:
        print("ARG ERROR: %s 1|2|3" % sys.argv[0])

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

數據治理中 PyODPS 的正確使用方式

數據治理中 PyODPS 的正確使用方式

1. 通過 open_reader 和分區級別併發實現

2. 通過 execute_sql 全表掃實現

3. 通過 DataFrame 實現

代碼

美團一面：項目中有 10000 個 if else 如何優化？想了半天，被問懵了！

京東面試：如何進行JVM調優？

Python 將PowerPoint (PPT/PPTX) 轉爲HTML

SQL優化-20231016

Python-01-基礎語法

MySQL-03-故障轉移

Python-12-lambda

Coding-11-生產者消費者

Win10 爲 python 安裝 opencv

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結