Parsing PDFs in Python with pdfplumber and tabula

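I recently worked on a project that needed to parse PDF financial reports. The reports share roughly the same format, but the details differ slightly from one to the next.

I started with pdfplumber, but halfway through I found that it handles tables split across pages poorly.

My approach to a split table was to join the last table on one page with the first table on the next page. However, pdfplumber sometimes returned the tables out of order: the table that sits last on the page showed up in the middle of the returned list, so merging the table data failed.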
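The merge described above can be sketched in plain Python. This is only a minimal sketch, not the code from the project: `sort_tables_by_position` and `merge_split_table` are hypothetical helper names, the column-count check is an assumed heuristic, and the sort assumes each table object exposes a `bbox` of the form `(x0, top, x1, bottom)`, which is how pdfplumber describes the tables returned by `page.find_tables()`. Sorting by position first is one possible way to work around the ordering problem.

```python
from types import SimpleNamespace


def sort_tables_by_position(tables):
    """Return tables ordered top-to-bottom, then left-to-right.

    Each item only needs a `bbox` attribute shaped like
    (x0, top, x1, bottom).
    """
    return sorted(tables, key=lambda t: (t.bbox[1], t.bbox[0]))


def merge_split_table(prev_page_tables, next_page_tables):
    """Join the last table of one page with the first table of the next.

    Both arguments are lists of tables in reading order; a table is a
    list of rows and a row is a list of cell values.
    Assumed heuristic: the two pieces belong together only when their
    column counts match. Returns None when they do not.
    """
    if not prev_page_tables or not next_page_tables:
        return None
    head, tail = prev_page_tables[-1], next_page_tables[0]
    if not head or not tail or len(head[0]) != len(tail[0]):
        return None
    # If the continuation repeats the header row, keep only one copy.
    if tail[0] == head[0]:
        tail = tail[1:]
    return head + tail


# Stand-ins for table objects that were returned out of order.
found = [
    SimpleNamespace(bbox=(50, 400, 500, 700)),
    SimpleNamespace(bbox=(50, 80, 500, 300)),
]
ordered = sort_tables_by_position(found)

# Made-up rows in the list-of-lists shape that extract_tables() returns.
page_1_tables = [[["Note", "Text"]], [["Item", "Amount"], ["Cash", "120"]]]
page_2_tables = [[["Item", "Amount"], ["Debt", "80"]], [["Other"]]]
merged = merge_split_table(page_1_tables, page_2_tables)
print(merged)
```

With real pdfplumber tables, the rows of each sorted table would come from calling `extract()` on it. The column-count check is deliberately conservative; reports whose continuation tables change width would need a different rule.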

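So partway through I went looking for another parsing library and found tabula (used from Python through the tabula-py package). It handles some aspects of tables better than pdfplumber; at the very least, the tables do not come back out of order.

tabula only parses tables in a PDF, though, not body text, so in the end I used pdfplumber and tabula together.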

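To summarize:

pdfplumber:

Pros:

Text extraction is excellent; I found no wrong or missing characters.
Ordinary tables are also parsed very well.

Cons:

Handling of tables split across pages is weak.
Tables with merged cells are not parsed ideally, although the result is still better than tabula's.
There is a visual table debugging tool, but it is extremely hard to install; I spent a day and a half on it without success.

tabula:

Pros:

It is built specifically for tables in PDFs and copes well with tables split across pages.
Results are returned as pandas DataFrames, which are powerful for data processing.
There is a visual desktop application (an exe) that works as soon as it is installed (I have not used it).

Cons:

pandas DataFrames are powerful, but the learning curve is steep for anyone unfamiliar with them.
Tables with merged cells are parsed very poorly; I saw both missing and out-of-order data.
It is written in Java, so it depends on a Java installation (JDK).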

Simple examples:

pdfplumber, documentation: https://github.com/jsvine/pdfplumber

import pdfplumber

# Open the PDF
with pdfplumber.open("abc.pdf") as pdf:
    # Get the first page
    page = pdf.pages[0]
    # Extract the text on this page
    text = page.extract_text()
    # Extract every table on this page (each table is a list of rows)
    tables = page.extract_tables()
    # Loop over the tables
    for table in tables:
        # Loop over each row of the table
        for row in table:
            print(row)

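tabula, documentation: https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py

import tabula

# Path of the PDF to parse (placeholder file name)
pdf_path = "abc.pdf"

# Parse the tables. stream=True selects stream-mode detection (which I recommend),
# guess=True lets tabula guess the table area, and pages lists the page numbers, starting at 1.
# multiple_tables controls whether more than one table is detected.
tables = tabula.read_pdf(pdf_path, stream=True, guess=True, pages=[1, 2], multiple_tables=True)
# Loop over the tables (each one is a pandas DataFrame)
for table in tables:
    # Iterate over the row labels in the DataFrame's index
    # (these are positions by default, but they can be other labels)
    for index in table.index:
        # Get the values of the row with that label
        data = table.loc[index].values
        print(data)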

Bonus (dropping rows and columns that are entirely empty from a tabula table):

def format_data_frame(table):
    """
    Clean up a DataFrame by dropping rows and columns that are entirely empty.
    :param table: pandas DataFrame returned by tabula
    :return: a new DataFrame without the all-NaN rows and columns
    """
    # how="all" drops a row or column only when every value in it is NaN
    table = table.dropna(axis=0, how="all")
    table = table.dropna(axis=1, how="all")
    return table
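The function above works on the DataFrames that tabula returns. pdfplumber returns plain lists of rows instead, so the same cleanup needs a different shape there. The following is a minimal sketch under two assumptions: empty cells appear as `None` or as blank strings, and the helper names `is_blank` and `drop_empty_rows_and_columns` are made up for this illustration.

```python
def is_blank(cell):
    """Treat None and whitespace-only strings as empty cells (assumption)."""
    return cell is None or str(cell).strip() == ""


def drop_empty_rows_and_columns(table):
    """Remove rows and columns in which every cell is blank.

    `table` is a list of rows, each row a list of cell values, which is
    the shape pdfplumber's extract_tables() produces for one table.
    """
    # Keep only rows that contain at least one non-blank cell.
    rows = [row for row in table if not all(is_blank(c) for c in row)]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Keep a column index if any remaining row has a value there.
    keep = [
        i for i in range(width)
        if any(i < len(row) and not is_blank(row[i]) for row in rows)
    ]
    # Short rows are padded with None so every output row has equal length.
    return [[row[i] if i < len(row) else None for i in keep] for row in rows]


# Made-up sample with one empty row and one empty column.
raw = [
    ["Item", None, "Amount"],
    [None, "", None],
    ["Cash", None, "120"],
]
cleaned = drop_empty_rows_and_columns(raw)
print(cleaned)
```

Rows are filtered before columns so that a column is judged only on the rows that survive; for entirely blank rows this makes no difference to the result, but it keeps the column check cheaper.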
