Exporting PDF to Excel with Java

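A while ago I had to write a small program that exports data from PDF files to Excel. I hope this post gives some ideas to anyone with the same need.
The program is written in Java and uses a few libraries for working with PDF and Excel files. The approach is to dump each PDF to a txt file first, then scan the txt file for keywords and their values and write them to Excel.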

Downloading the PDFBox packages

pdfbox-2.0.3.jar: http://apache.fayea.com/pdfbox/2.0.3/pdfbox-2.0.3.jar
(covers ordinary PDF operations)
pdfbox-app-2.0.3.jar: http://apache.fayea.com/pdfbox/2.0.3/pdfbox-app-2.0.3.jar
(bundles PDFBox's command-line tools)
fontbox-2.0.3.jar: http://apache.fayea.com/pdfbox/2.0.3/fontbox-2.0.3.jar
(the font library used for PDFs)

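*For readers who are not yet comfortable with Eclipse, here is a quick note on adding the required jars to a project. Experienced users can skip this.
Right-click the project -> Build Path -> Configure Build Path… -> Add External JARs…
Then select the jar files you downloaded.

Without further ado, here is the code.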

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Dump the text of a PDF file to a txt file
    public static void PDFtoTXT(File pdf) {
        String baseName = pdf.getName().split("\\.")[0];
        // The text file that receives the extracted data
        File output = new File(baseName + ".txt");
        // try-with-resources closes the document and flushes and closes
        // the writer, even when extraction throws
        try (PDDocument pd = PDDocument.load(pdf);
             BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(output)))) {
            // Saves a copy of the PDF, e.g. "CopyOfInvoice.pdf"
            // (not needed for the text extraction itself)
            pd.save("CopyOf" + baseName + ".pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pd, wr);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

The method above writes the text of a PDF file to a txt file.
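One detail worth a hedge: the output name is derived with `split("\\.")[0]`, which keeps only the part before the first dot, so a file such as `report.2015.pdf` would be written to `report.txt`. The following is a minimal sketch of an alternative that strips only the last extension. The class `FileNames` and its methods are hypothetical helpers, not part of the original program; if you adopt them, the txt cleanup step in `main` would need the same change.

```java
import java.io.File;

final class FileNames {
    private FileNames() {}

    /** Returns the file name without its last extension, e.g. "a.b.pdf" -> "a.b". */
    static String baseName(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        // dot <= 0 covers names without an extension and dot-files such as ".gitignore"
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** Returns the txt file (in the working directory) that corresponds to the given PDF. */
    static File textFileFor(File pdf) {
        return new File(baseName(pdf) + ".txt");
    }

    public static void main(String[] args) {
        System.out.println(baseName(new File("report.2015.pdf"))); // prints "report.2015"
        System.out.println(textFileFor(new File("invoice.pdf")).getName()); // prints "invoice.txt"
    }
}
```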

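Next, we need to pull the keywords we want out of the txt file and write them to Excel.
This requires another dependency; the two options are jxl and POI.
A brief comparison of jxl and POI (source: http://blog.csdn.net/jarvis_java/article/details/4924099):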

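1. jxl

Advantages:
    - Very good Chinese support; simple to use, with self-explanatory method names.
    - A pure Java API, so it works well across platforms: the same code runs on Windows or Linux without being rewritten.
    - Supports all Excel versions from 95 to 2000 (some posts online say Excel 2007 is now supported as well; I have not tried it)
    - Generates the Excel 2000 standard format
    - Supports fonts, numbers, and dates
    - Can style cell properties
    - Supports images and charts, but only to a very limited degree, and only PNG images are recognized.
Disadvantages: slow, incomplete image support, and weaker format support than POI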

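2. POI

Advantages:
    - Fast
    - Supports formulas and macros, which is very useful in some enterprise applications
    - Can style cell properties
    - Supports fonts, numbers, and dates
Disadvantages: immature and reportedly not cross-platform, and quite a few developers seem to have run into frustrating bugs in real projects. (I hit some bugs in a recent project too, though I have not yet worked out whether my code or POI was at fault; the problem was odd, with some data-for-parameter substitutions always failing. As for the claim that it is not cross-platform, I have not tested that either, but isn't Java cross-platform? POI is a Java component, so why would it not be? These questions need more hands-on practice in future projects before the real differences become clear.)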

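I chose jxl, so the rest of this post uses jxl as the example.

jxl.jar download:

http://jaist.dl.sourceforge.net/project/jexcelapi/jexcelapi/2.6.12/jexcelapi_2_6_12.zip
After downloading, extract jxl.jar from the archive.
Remember to add the jar to the project, the same way as above.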

import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;

// Initialize the Excel workbook
    public static WritableWorkbook wwb = null;
    public static WritableSheet ws;
    public static void initExcel(){
        try {
            // Use the Workbook factory method to create a writable workbook.
            // jxl writes the binary .xls format, so the file is named .xls rather than .xlsx
            wwb = Workbook.createWorkbook(new File("excel.xls"));
            if(wwb != null){
                // Create a writable sheet.
                // createSheet takes two arguments: the sheet name and its position in the workbook
                ws = wwb.createSheet("sheet1", 0);
                // Now add the header cells.
                // Note that in Label the first argument is the column and the second is the row
                ws.addCell(new Label(0, 0, "公司名")); // company name
                ws.addCell(new Label(1, 0, "財報時間")); // report period
                ws.addCell(new Label(2, 0, "創新")); // innovation
                ws.addCell(new Label(3, 0, "財政補貼")); // government subsidies
                ws.addCell(new Label(4, 0, "優惠")); // incentives
                ws.addCell(new Label(5, 0, "稅收優惠")); // tax incentives
                ws.addCell(new Label(6, 0, "遞延所得稅費用")); // deferred income tax expense
                ws.setColumnView(0, 30);
                ws.setColumnView(1, 10);
                ws.setColumnView(3, 14);
                ws.setColumnView(6, 15);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (RowsExceededException e) {
            e.printStackTrace();
        } catch (WriteException e) {
            e.printStackTrace();
        }
    }

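    // Scan a txt file for the keywords and write the matches to the Excel sheet
    public static int count = 1; // next row to write; row 0 holds the header
    public static boolean isMain = false;
    public static boolean isYear = false;
    public static String name = null;
    public static String time = null;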
    public static void TXTtoEXCEL(File txt) {
        isMain = true;
        isYear = true;
        name = null;
        time = null;
        try {
            BufferedReader br = new BufferedReader(new FileReader(txt));
            System.out.println(txt.getName());
            String line = "";

            while((line = br.readLine()) != null) {
                Pattern p0 = Pattern.compile("\\s*(([（）]|[\u4E00-\u9FA5])*(股份有限公司))");
                Matcher m0 = p0.matcher(line);
                if (m0.find() && true == isMain) {
                    System.out.println(line);                   
                    System.out.println("公司名=" + m0.group(1));
                    name = m0.group(1);
                    isMain = false;
                    continue;
                }

                Pattern p1 = Pattern.compile("\\s*(([一二三四五六七八九十〇○]|[0-9])*\\s*年)");
                Matcher m1 = p1.matcher(line);
                if(m1.find() && true == isYear) {
                    System.out.println(line.replaceAll("○", "〇"));
                    System.out.println("財報時間=" + m1.group(1).replaceAll("○", "〇"));
                    time = m1.group(1).replaceAll("○", "〇");
                    isYear = false;
                    continue;
                }

                if(name != null && time != null) {
                    break;
                }

            }

            while((line = br.readLine()) != null) {
                Pattern p2 = Pattern.compile("\\s*(創新)(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))");
                Matcher m2 = p2.matcher(line);
                if(m2.find()) {
                    System.out.println(line);
                    System.out.println("創新=" + m2.group(4));
                    System.out.println(count);
                    ws.addCell(new Label(0, count, name));
                    ws.addCell(new Label(1, count, time));
                    ws.addCell(new Label(2, count, m2.group(4)));
                    count ++;
                    continue;
                }

                Pattern p3 = Pattern.compile("\\s*(政府補助|財政補貼)(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))");
                Matcher m3 = p3.matcher(line);
                if(m3.find()) {
                    System.out.println(line);
                    System.out.println("政府補貼=" + m3.group(4));
                    System.out.println(count);
                    ws.addCell(new Label(0, count, name));
                    ws.addCell(new Label(1, count, time));
                    ws.addCell(new Label(3, count, m3.group(4)));
                    count ++;
                    continue;
                }

                // Check the more specific keyword "稅收優惠" before "優惠":
                // find() matches "優惠" inside "稅收優惠", so with the opposite
                // order this branch would never be reached
                Pattern p5 = Pattern.compile("\\s*(稅收優惠)(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))");
                Matcher m5 = p5.matcher(line);
                if(m5.find()) {
                    System.out.println(line);
                    System.out.println("稅收優惠=" + m5.group(4));
                    System.out.println(count);
                    ws.addCell(new Label(0, count, name));
                    ws.addCell(new Label(1, count, time));
                    ws.addCell(new Label(5, count, m5.group(4)));
                    count ++;
                    continue;
                }

                Pattern p4 = Pattern.compile("\\s*(優惠)(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))");
                Matcher m4 = p4.matcher(line);
                if(m4.find()) {
                    System.out.println(line);
                    System.out.println("優惠=" + m4.group(4));
                    System.out.println(count);
                    ws.addCell(new Label(0, count, name));
                    ws.addCell(new Label(1, count, time));
                    ws.addCell(new Label(4, count, m4.group(4)));
                    count ++;
                    continue;
                }

                Pattern p6 = Pattern.compile("\\s*(遞延所得稅費用|遞延所得稅資產)(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))");
                Matcher m6 = p6.matcher(line);
                if(m6.find()) {
                    System.out.println(line);
                    System.out.println("遞延所得稅費用=" + m6.group(4));
                    System.out.println(count);
                    ws.addCell(new Label(0, count, name));
                    ws.addCell(new Label(1, count, time));
                    ws.addCell(new Label(6, count, m6.group(4)));
                    count ++;
                    continue;
                }
            }

            if(br != null) {
                br.close();
                br = null;
            }
        } catch (IOException e) {
           e.printStackTrace();
        } catch (RowsExceededException e) {
            e.printStackTrace();
        } catch (WriteException e) {
            e.printStackTrace();
        }
    }
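To make the amount patterns easier to follow, here is a small standalone sketch that uses only the JDK. It rebuilds the same regular expression for an arbitrary keyword and returns capture group 4, the amount. The class `AmountExtractor` and its `extract` method are hypothetical names introduced for illustration; the regular expression itself is the one used in the method above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountExtractor {
    // Same shape as the article's patterns: optional whitespace, an optional
    // "-- " placeholder, optional minus signs, comma-separated digit groups,
    // and a mandatory decimal point.
    private static final String AMOUNT = "(\\s*)(--\\s)?((-*)(\\d*,)*(\\d*\\.\\d*))";

    private AmountExtractor() {}

    /** Returns the first amount following the keyword on the line, or null if there is none. */
    static String extract(String keyword, String line) {
        // The keyword is group 1, so the amount is group 4, as in the original code
        Pattern p = Pattern.compile("\\s*(" + Pattern.quote(keyword) + ")" + AMOUNT);
        Matcher m = p.matcher(line);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("創新", "創新 1,234,567.89"));  // prints "1,234,567.89"
        System.out.println(extract("政府補助", "政府補助 -- 12.50")); // prints "12.50"
        System.out.println(extract("優惠", "稅收優惠 88.00"));       // prints "88.00"
        System.out.println(extract("創新", "創新 100"));            // prints "null"
    }
}
```

Two things stand out. Because `find()` searches anywhere in the line, the keyword "優惠" also matches inside "稅收優惠", so the more specific keyword has to be tested first. And because the pattern requires a decimal point, whole-number amounts such as `100` are not captured.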

Finally, add a main method to batch-process the PDF files in the directory.

public static void main(String[] args) {
        // Convert the pdf files in the current directory to txt files
        File f = new File(System.getProperty("user.dir")); // the current directory
        File[] files = f.listFiles();
        for(File pdf : files) {
            // Process pdf files only
            if(pdf.getName().matches(".*\\.pdf$")) {
                PDFtoTXT(pdf);
            }
        }

        files = f.listFiles(); // list the directory again to pick up the new txt files
        initExcel();
        for(File txt : files) {
            // Process txt files only
            if(txt.getName().matches(".*\\.txt$")) {
                TXTtoEXCEL(txt);
            }
        }

        try {
            // Write the in-memory workbook out to the file
            wwb.write();
            // Close the workbook and release its resources
            wwb.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (WriteException e) {
            e.printStackTrace();
        }


//      Delete the generated txt files
        for(File file : files) {
            if(file.getName().matches(".*\\.pdf$")) {
                new File(file.getName().split("\\.")[0] + ".txt").delete();
            }
        }

    }
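The `main` method selects files with `matches(".*\\.pdf$")`, which is case-sensitive, and it assumes `listFiles()` never returns null. The sketch below shows one possible way to collect the files more defensively. The class `PdfLister` and its methods are hypothetical, and matching the extension case-insensitively is an assumption about what is wanted, not something the original program does.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class PdfLister {
    private PdfLister() {}

    /** True if the name ends with ".<extension>", ignoring case. */
    static boolean hasExtension(String name, String extension) {
        String suffix = "." + extension.toLowerCase(Locale.ROOT);
        return name.toLowerCase(Locale.ROOT).endsWith(suffix);
    }

    /** Lists the regular files in dir with the given extension, sorted by path. */
    static List<File> listByExtension(File dir, String extension) {
        File[] found = dir.listFiles(
                (d, name) -> hasExtension(name, extension) && new File(d, name).isFile());
        // listFiles returns null when dir is not a directory or cannot be read
        if (found == null) {
            return new ArrayList<>();
        }
        Arrays.sort(found);
        return new ArrayList<>(Arrays.asList(found));
    }

    public static void main(String[] args) {
        File current = new File(System.getProperty("user.dir"));
        for (File pdf : listByExtension(current, "pdf")) {
            System.out.println(pdf.getName());
        }
    }
}
```

Sorting the result keeps the row order in the spreadsheet stable between runs, since `listFiles` makes no ordering guarantee.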

ForTheDreamSMS
