Working with Office and PDF files in Java: reading the contents of Word, Excel, and PDF documents

  1. Use the POI package to read Word documents
    poi.jar download links
    http://apache.freelamp.com/poi/release/bin/poi-bin-3.6-20091214.zip
    http://apache.etoak.com/poi/release/bin/poi-bin-3.6-20091214.zip
    http://labs.renren.com/apache-mirror/poi/release/bin/poi-bin-3.6-20091214.zip

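2. Use the jxl package to read Excel documents
Jxl.jar download link (note: this archive is CSharpJExcel, the C# port of JExcelAPI; for Java, use the net.sourceforge.jexcelapi:jxl artifact listed in the pom below)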
http://nchc.dl.sourceforge.net/project/jexcelapi/CSharpJExcel/CSharpJExcel.zip

3. Use PDFBox to read PDF documents
Pdfbox.jar download links
http://labs.renren.com/apache-mirror/pdfbox/1.1.0/pdfbox-1.1.0.jar
http://apache.etoak.com/pdfbox/1.1.0/pdfbox-1.1.0.jar
http://apache.freelamp.com/pdfbox/1.1.0/pdfbox-1.1.0.jar
Fontbox.jar download links
http://apache.etoak.com/pdfbox/1.1.0/fontbox-1.1.0.jar
http://labs.renren.com/apache-mirror/pdfbox/1.1.0/fontbox-1.1.0.jar
http://apache.freelamp.com/pdfbox/1.1.0/fontbox-1.1.0.jar
Jempbox.jar download links
http://labs.renren.com/apache-mirror/pdfbox/1.1.0/jempbox-1.1.0.jar
http://apache.etoak.com/pdfbox/1.1.0/jempbox-1.1.0.jar
http://apache.freelamp.com/pdfbox/1.1.0/jempbox-1.1.0.jar
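Below are some simple examples of using these jars to read documents.
The pom coordinates are as follows (note that they pin newer versions, POI 3.13 and PDFBox 1.8.10, than the download links above):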

.....
    <properties>
        <poi.version>3.13</poi.version>
        <pdf.version>1.8.10</pdf.version>
    </properties>
....
<!-- The three POI jars must all use the same version -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <!-- pdf -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>${pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>fontbox</artifactId>
            <version>${pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>jempbox</artifactId>
            <version>${pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.jexcelapi</groupId>
            <artifactId>jxl</artifactId>
            <version>2.6.12</version>
        </dependency>

Code for reading the content

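    // Needs java.io.File, java.io.IOException, org.apache.pdfbox.pdmodel.PDDocument
    // and org.apache.pdfbox.util.PDFTextStripper (PDFBox 1.8.x package names)
    public static String getPDFContent(File file) {
        String context = null;
        PDDocument pdd = null;
        try {
            // PDDocument.load parses the file, so no separate PDFParser is needed
            pdd = PDDocument.load(file);
            PDFTextStripper ts = new PDFTextStripper();
            context = ts.getText(pdd);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // pdd is still null if loading failed, so guard before closing
            if (pdd != null) {
                try {
                    pdd.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return context == null ? "" : context;
    }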

    // Needs the jxl jar; jxl reads the legacy .xls format only, not .xlsx
    public static String getXLSContent(File file) {
        StringBuilder sb = new StringBuilder();
        jxl.Workbook rwb = null;
        try {
            rwb = jxl.Workbook.getWorkbook(file);
            for (jxl.Sheet rs : rwb.getSheets()) {
                for (int j = 0; j < rs.getRows(); j++) {
                    // Cell contents are appended back to back, with no separator
                    for (jxl.Cell cell : rs.getRow(j)) {
                        sb.append(cell.getContents());
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // rwb is still null if the workbook could not be opened
            if (rwb != null) {
                rwb.close();
            }
        }
        return sb.toString();
    }
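
Because getXLSContent appends cell contents back to back, the cells "ab", "c" and "a", "bc" produce the same text. If cell boundaries matter, one option is to join cells with tabs and rows with newlines. The sketch below shows only the joining step on plain string arrays; the class name RowJoiner and the choice of separators are assumptions for illustration, not part of the original code.

```java
import java.util.StringJoiner;

/** Hypothetical helper: joins rows of cell text so that cell boundaries survive. */
final class RowJoiner {

    /** Cells within a row are separated by a tab, rows by a newline. */
    static String join(String[][] rows) {
        StringJoiner lines = new StringJoiner("\n");
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        return lines.toString();
    }
}
```

Inside getXLSContent, the same effect could be had by appending a tab after each cell and a newline after each row.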

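    // Legacy .doc files; WordExtractor is org.apache.poi.hwpf.extractor.WordExtractor (poi-scratchpad)
    public static String getWord2003(File file) {
        String word2003 = null;
        InputStream f = null;
        try {
            f = new FileInputStream(file);
            WordExtractor ex = new WordExtractor(f);
            word2003 = ex.getText();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // f is still null if the file could not be opened
            if (f != null) {
                try {
                    f.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return word2003;
    }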

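    // .docx files (Word 2007 and later); needs poi-ooxml
    public static String getWord2007(File file) {
        String word2007 = null;
        OPCPackage opcPackage = null;
        try {
            // Open the document itself with getPath(); getParent() is its directory
            opcPackage = POIXMLDocument.openPackage(file.getPath());
            POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
            word2007 = extractor.getText();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (opcPackage != null) {
                opcPackage.revert(); // close the package without writing anything back
            }
        }
        return word2007;
    }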
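
With four separate methods, the caller still has to decide which one to use for a given file. One simple way, sketched below, is to look the extractor up by file extension. The class ExtractorDispatch and its method names are hypothetical, and the sketch assumes Java 8 for java.util.function.Function; it does not call POI, jxl, or PDFBox itself, so the extractors are passed in from outside.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: picks a text extractor based on the file extension. */
final class ExtractorDispatch {

    private final Map<String, Function<File, String>> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "pdf" (case-insensitive). */
    void register(String extension, Function<File, String> extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased text after the last dot, or "" if there is no dot. */
    static String extensionOf(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Runs the matching extractor; unknown extensions and null results yield "". */
    String extract(File file) {
        Function<File, String> extractor = byExtension.get(extensionOf(file));
        if (extractor == null) {
            return "";
        }
        String text = extractor.apply(file);
        return text == null ? "" : text;
    }
}
```

If the four methods lived in a class named, say, DocReader (a placeholder name), registration would look like register("pdf", DocReader::getPDFContent), and likewise "xls", "doc", and "docx" for the other three. Extensions can be wrong or missing, so this is only a first-pass choice.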

I'll also include a method for getting a file's MIME type.

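    // Magic and MagicMatch come from jMimeMagic (net.sf.jmimemagic), which is not in the pom above
    public static String getFileMime(File file) {
        if (!file.exists()) {
            System.err.println("File does not exist");
            return null;
        }
        MagicMatch match = null;
        try {
            match = Magic.getMagicMatch(file, false);
        } catch (Exception e) {
            e.printStackTrace();
        }
        // match is still null when detection failed
        return match == null ? null : match.getMimeType();
    }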
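
Where adding another library is not an option, a rough alternative is to check the first few bytes of the file against well-known signatures. The sketch below is not how jMimeMagic works internally; it only illustrates the idea with three signatures. The class SignatureSniffer and the labels it returns are assumptions for illustration. Note that "ole2" covers both .doc and .xls, and "zip" covers any ZIP archive, not just .docx or .xlsx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses a container type from a file's leading bytes. */
final class SignatureSniffer {

    // "%PDF" in ASCII
    private static final byte[] PDF = {0x25, 0x50, 0x44, 0x46};
    // OLE2 compound file header, used by legacy .doc and .xls
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header "PK\003\004", used by .docx and .xlsx
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "pdf", "ole2", "zip", or "unknown" for the given leading bytes. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String sniffFile(Path path) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) break; // end of file before 8 bytes were read
                n += r;
            }
        }
        return sniff(Arrays.copyOf(buf, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A call such as SignatureSniffer.sniffFile(file.toPath()) could then serve as a sanity check before handing the file to one of the extractors above.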

All rights reserved. Please credit the source when reposting.
