Extracting Text and the Author Property from Word, Excel, PPT, and txt Files in Java with POI

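This was my first task as an intern at a new company. After reading a few blog posts I came across POI, which gives Java an API for reading and writing Microsoft Office files.
The jar packages can be downloaded from the Apache site: http://poi.apache.org/download.html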

API documentation: http://poi.apache.org/components/index.html

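1. Create a plain Maven project

POI consists of quite a few jars, so I import them through the Maven repository. Start by creating a plain Maven project.

Then click Next and give the project a name.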

2. Add the POI dependencies to pom.xml

Add the following inside the <dependencies> element:

<dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>3.17</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
      <version>3.17</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
      <version>3.17</version>
    </dependency>

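3. Extracting Word text and author

At first I only looked at code from other people's blogs, but much of it did not match what I needed, was incomplete, or relied on different imports. Searching took a lot of time without good results, so I turned to the examples on the official site instead:
http://poi.apache.org/components/document/quick-guide.html

HWPF handles .doc files and XWPF handles .docx files. Excel and PPT follow the same pattern.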
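
To make the naming pattern concrete, here is a small JDK-only sketch that maps a file extension to the POI component that reads it. The enum OfficeFormat and its method fromPath are names I made up for illustration; they are not part of POI. Unlike the utility classes in this post, the lookup ignores case.

```java
import java.util.Locale;

/** Hypothetical helper: maps a file extension to the POI component that reads it. */
enum OfficeFormat {
    HWPF(".doc"), XWPF(".docx"),   // Word: binary format vs. OOXML
    HSSF(".xls"), XSSF(".xlsx"),   // Excel
    HSLF(".ppt"), XSLF(".pptx");   // PowerPoint

    private final String extension;

    OfficeFormat(String extension) {
        this.extension = extension;
    }

    /** Returns the matching format, or null when the extension is not an Office one. */
    static OfficeFormat fromPath(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        for (OfficeFormat format : values()) {
            if (lower.endsWith(format.extension)) {
                return format;
            }
        }
        return null;
    }
}

class OfficeFormatDemo {
    public static void main(String[] args) {
        System.out.println(OfficeFormat.fromPath("E:\\test\\test1.doc"));   // HWPF
        System.out.println(OfficeFormat.fromPath("E:\\test\\TEST1.XLSX"));  // XSSF
        System.out.println(OfficeFormat.fromPath("E:\\test\\test1.txt"));   // null
    }
}
```

The H-prefixed components read the older binary formats and the X-prefixed ones read the XML-based formats.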

import com.google.common.base.CharMatcher;
import com.google.common.collect.Lists;
import com.google.gson.Gson;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class WordUtil {
    public static String readWordFile(String path) {
        List<String> contextList = Lists.newArrayList();
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".doc")) {
                HWPFDocument document = new HWPFDocument(inputStream);
                System.out.println("Author: " + document.getSummaryInformation().getAuthor());
                WordExtractor extractor = new WordExtractor(document);
                String[] contextArray = extractor.getParagraphText();
                Arrays.asList(contextArray).forEach(context -> contextList.add(CharMatcher.whitespace().removeFrom(context)));
                extractor.close();
                document.close();
            } else if (path.endsWith(".docx")) {
                XWPFDocument document = new XWPFDocument(inputStream);
                System.out.println("Author: " + document.getProperties().getCoreProperties().getCreator());
                List<XWPFParagraph> paragraphList = document.getParagraphs();
                paragraphList.forEach(paragraph -> contextList.add(CharMatcher.whitespace().removeFrom(paragraph.getParagraphText())));
                document.close();
            } else {
                //LOGGER.debug("{} is not a Word file", path);
                return "Not a Word file: " + path;
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the Word file");
                System.out.println("Failed to read the Word file");
            }
        }
        return new Gson().toJson(contextList);
    }
}

The code uses Google's Guava utilities for collection and string handling, which requires adding the Guava dependency to pom.xml:

<dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>24.0-jre</version>
    </dependency>

Google's Gson serializes the collections into a JSON string. It needs a dependency as well:

<dependency>
          <groupId>com.google.code.gson</groupId>
          <artifactId>gson</artifactId>
          <version>2.2.4</version>
      </dependency>

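As the code shows, getSummaryInformation() returns the summary information of a .doc document, and the author is read from that. For .docx the equivalent call chain is getProperties().getCoreProperties().getCreator().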

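At first I did not know how to get the author through POI and assumed the JDK's own methods could read file properties. It turned out they only return basic file system attributes, for example: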

Path testPath = Paths.get("E:\\test\\test1.xls");
// The file system owner (an OS account), not the author stored inside the document
FileOwnerAttributeView ownerView = Files.getFileAttributeView(testPath, FileOwnerAttributeView.class);
System.out.println("File owner: " + ownerView.getOwner());

// Size and timestamps only
BasicFileAttributes attrs = Files.readAttributes(testPath, BasicFileAttributes.class);

These classes and methods come from Oracle's JDK documentation on file I/O:
https://docs.oracle.com/javase/tutorial/essential/io/fileio.html
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#getAttribute-java.nio.file.Path-java.lang.String-java.nio.file.LinkOption…-
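
For completeness, here is a self-contained sketch of the JDK-only approach, assuming a temporary file stands in for the Office document. The class FileAttributeDemo and its methods are illustrative names of mine. It shows what the file system can report (size, timestamps, OS owner), none of which is the author stored inside the document.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileOwnerAttributeView;

class FileAttributeDemo {
    /** Describes what the file system knows about a file: size, timestamps, OS owner. */
    static String describe(Path path) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
            StringBuilder sb = new StringBuilder();
            sb.append("size=").append(attrs.size())
              .append(", created=").append(attrs.creationTime())
              .append(", modified=").append(attrs.lastModifiedTime());
            // The owner view can be unavailable on some file systems, so check for null.
            FileOwnerAttributeView ownerView =
                    Files.getFileAttributeView(path, FileOwnerAttributeView.class);
            if (ownerView != null) {
                sb.append(", owner=").append(ownerView.getOwner().getName());
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a 5-byte temporary file, describes it, and deletes it again. */
    static String demo() {
        try {
            Path temp = Files.createTempFile("attr-demo", ".txt");
            try {
                Files.write(temp, "hello".getBytes(StandardCharsets.UTF_8));
                return describe(temp);
            } finally {
                Files.delete(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```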

I then found Spire, another tool that can work with Office files much like POI:
https://www.e-iceblue.com/Download/doc-for-java-free.html

First add the dependency:

<repositories>
        <repository>
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
    
    <dependencies>
    <dependency>
          <groupId>e-iceblue</groupId>
          <artifactId>spire.doc.free</artifactId>
          <version>2.0.0</version>
      </dependency>
  </dependencies>

Reading the document properties:

Document doc = new Document("E:\\test\\test1.doc");
// Read the built-in document properties
System.out.println("Author: " + doc.getBuiltinDocumentProperties().getAuthor());

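However, I could not find a Java XLS package, and Maven kept failing to resolve the Presentation package when I added it as a dependency, so I gave up on Spire. Since POI also works with Office formats, I figured it must offer a similar way to read file properties, which led me to the code shown above. The other Office types work similarly.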

4. Extracting Excel text and author

I came across an excellent and very detailed blog post on working with Excel through POI: https://www.cnblogs.com/huajiezh/p/5467821.html

import com.google.common.collect.Lists;
import com.google.gson.Gson;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class ExcelUtil {
    public static String readExcelFile(String path){
        List<List<String>> rowlist = Lists.newArrayList();
        InputStream inputStream = null;
        String str = "";
        Workbook workbook = null;
        try {
            // Open an input stream on the file
            inputStream = new FileInputStream(new File(path));
            // Pick the workbook implementation that matches the extension
            if (path.endsWith(".xls")) {
                workbook = new HSSFWorkbook(inputStream);
                System.out.println("Author: " + ((HSSFWorkbook) workbook).getSummaryInformation().getAuthor());
            } else if (path.endsWith(".xlsx")) {
                workbook = new XSSFWorkbook(inputStream);
                System.out.println("Author: " + ((XSSFWorkbook) workbook).getProperties().getCoreProperties().getCreator());
            } else {
                //LOGGER.debug("{} is not an Excel file", path);
                return "Not an Excel file: " + path;
            }
            // Walk every sheet in the workbook
            for (Sheet sheet : workbook ) {
                for (Row row : sheet) {
                    // Skip the first row (the header)
                    if (row.getRowNum() == 0) {
                        continue;
                    }
                    List<String> cellList = Lists.newArrayList();
                    for (Cell cell : row) {
                        switch (cell.getCellTypeEnum()) {
                            case STRING:
                                cellList.add(cell.getRichStringCellValue().getString());
                                break;
                            case NUMERIC:
                                if (DateUtil.isCellDateFormatted(cell)) {
                                    cellList.add(""+cell.getDateCellValue());
                                } else {
                                    cellList.add(""+cell.getNumericCellValue());
                                }
                                break;
                            case BOOLEAN:
                                cellList.add(""+cell.getBooleanCellValue());
                                break;
                            case FORMULA:
                                cellList.add(cell.getCellFormula());
                                break;
                            case BLANK:
                                cellList.add("");
                                break;
                            default:
                                cellList.add("");
                        }
                    }
                    if (cellList.size() > 0)
                        rowlist.add(cellList);
                }
            }
            Gson gson = new Gson();
            str = gson.toJson(rowlist);
            // Close the workbook
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the Excel file");
                System.out.println("Failed to read the Excel file");
            }
        }
        return str;
    }
}
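
Because ExcelUtil turns numbers and dates into strings by concatenation, whole numbers come out as 19.0 and dates use the Date.toString() form. Below is a JDK-only sketch of two formatting helpers that could be called in the NUMERIC branch instead. The class CellText and both method names are my own illustration, not POI API. POI itself also provides org.apache.poi.ss.usermodel.DataFormatter, whose formatCellValue(cell) returns the text the way Excel displays it, which may be the simpler route.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class CellText {
    /** Renders 19.0 as "19" but keeps real fractions such as 19.5 unchanged. */
    static String formatNumber(double value) {
        // Only values that are whole and safely inside the long range are shortened.
        if (value == Math.rint(value) && Math.abs(value) < 1e15) {
            return String.valueOf((long) value);
        }
        return String.valueOf(value);
    }

    /** Renders a date as yyyy-MM-dd in the JVM's default time zone. */
    static String formatDate(Date date) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        return new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).format(date);
    }

    public static void main(String[] args) {
        Date birthday = new GregorianCalendar(1999, Calendar.NOVEMBER, 8).getTime();
        System.out.println(formatNumber(19.0));   // 19
        System.out.println(formatNumber(19.5));   // 19.5
        System.out.println(formatDate(birthday)); // 1999-11-08
    }
}
```

In ExcelUtil this would mean replacing ""+cell.getNumericCellValue() with CellText.formatNumber(cell.getNumericCellValue()), and likewise for the date branch.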

5. Extracting PPT text and author

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import com.google.common.collect.Lists;
import com.google.gson.Gson;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.sl.usermodel.Shape;
import org.apache.poi.sl.usermodel.Slide;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.usermodel.TextShape;
import org.apache.poi.xslf.usermodel.XMLSlideShow;


public class PPtUtil {
    // Extracts the text of every text shape on every slide
    public static String readPPtFile(String path) {
        List<String> textList = Lists.newArrayList();
        InputStream inputStream = null;
        SlideShow ppt = null;
        try {
            // Open an input stream on the file
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".ppt")) {
                ppt = new HSLFSlideShow(inputStream);
                System.out.println("Author: " + ((HSLFSlideShow) ppt).getSlideShowImpl().getSummaryInformation().getAuthor());
            } else if (path.endsWith(".pptx")) {
                ppt = new XMLSlideShow(inputStream);
                System.out.println("Author: " + ((XMLSlideShow) ppt).getProperties().getCoreProperties().getCreator());
            } else {
                //LOGGER.debug("{} is not a PPT file", path);
                return "Not a PPT file: " + path;
            }
            // Get the slides
            List<Slide> slides = ppt.getSlides();

            for (Slide slide : slides) {
                List<Shape> shapes = slide.getShapes();
                for (Shape sh : shapes) {
                    // Only text shapes carry text
                    if (sh instanceof TextShape) {
                        TextShape shape = (TextShape) sh;
                        textList.add(shape.getText());
                    }
                }
            }
            // Release the resources held by the slide show
            ppt.close();
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the PPT file");
                System.out.println("Failed to read the PPT file");
            }
        }
        return new Gson().toJson(textList);
    }
}

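6. Extracting txt text

A plain .txt file has no embedded document properties, so there is no author to read. Only the text is extracted.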

import com.google.common.collect.Lists;
import com.google.gson.Gson;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

public class TxtUtil {
    public static String readTxtFile(String path) {
        List<String> txtList = Lists.newArrayList();
        FileReader fileReader = null;
        BufferedReader bufferedReader = null;
        try {
            if (path.endsWith(".txt")) {
                fileReader = new FileReader(path);
                bufferedReader = new BufferedReader(fileReader);
                String s = "";
                while ((s = bufferedReader.readLine()) != null) {
                    txtList.add(s);
                }
            } else {
                return "Not a txt file: " + path;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            // Closing the BufferedReader also closes the FileReader it wraps
            if (bufferedReader != null) try {
                bufferedReader.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the txt file");
                System.out.println("Failed to read the txt file");
            }
        }
        return new Gson().toJson(txtList);
    }
}
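
FileReader decodes with the platform's default charset, and the reader has to be closed by hand in a finally block. As an alternative, here is a minimal sketch using try-with-resources and an explicit charset. It assumes the file is UTF-8, and the class TxtLines with its two methods is my own naming for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TxtLines {
    /** Reads every line of a UTF-8 text file; the reader is closed automatically. */
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Writes the given lines to a temporary file and reads them back. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path temp = Files.createTempFile("txt-demo", ".txt");
            try {
                Files.write(temp, lines, StandardCharsets.UTF_8);
                return readLines(temp);
            } finally {
                Files.delete(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("txt content", "second line", "third line")));
    }
}
```

The returned list can be passed to new Gson().toJson(...) exactly as in TxtUtil.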

7. Test class

import poiUtils.ExcelUtil;
import poiUtils.PPtUtil;
import poiUtils.TxtUtil;
import poiUtils.WordUtil;

public class App 
{
    public static void main(String[] args ){
        // The utility methods are static, so they are called on the class directly
        System.out.println("E:\\test\\test1.doc: " + WordUtil.readWordFile("E:\\test\\test1.doc"));
        System.out.println("E:\\test\\test1.docx: " + WordUtil.readWordFile("E:\\test\\test1.docx"));
        System.out.println("E:\\test\\test1.xls: " + ExcelUtil.readExcelFile("E:\\test\\test1.xls"));
        System.out.println("E:\\test\\test1.xlsx: " + ExcelUtil.readExcelFile("E:\\test\\test1.xlsx"));
        System.out.println("E:\\test\\test1.ppt: " + PPtUtil.readPPtFile("E:\\test\\test1.ppt"));
        System.out.println("E:\\test\\test1.pptx: " + PPtUtil.readPPtFile("E:\\test\\test1.pptx"));
        System.out.println("E:\\test\\test1.txt: " + TxtUtil.readTxtFile("E:\\test\\test1.txt"));
    }
}

Output:

Author: ying
E:\test\test1.doc: ["文檔內容","第一段","第二段"]
Author: ying
E:\test\test1.docx: ["文檔內容","第一段","第二段"]
Author: ying
E:\test\test1.xls: [["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
Author: ying
E:\test\test1.xlsx: [["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
Author: ying
E:\test\test1.ppt: ["這個標題","這個內容"]
Author: ying
E:\test\test1.pptx: ["這個標題","這個內容"]
E:\test\test1.txt: ["txt內容","第二行","第三行"]
