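Extracting text and the author property from Word, Excel, PPT, and txt files in Java with POI
This was my first task as an intern at a new company. After reading a few blog posts I came across POI, which provides a Java API for reading and writing Microsoft Office files.
The jars can be downloaded from the Apache site: http://poi.apache.org/download.html
API documentation: http://poi.apache.org/components/index.html
1. Create a plain Maven project
POI consists of quite a few jars, so I pull them in from the Maven repository. First create a plain Maven project,
then click Next and give the project a name.
2. Add the POI dependencies to pom.xml
Add the following inside the <dependencies> element: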
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
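3. Extracting Word text and author
At first I only looked at code from other people's blogs, but much of it did not match what I needed, was incomplete, or assumed different imports and environments. Searching took a lot of time with poor results, so I tried working directly from the examples on the official site:
http://poi.apache.org/components/document/quick-guide.html
HWPF handles .doc files and XWPF handles .docx files; Excel and PPT follow the same pattern (HSSF/XSSF and HSLF/XSLF).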
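The extension-to-component mapping can be written down as a small lookup. This is only a sketch using the JDK: `PoiFormat` and `componentFor` are hypothetical names, not part of POI, and lowercasing the path first is my own assumption so that `report.DOC` is also recognised (the utility classes in this post compare extensions case-sensitively).

```java
import java.util.Locale;

/** Hypothetical helper: names the POI component family that reads a given file extension. */
class PoiFormat {

    /** Returns the POI component name for the path, or "unsupported" for any other extension. */
    static String componentFor(String path) {
        // Lowercase first so that upper-case extensions such as ".DOC" are matched too
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc"))  return "HWPF"; // binary Word
        if (lower.endsWith(".docx")) return "XWPF"; // OOXML Word
        if (lower.endsWith(".xls"))  return "HSSF"; // binary Excel
        if (lower.endsWith(".xlsx")) return "XSSF"; // OOXML Excel
        if (lower.endsWith(".ppt"))  return "HSLF"; // binary PowerPoint
        if (lower.endsWith(".pptx")) return "XSLF"; // OOXML PowerPoint
        return "unsupported";
    }

    public static void main(String[] args) {
        for (String p : new String[] {"report.DOC", "report.docx", "data.xlsx", "notes.txt"}) {
            System.out.println(p + " -> " + componentFor(p));
        }
    }
}
```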
import com.google.common.base.CharMatcher;
import com.google.common.collect.Lists;
import com.google.gson.Gson;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
public class WordUtil {
    public static String readWordFile(String path) {
        List<String> contextList = Lists.newArrayList();
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".doc")) {
                // Binary .doc: the author is stored in the summary information
                HWPFDocument document = new HWPFDocument(inputStream);
                System.out.println("作者:" + document.getSummaryInformation().getAuthor()); // 作者 = author
                WordExtractor extractor = new WordExtractor(document);
                String[] contextArray = extractor.getParagraphText();
                // Remove every whitespace character from each paragraph
                Arrays.asList(contextArray).forEach(context -> contextList.add(CharMatcher.whitespace().removeFrom(context)));
                extractor.close();
                document.close();
            } else if (path.endsWith(".docx")) {
                // OOXML .docx: the author is the "creator" core property
                XWPFDocument document = new XWPFDocument(inputStream);
                System.out.println("作者:" + document.getProperties().getCoreProperties().getCreator());
                List<XWPFParagraph> paragraphList = document.getParagraphs();
                paragraphList.forEach(paragraph -> contextList.add(CharMatcher.whitespace().removeFrom(paragraph.getParagraphText())));
                document.close();
            } else {
                //LOGGER.debug("File {} is not a Word file", path);
                return "此文件不是Word文件" + path; // "This file is not a Word file"
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the Word file");
                System.out.println("讀取Word文件失敗"); // "Failed to read the Word file"
            }
        }
        // Serialize the paragraph list to a JSON array string
        return new Gson().toJson(contextList);
    }
}
The code uses Google's Guava utilities for the collection and string handling, so its dependency has to be added to pom.xml. Declare the artifact only once; 24.0-jre is used here.
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>24.0-jre</version>
</dependency>
Google's Gson serializes the collection to a JSON string; it needs a dependency as well.
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.2.4</version>
</dependency>
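One thing to keep in mind about `CharMatcher.whitespace().removeFrom(...)` in WordUtil: it deletes every whitespace character, including the spaces between words. That is harmless for the Chinese test documents, but English text would be glued together. The sketch below uses only JDK regular expressions to contrast the two behaviours; `WhitespaceDemo`, `removeAll`, and `collapse` are hypothetical names, and the regex is only an approximation of what Guava's matcher treats as whitespace.

```java
/** Hypothetical demo: removing all whitespace versus collapsing it to single spaces. */
class WhitespaceDemo {

    /** Approximates the removeFrom behaviour: every whitespace character is dropped. */
    static String removeAll(String text) {
        // (?U) makes \s match Unicode whitespace, not just ASCII
        return text.replaceAll("(?U)\\s+", "");
    }

    /** Alternative: collapse each whitespace run to one space and trim both ends. */
    static String collapse(String text) {
        return text.replaceAll("(?U)\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String paragraph = "  Hello \t World\r\n";
        System.out.println("[" + removeAll(paragraph) + "]");
        System.out.println("[" + collapse(paragraph) + "]");
    }
}
```

If word boundaries matter, Guava also offers `trimAndCollapseFrom` on the same matcher, which could replace `removeFrom` in the utility class.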
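As the code shows, for .doc files getSummaryInformation() returns the document's summary information, from which the author is read; for .docx files the equivalent is getProperties().getCoreProperties().getCreator().
At first I did not know how to get the author with POI and assumed the JDK's own file APIs would be enough, but they only expose basic file-system attributes, for example: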
Path testPath = Paths.get("E:\\test\\test1.xls");
FileOwnerAttributeView ownerView = Files.getFileAttributeView(testPath, FileOwnerAttributeView.class);
System.out.println("文件所有者:" + ownerView.getOwner());
BasicFileAttributes attrs = Files.readAttributes(testPath, BasicFileAttributes.class);
These classes and methods come from Oracle's JDK documentation on file I/O:
https://docs.oracle.com/javase/tutorial/essential/io/fileio.html
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#getAttribute-java.nio.file.Path-java.lang.String-java.nio.file.LinkOption…-
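To make the limitation concrete, here is a small self-contained sketch that reads the JDK-level attributes of a temporary file. The class and method names are hypothetical. Everything it prints comes from the file system (size, timestamps, operating-system owner), none of it from inside an Office document, which is why the author property cannot be obtained this way. `Files.getOwner` may not be supported on every file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical demo of what java.nio.file can report about a file. */
class FileAttributeDemo {

    /** Summarises basic file-system attributes; none of them are document metadata. */
    static String describe(Path path) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return "size=" + attrs.size()
                + ", regularFile=" + attrs.isRegularFile()
                + ", lastModified=" + attrs.lastModifiedTime();
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("attr-demo", ".txt");
        try {
            Files.write(temp, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(describe(temp));
            // The owner is the operating-system account, not the author stored in the document
            System.out.println("owner=" + Files.getOwner(temp));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```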
Then I found Spire, a tool similar to POI for working with Office files:
https://www.e-iceblue.com/Download/doc-for-java-free.html
First add the repository and the dependency:
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>2.0.0</version>
    </dependency>
</dependencies>
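Reading the document properties:
Document doc = new Document("E:\\test\\test1.doc");
// Read the built-in document properties
System.out.println("作者: " + doc.getBuiltinDocumentProperties().getAuthor());
However, I could not find a Java XLS package, and Maven never managed to resolve the Presentation artifact when I declared it as a dependency, so I gave up on Spire. Since POI also handles Office formats, it had to offer a similar way to read file properties, which led me to the calls used in the code above. The other Office formats work similarly.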
4. Extracting Excel text and author
A very detailed blog post on POI Excel operations: https://www.cnblogs.com/huajiezh/p/5467821.html
import com.google.common.collect.Lists;
import com.google.gson.Gson;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
public class ExcelUtil {
    public static String readExcelFile(String path) {
        List<List<String>> rowlist = Lists.newArrayList();
        InputStream inputStream = null;
        String str = "";
        Workbook workbook = null;
        try {
            // Open the file input stream
            inputStream = new FileInputStream(new File(path));
            // Create the workbook object that matches the extension
            if (path.endsWith(".xls")) {
                workbook = new HSSFWorkbook(inputStream);
                System.out.println("作者:" + ((HSSFWorkbook) workbook).getSummaryInformation().getAuthor()); // 作者 = author
            } else if (path.endsWith(".xlsx")) {
                workbook = new XSSFWorkbook(inputStream);
                System.out.println("作者:" + ((XSSFWorkbook) workbook).getProperties().getCoreProperties().getCreator());
            } else {
                //LOGGER.debug("File {} is not an Excel file", path);
                return "此文件不是Excel文件" + path; // "This file is not an Excel file"
            }
            // Iterate over every sheet in the workbook
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {
                    // Skip the first row (the header)
                    if (row.getRowNum() == 0) {
                        continue;
                    }
                    List<String> cellList = Lists.newArrayList();
                    for (Cell cell : row) {
                        // Convert each cell to a string according to its type
                        switch (cell.getCellTypeEnum()) {
                            case STRING:
                                cellList.add(cell.getRichStringCellValue().getString());
                                break;
                            case NUMERIC:
                                // Dates are stored as numbers, so check the cell format first
                                if (DateUtil.isCellDateFormatted(cell)) {
                                    cellList.add("" + cell.getDateCellValue());
                                } else {
                                    cellList.add("" + cell.getNumericCellValue());
                                }
                                break;
                            case BOOLEAN:
                                cellList.add("" + cell.getBooleanCellValue());
                                break;
                            case FORMULA:
                                // Adds the formula text, not its evaluated result
                                cellList.add(cell.getCellFormula());
                                break;
                            case BLANK:
                                cellList.add("");
                                break;
                            default:
                                cellList.add("");
                        }
                    }
                    if (cellList.size() > 0)
                        rowlist.add(cellList);
                }
            }
            Gson gson = new Gson();
            str = gson.toJson(rowlist);
            // Close the workbook
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the Excel file");
                System.out.println("讀取Excel文件失敗"); // "Failed to read the Excel file"
            }
        }
        return str;
    }
}
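Because ExcelUtil builds its strings with `"" + cell.getNumericCellValue()` and `"" + cell.getDateCellValue()`, a whole number such as 19 comes out as "19.0" and a date comes out in the `Date.toString()` form. If cleaner text is wanted, the two conversions could be swapped for helpers like the ones below. This is a JDK-only sketch: `CellText`, `formatNumber`, and `formatDate` are hypothetical names, and using the system default time zone for dates is an assumption.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Date;

/** Hypothetical helpers for turning numeric and date cell values into tidier strings. */
class CellText {

    /** Renders a finite double without a trailing ".0", e.g. 19.0 becomes "19". */
    static String formatNumber(double value) {
        // BigDecimal.valueOf rejects NaN and infinities, which numeric cells should not contain
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    /** Renders a java.util.Date as an ISO local date (yyyy-MM-dd) in the system time zone. */
    static String formatDate(Date value) {
        return value.toInstant().atZone(ZoneId.systemDefault()).toLocalDate().toString();
    }

    public static void main(String[] args) {
        Date birthday = Date.from(
                LocalDate.of(1999, 11, 8).atStartOfDay(ZoneId.systemDefault()).toInstant());
        System.out.println(formatNumber(19.0));
        System.out.println(formatNumber(2.5));
        System.out.println(formatDate(birthday));
    }
}
```

In ExcelUtil these would be called as `formatNumber(cell.getNumericCellValue())` and `formatDate(cell.getDateCellValue())` inside the NUMERIC branch.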
5. Extracting PPT text and author
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import com.google.common.collect.Lists;
import com.google.gson.Gson;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.sl.usermodel.Shape;
import org.apache.poi.sl.usermodel.Slide;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.usermodel.TextShape;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
public class PPtUtil {
    // Extract all text from the slides
    public static String readPPtFile(String path) {
        List<String> textList = Lists.newArrayList();
        InputStream inputStream = null;
        SlideShow ppt = null;
        try {
            // Open the file input stream
            inputStream = new FileInputStream(new File(path));
            if (path.endsWith(".ppt")) {
                ppt = new HSLFSlideShow(inputStream);
                System.out.println("作者:" + ((HSLFSlideShow) ppt).getSlideShowImpl().getSummaryInformation().getAuthor()); // 作者 = author
            } else if (path.endsWith(".pptx")) {
                ppt = new XMLSlideShow(inputStream);
                System.out.println("作者:" + ((XMLSlideShow) ppt).getProperties().getCoreProperties().getCreator());
            } else {
                //LOGGER.debug("File {} is not a PPT file", path);
                return "此文件不是PPt文件" + path; // "This file is not a PPT file"
            }
            // Get the slides
            List<Slide> slides = ppt.getSlides();
            for (Slide slide : slides) {
                List<Shape> shapes = slide.getShapes();
                for (Shape sh : shapes) {
                    // Only text shapes carry text
                    if (sh instanceof TextShape) {
                        TextShape shape = (TextShape) sh;
                        textList.add(shape.getText());
                    }
                }
            }
            // Close the slide show
            ppt.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the PPT file");
                System.out.println("讀取PPt文件失敗"); // "Failed to read the PPT file"
            }
        }
        return new Gson().toJson(textList);
    }
}
6. Extracting Txt text
A plain-text file has no embedded author property, so only the text is read.
import com.google.common.collect.Lists;
import com.google.gson.Gson;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
public class TxtUtil {
    public static String readTxtFile(String path) {
        List<String> txtList = Lists.newArrayList();
        FileReader fileReader = null;
        BufferedReader bufferedReader = null;
        try {
            if (path.endsWith(".txt")) {
                // FileReader decodes with the platform default charset
                fileReader = new FileReader(path);
                bufferedReader = new BufferedReader(fileReader);
                String s = "";
                // Read the file line by line
                while ((s = bufferedReader.readLine()) != null) {
                    txtList.add(s);
                }
            } else {
                return "此文件不是Txt文件" + path; // "This file is not a Txt file"
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Closing the BufferedReader also closes the wrapped FileReader
            if (bufferedReader != null) try {
                bufferedReader.close();
            } catch (IOException e) {
                e.printStackTrace();
                //LOGGER.debug("Failed to read the Txt file");
                System.out.println("讀取Txt文件失敗"); // "Failed to read the Txt file"
            }
        }
        return new Gson().toJson(txtList);
    }
}
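As an alternative to the manual finally block, the same line-by-line read can be written with try-with-resources and an explicit charset, which avoids depending on the platform default encoding. A minimal JDK-only sketch follows; `TxtLines` and `readLines` are hypothetical names, and UTF-8 is an assumption about how the text files are encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical variant of the txt reader using try-with-resources. */
class TxtLines {

    /** Reads all lines of a UTF-8 text file; the reader is closed automatically. */
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("txt-demo", ".txt");
        try {
            Files.write(temp, Arrays.asList("first line", "second line"), StandardCharsets.UTF_8);
            System.out.println(readLines(temp));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```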
7. Test class
import poiUtils.ExcelUtil;
import poiUtils.PPtUtil;
import poiUtils.TxtUtil;
import poiUtils.WordUtil;
public class App
{
    public static void main(String[] args) {
        // The read methods are static, so call them on the class rather than on a new instance
        System.out.println("E:\\test\\test1.doc:" + WordUtil.readWordFile("E:\\test\\test1.doc"));
        System.out.println("E:\\test\\test1.docx:" + WordUtil.readWordFile("E:\\test\\test1.docx"));
        System.out.println("E:\\test\\test1.xls:" + ExcelUtil.readExcelFile("E:\\test\\test1.xls"));
        System.out.println("E:\\test\\test1.xlsx:" + ExcelUtil.readExcelFile("E:\\test\\test1.xlsx"));
        System.out.println("E:\\test\\test1.ppt:" + PPtUtil.readPPtFile("E:\\test\\test1.ppt"));
        System.out.println("E:\\test\\test1.pptx:" + PPtUtil.readPPtFile("E:\\test\\test1.pptx"));
        System.out.println("E:\\test\\test1.txt:" + TxtUtil.readTxtFile("E:\\test\\test1.txt"));
    }
}
Output (作者 means "author"):
作者:ying
E:\test\test1.doc:["文檔內容","第一段","第二段"]
作者:ying
E:\test\test1.docx:["文檔內容","第一段","第二段"]
作者:ying
E:\test\test1.xls:[["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
作者:ying
E:\test\test1.xlsx:[["小王","男","19.0","Mon Nov 08 00:00:00 CST 1999","true"],["大王","女","20.0","Wed Nov 04 00:00:00 CST 1998","false"]]
作者:ying
E:\test\test1.ppt:["這個標題","這個內容"]
作者:ying
E:\test\test1.pptx:["這個標題","這個內容"]
E:\test\test1.txt:["txt內容","第二行","第三行"]