Using Tesseract OCR [Spring Boot]

Original

2023-08-28 23:47

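Tesseract is an open-source OCR library, widely regarded as one of the best and most accurate open-source OCR systems. Beyond its accuracy, it is also highly flexible: it can be trained to recognize new fonts, and it can recognize any Unicode character for which trained language data is available.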

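Installing Tesseract-OCR
1.1 First, download the Tesseract version that fits your needs. Versions 3.0 and later support Chinese. This article uses […] as its example.
Tesseract download: https://digi.bib.uni-mannheim.de/tesseract/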

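1.1.1 When the download finishes, run the .exe file and follow the prompts to install. The rest of this article assumes the install location is "D:\Program Files\Tesseract-OCR".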

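1.1.2 Add a system environment variable named TESSDATA_PREFIX whose value is the path of the tessdata folder inside the Tesseract-OCR directory; here that is "D:\Program Files\Tesseract-OCR\tessdata". (Tesseract 3.x expected the parent directory, "D:\Program Files\Tesseract-OCR", instead.)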

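1.1.3 Run tesseract -v in cmd. If the version information is displayed, the installation succeeded. If the command is not found, add the install directory to the PATH variable.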

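1.1.4 Download the language data that matches your Tesseract version.

1.1.5 Language data: https://tesseract-ocr.github.io/tessdoc/Data-Files

1.1.6 Download the Simplified Chinese data file chi_sim.traineddata. The English data file eng.traineddata ships with the installer by default and does not need to be downloaded.

1.1.7 Put the file in the tessdata folder.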
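Before wiring Tesseract into an application, it can help to confirm from Java that the tessdata folder and the language files from the steps above are where the code expects them. The following is a minimal sketch under stated assumptions: the class TessdataCheck and its method names are hypothetical (they are not part of Tesseract or tess4j), and the fallback path is simply the install location assumed in this article.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tesseract or tess4j.
final class TessdataCheck {

    // Uses the TESSDATA_PREFIX value when it is set, otherwise the given fallback folder.
    static String resolveFolder(String envValue, String fallback) {
        if (envValue == null || envValue.trim().isEmpty()) {
            return fallback;
        }
        return envValue.trim();
    }

    // Returns the file names (e.g. "chi_sim.traineddata") that are absent from the folder.
    static List<String> missingLanguageFiles(Path tessdataFolder, List<String> languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages) {
            String fileName = language + ".traineddata";
            if (!Files.isRegularFile(tessdataFolder.resolve(fileName))) {
                missing.add(fileName);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed fallback: the install location used in this article.
        String folder = resolveFolder(System.getenv("TESSDATA_PREFIX"),
                "D:\\Program Files\\Tesseract-OCR\\tessdata");
        List<String> missing =
                missingLanguageFiles(Paths.get(folder), Arrays.asList("chi_sim", "eng"));
        if (missing.isEmpty()) {
            System.out.println("All language files found in " + folder);
        } else {
            System.out.println("Missing in " + folder + ": " + missing);
        }
    }
}
```

Running this once after installation gives a quicker signal than waiting for an OCR call to fail at runtime because a .traineddata file is missing.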

Using it in Spring Boot

2.1 Add the dependency

<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>5.6.0</version>
</dependency>

2.2 Example code

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

  // Runs OCR on the image file and appends the recognized text to the result file.
  // Returns false if recognition or writing fails.
  public boolean doOcr(String fileName) {
      try {
        File file = new File(fileName);
        Tesseract tesseract = new Tesseract();
        // Folder that holds the .traineddata files
        tesseract.setDatapath("D:\\Program Files\\Tesseract-OCR\\tessdata");
        // Simplified Chinese; requires chi_sim.traineddata
        tesseract.setLanguage("chi_sim");
        String result = tesseract.doOCR(file);
        writeObj(result);
        return true;
      } catch (TesseractException | IOException e) {
        e.printStackTrace();
        return false;
      }
    }

  // Appends the recognized text to a fixed output file.
  public void writeObj(String result) throws IOException {
      // Absolute path; the file is created if it does not exist yet
      String newFileName = "F:\\demo1\\result1.txt";
      File outputFile = new File(newFileName);
      // Append mode; try-with-resources flushes and closes the writer even if write() fails
      try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFile, true))) {
        out.write(result);
      }
      System.out.println("Write succeeded!");
    }
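One detail worth noting about writeObj: FileWriter uses the platform default charset (UTF-8 by default only from Java 18 onward), so Chinese OCR output may end up garbled on systems with a different default. The sketch below shows one possible alternative that pins the encoding to UTF-8. The class OcrResultWriter and the method appendUtf8 are hypothetical names introduced for illustration, not part of tess4j.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical alternative to writeObj that always writes UTF-8.
final class OcrResultWriter {

    // Appends the text to the target file, creating the file if it does not exist yet.
    static void appendUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

A call such as OcrResultWriter.appendUtf8(Paths.get("F:\\demo1\\result1.txt"), result) would take the place of writeObj(result), under the same assumption about the output location.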

If the image is supplied as a Base64-encoded string

    // aa holds the Base64-encoded image string
    Tesseract tesseract = new Tesseract();
    tesseract.setDatapath("D:\\Program Files\\Tesseract-OCR\\tessdata");
    tesseract.setLanguage("chi_sim");
    byte[] decode = Base64.getDecoder().decode(aa); // java.util.Base64
    ByteArrayInputStream in = new ByteArrayInputStream(decode); // wrap the decoded bytes in an input stream
    try {
      BufferedImage image = ImageIO.read(in); // decode the stream into an image; returns null for unsupported formats
      String result = tesseract.doOCR(image);
      System.out.println(result);
    } catch (IOException | TesseractException e) {
      throw new RuntimeException(e);
    }
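Base64 image strings sent from a browser often arrive as a data URI (for example "data:image/png;base64,...") or with line breaks inserted, and a plain decoder rejects both. The following sketch shows one way to normalize the input before decoding. It is an assumption about how the string aa might be delivered, not something the original example covers, and the class Base64Images and method decodeImage are hypothetical names.

```java
import java.util.Base64;

// Hypothetical helper: accepts either a bare Base64 string or a data URI
// such as "data:image/png;base64,....".
final class Base64Images {

    static byte[] decodeImage(String input) {
        String payload = input.trim();
        int comma = payload.indexOf(',');
        if (payload.startsWith("data:") && comma >= 0) {
            // Drop the "data:...;base64," header and keep only the encoded bytes
            payload = payload.substring(comma + 1);
        }
        // The MIME decoder skips line breaks that some clients insert into long strings
        return Base64.getMimeDecoder().decode(payload);
    }
}
```

The returned bytes can be wrapped in a ByteArrayInputStream and passed to ImageIO.read exactly as in the snippet above. Note that the MIME decoder silently ignores characters outside the Base64 alphabet, so a strict decoder may be preferable when malformed input should be rejected.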
