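Tesseract is an open-source OCR engine that supports many languages. Official repository: https://github.com/tesseract-ocr/tesseract
Documentation: https://tesseract-ocr.github.io/docs/
1. Installing Tesseract on macOS
Installing with brew install --with-training-tools tesseract now fails with "Error: invalid option: --with-training-tools", because the --with-training-tools option no longer exists. I wanted the training tools installed along with Tesseract, so I ended up building from source: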
# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc
git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
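Next, download the language data.
Download the .traineddata files you need and copy them into the tessdata directory.
Language data repository: https://github.com/tesseract-ocr/tessdata
Once all of this is done, you can check the recognition result from the terminal: tesseract 111.jpg stdout
Installation reference: https://www.freesion.com/article/1345377723/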
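The terminal check above can also be scripted. The following is a minimal sketch, not part of the original setup: it assumes the tesseract binary is on the PATH, and the class name TesseractCli and its two helper methods are hypothetical names chosen for illustration. It builds the same command line (optionally adding the CLI's -l option to select a language such as chi_sim) and captures what Tesseract prints to standard output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TesseractCli {

    /** Builds "tesseract <image> stdout [-l <lang>]" as an argument list. */
    public static List<String> buildCommand(String imagePath, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");          // assumption: the binary is on the PATH
        cmd.add(imagePath);
        cmd.add("stdout");             // write recognized text to standard output
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");             // language selection, e.g. chi_sim or eng
            cmd.add(lang);
        }
        return cmd;
    }

    /** Runs the command and returns the recognized text as a UTF-8 string. */
    public static String run(String imagePath, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, lang))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show Tesseract warnings
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (InputStream in = process.getInputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String image = args.length > 0 ? args[0] : "111.jpg";
        String lang = args.length > 1 ? args[1] : null;
        System.out.println(run(image, lang));
    }
}
```

Running it with the arguments 111.jpg chi_sim should be equivalent to typing tesseract 111.jpg stdout -l chi_sim in the terminal, provided chi_sim.traineddata is present in the tessdata directory.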
2. OCR in Java with tess4j
Add the tess4j Maven dependency:
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.5.1</version>
</dependency>
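Run the recognition demo:
import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Tess4jOcrTest {
    // The original snippet used an undeclared `log`; an SLF4J logger is assumed here.
    private static final Logger log = LoggerFactory.getLogger(Tess4jOcrTest.class);

    public static void main(String[] args) {
        String basePath = "/Users/seapeak/Desktop/";
        test1(basePath + "555.jpg");
    }

    /**
     * Runs OCR on the image at the given path and logs the recognized text.
     *
     * @param path path to the image file
     */
    public static void test1(String path) {
        File file = new File(path);
        ITesseract it = new Tesseract();
        // If the tessdata directory has not been moved, use the current directory:
        // it.setDatapath(".");
        // If the tessdata directory has been moved, point to its location:
        it.setDatapath("/Users/seapeak/Desktop/it/java/tesseract/tessdata/");
        // Use chi_sim when the text is mostly Chinese, eng when it is mostly Latin letters.
        it.setLanguage("chi_sim");
        try {
            String result = it.doOCR(file);
            log.info("Recognition result: {}", result);
        } catch (TesseractException e) {
            // Pass the exception as the last argument so SLF4J logs the stack trace.
            log.error("Tess4jOcrTest TesseractException", e);
        }
    }
}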
If the run fails with the following "language not found" error, call setDatapath with the location of your tessdata directory:
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Reference: https://blog.csdn.net/chenhailonghp/article/details/102704842
3. Using the training tools (to be continued)
See also:
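Because the error above usually means that the expected .traineddata file is not in the directory Tesseract is looking at, it can help to check the directory before calling doOCR. The sketch below is an illustration under assumptions, not part of tess4j: the class TessdataCheck and the method missingLanguages are hypothetical names, and it uses only the Java standard library. It relies on the convention that each language such as chi_sim is stored as chi_sim.traineddata, and that several languages can be combined with "+" (for example chi_sim+eng).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TessdataCheck {

    /**
     * Returns the languages from langSpec (e.g. "chi_sim+eng") that have no
     * matching <lang>.traineddata file directly inside tessdataDir.
     */
    public static List<String> missingLanguages(Path tessdataDir, String langSpec) {
        List<String> missing = new ArrayList<>();
        for (String lang : langSpec.split("\\+")) {
            Path dataFile = tessdataDir.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(dataFile)) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass your own tessdata path and language list.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String langs = args.length > 1 ? args[1] : "chi_sim";
        List<String> missing = missingLanguages(dir, langs);
        if (missing.isEmpty()) {
            System.out.println("All language files found in " + dir);
        } else {
            System.out.println("Missing .traineddata files in " + dir + ": " + missing);
        }
    }
}
```

If the list is not empty, either download the missing files from the tessdata repository or pass a different directory to setDatapath. Note that the error message in the log mentions 'eng', so eng.traineddata may be needed even when the language is set to chi_sim.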
https://blog.csdn.net/dcrmg/article/details/53677739
https://www.freesion.com/article/1345377723/