
2023-01-30  

Tesseract’s Simplified Chinese language package, “chi_sim”, has a low recognition rate for handwritten Chinese and for unusual fonts or environments. To improve the recognition rate, you need to train your own language library from your own samples.

sudo apt install tesseract-ocr

Steps:

1. Tool preparation:

(1) Official documentation: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

(2) Java virtual machine. jTessBoxEditor depends on the Java runtime environment, so a Java virtual machine must be installed.

Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

(3) jTessBoxEditor 2.0, a tool for adjusting the content and position of the text on the image.

Download address: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

After unpacking the archive, run “jTessBoxEditor.jar” to open the tool.

java -jar jTessBoxEditor.jar   # Linux

2. Sample image preparation: (the more training samples, the better)

Here only two different font samples are prepared for testing:

3. Use jTessBoxEditor to merge the training samples into a single TIF image:

(1) Open jTessBoxEditor, select Tools -> Merge TIFF, go to the folder that contains the training samples, and select the sample images to be used for training:

(2) Click “Open” to bring up the save dialog. Save the file in the current path under the name “zwp.test.exp0.tif”. The only available format is “TIFF”.

The TIF file naming format is [lang].[fontname].exp[num].tif
lang is the language name, fontname is the font name, and num is a number of your choice.

For example, to train a custom language zwp with the font name test, name the image file zwp.test.exp0.tif
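
As a minimal sketch of the naming rule, the shell helper below assembles the base name from its three parts. The function name `training_basename` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the [lang].[fontname].exp[num] base name.
training_basename() {
    lang=$1   # language name, e.g. zwp
    font=$2   # font name, e.g. test
    num=$3    # sample number, e.g. 0
    printf '%s.%s.exp%s\n' "$lang" "$font" "$num"
}

# Example: prints zwp.test.exp0.tif
echo "$(training_basename zwp test 0).tif"
```

The same base name is reused for the .box and .tr files in the later steps, so deriving it in one place helps keep the names consistent.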

4. Use Tesseract to generate the .box file:

Open a command line in the directory that contains the “zwp.test.exp0.tif” file generated in the previous step and execute the following command. It generates the file zwp.test.exp0.box. (On Tesseract 4 or newer, the page segmentation option is written --psm instead of -psm.)

tesseract zwp.test.exp0.tif zwp.test.exp0 -l chi_sim -psm 7 batch.nochop makebox

5. Use jTessBoxEditor to correct errors in the .box file:

The .box file records the position and recognized content of each character in the image. Before training, use jTessBoxEditor to correct the position and content of the characters.

Open jTessBoxEditor and click Box Editor -> Open to open the “zwp.test.exp0.tif” generated in step 3; the “zwp.test.exp0.box” file is loaded with it automatically. The two files must be in the same directory. You can merge, split, insert and delete boxes, and adjust the box around each character. After the adjustment, click “Save” to save the changes.
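
Each line of a .box file holds one character followed by its left, bottom, right and top coordinates and a page number. As a rough sanity check after editing, the sketch below flags lines that do not have these six fields. The function name `check_box_file` is hypothetical, and the check does not verify that the coordinates themselves are sensible.

```shell
#!/bin/sh
# Hypothetical helper: report lines of a .box file that do not have the
# six expected fields: <char> <left> <bottom> <right> <top> <page>.
# Exits with status 1 if any such line is found, 0 otherwise.
check_box_file() {
    awk 'NF != 6 { printf "line %d: expected 6 fields, got %d\n", NR, NF; bad = 1 }
         END { exit bad }' "$1"
}

# Usage (assumes the file exists in the current directory):
# check_box_file zwp.test.exp0.box && echo "box file looks well-formed"
```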

6. Generate the font_properties file: (this file has no extension)

(1) Execute the following command. It creates the font_properties file in the current directory:

echo test 0 0 0 0 0 >font_properties

(2) Alternatively, create a new text file named font_properties and enter the content “test 0 0 0 0 0”. The five numbers are the five attributes of the font test (italic, bold, fixed, serif, fraktur). The “test” here must match the “test” in the file name “zwp.test.exp0.box”.
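
Because the font name has to match the file name, one option is to derive the font_properties line from the file name instead of typing it. This is only a sketch: the function name `font_properties_line` is hypothetical, and it assumes the file follows the [lang].[fontname].exp[num] pattern and that all five attributes are 0.

```shell
#!/bin/sh
# Hypothetical helper: print the font_properties line for a training file
# named [lang].[fontname].exp[num].*; the five flags are all set to 0.
font_properties_line() {
    font=$(printf '%s\n' "$1" | cut -d. -f2)   # second dot-separated field
    printf '%s 0 0 0 0 0\n' "$font"
}

# Example: prints "test 0 0 0 0 0"
font_properties_line zwp.test.exp0.box
```

Redirecting the output, as in `font_properties_line zwp.test.exp0.box > font_properties`, would produce the same file as the echo command in (1).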

7. Use Tesseract to generate the .tr training file:

Execute the following command. When it finishes, the file zwp.test.exp0.tr is generated in the current directory.

tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train

8. Generate the character set file:

Execute the following command. When it finishes, a file named “unicharset” is generated in the current directory.

unicharset_extractor zwp.test.exp0.box

9. Generate the shape file:

Execute the following command. When it finishes, two files are generated: shapetable and zwp.unicharset.

shapeclustering -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr

10. Generate the clustered character feature files:

Execute the following command, which generates four files: inttemp, pffmtable, shapetable and zwp.unicharset.

mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr

11. Generate the character normalization feature file:

Execute the following command to generate the normproto file.

cntraining zwp.test.exp0.tr

12. Rename the files:

Rename the four files inttemp, pffmtable, shapetable and normproto to [lang].xxx.

Here they become zwp.inttemp, zwp.pffmtable, zwp.shapetable and zwp.normproto.

Execute the following commands:

mv normproto zwp.normproto
mv inttemp zwp.inttemp
mv pffmtable zwp.pffmtable
mv shapetable zwp.shapetable
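
The four renames can also be written as a loop, which is easier to adapt to another language name. This is a sketch only; the function name `prefix_training_files` is hypothetical, and it assumes it is run from the directory that holds the training output.

```shell
#!/bin/sh
# Hypothetical helper: prefix the four training outputs with "<lang>.".
# Files that are missing are reported on stderr and skipped.
prefix_training_files() {
    lang=$1
    for f in inttemp pffmtable shapetable normproto; do
        if [ -f "$f" ]; then
            mv "$f" "$lang.$f"
        else
            echo "missing: $f" >&2
        fi
    done
}

# Usage, from the training directory:
# prefix_training_files zwp
```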

13. Merge the training files:

Execute the following command to generate the zwp.traineddata file.

combine_tessdata zwp.

In the log output, offsets 1, 3, 4, 5 and 13 should not be -1 (the rest do not matter); this indicates that the new language package was generated successfully.

Copy the generated “zwp.traineddata” language package file to the tessdata folder in the Tesseract-OCR installation directory. The trained language package can then be used for image text recognition.

14. Test:

Enter the following command; -l specifies the language package generated by training. (Before running it, make sure the generated “zwp.traineddata” file has been copied to the tessdata folder in the Tesseract-OCR installation directory.)

tesseract test.PNG test -l zwp
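
If the command reports that the language cannot be loaded, it may help to confirm that Tesseract can see the new package. The sketch below filters the output of `tesseract --list-langs`; the function name `has_lang` is hypothetical, and the usage line assumes tesseract is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the language name $1 appears as a whole
# line in the list read from standard input.
has_lang() {
    grep -qx -- "$1"
}

# Usage (some versions print the list on stderr, hence 2>&1):
# tesseract --list-langs 2>&1 | has_lang zwp && echo "zwp is installed"
```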

After switching to the newly trained language package, text that could not be recognized before can now be recognized.

To train with several samples, you only need the TIF and BOX files of each sample. Steps 1 to 6 are almost the same as above; the only difference is the file names. The font name test (which you define yourself) can stay the same while the number changes, for example zwp.test.exp0.tif, zwp.test.exp1.tif, and so on.
From step 7 onward, first generate the .tr file for every BOX file:

tesseract zwp.test.exp0.tif zwp.test.exp0  nobatch box.train
tesseract zwp.test.exp1.tif zwp.test.exp1  nobatch box.train
tesseract zwp.test.exp2.tif zwp.test.exp2  nobatch box.train
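
With many samples, the per-file commands can be generated by a loop over the TIF files. This is a sketch under the assumption that all samples are named zwp.test.exp[num].tif and sit in the current directory; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the base names (without .tif) of every
# zwp.test.exp*.tif file in the current directory, one per line.
list_training_bases() {
    for tif in zwp.test.exp*.tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        printf '%s\n' "${tif%.tif}"
    done
}

# Hypothetical wrapper: run box.train once per base name.
make_tr_files() {
    list_training_bases | while read -r base; do
        tesseract "$base.tif" "$base" nobatch box.train </dev/null
    done
}

# Usage, from the training directory:
# make_tr_files
```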

8. Extract the character set:

unicharset_extractor zwp.test.exp0.box zwp.test.exp1.box zwp.test.exp2.box

9. Execute the following command. When it finishes, two files are generated: shapetable and unicharset.

shapeclustering -F font_properties -U unicharset -O unicharset zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr

10. Generate the clustered character feature files:

Execute the following command, which generates four files: inttemp, pffmtable, shapetable and unicharset.

mftraining -F font_properties -U unicharset -O unicharset zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr

11. Generate the character normalization feature file:

Execute the following command, which generates the normproto file.

cntraining zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr

12. Then rename all five files generated above so that they start with zwp, for example zwp.inttemp.
Execute the following commands:

mv unicharset zwp.unicharset
mv normproto zwp.normproto
mv inttemp zwp.inttemp
mv pffmtable zwp.pffmtable
mv shapetable zwp.shapetable

13. Merge all the files into a single traineddata file:

combine_tessdata zwp.

Original text: https://blog.csdn.net/a745233700/article/details/80175883
