Tesseract's Simplified Chinese language package, chi_sim, has a low recognition rate for handwritten Chinese and for unusual fonts or environments. To improve the recognition rate, you need to train your own language library with your own samples.
sudo apt install tesseract-ocr
Steps:
1. Tool preparation:
(1) Official documentation: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
(2) Java virtual machine. jTessBoxEditor runs on Java, so a Java runtime must be installed.
Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
(3) jTessBoxEditor 2.0, a tool for adjusting the content and position of the text boxes on the picture.
Download address: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
After decompressing the archive, run “jTessBoxEditor.jar” to open the tool.
java -jar jTessBoxEditor.jar   # Linux
2. Sample image preparation (the more training samples, the better):
Here only two samples in different fonts are prepared for testing.
3. Use jTessBoxEditor to merge the training samples into one TIF picture:
(1) Open jTessBoxEditor, select Tools -> Merge TIFF, go to the folder that contains the training samples, and select the sample pictures to train on.
(2) Click “Open” to bring up the save dialog box. Save under the current path with the file name “zwp.test.exp0.tif”. TIFF is the only available format.
The TIF file naming format is [lang].[fontname].exp[num].tif
lang is the language name, fontname is the font name, and num is a custom number.
For example, to train a custom language zwp with the font name test, name the picture file zwp.test.exp0.tif
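The naming rule above can be captured in a small shell helper so that every sample gets a consistent name. This is only a sketch; the function name training_name is hypothetical and not part of Tesseract or jTessBoxEditor.

```shell
# Hypothetical helper: build "[lang].[fontname].exp[num].tif" from its parts.
training_name() {
    lang="$1"; font="$2"; num="$3"
    printf '%s.%s.exp%s.tif\n' "$lang" "$font" "$num"
}

training_name zwp test 0   # prints zwp.test.exp0.tif
```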
4. Use Tesseract to generate the .box file:
Open a command line in the directory that contains the “zwp.test.exp0.tif” file generated in the previous step and execute the following command. It generates the zwp.test.exp0.box file.
tesseract zwp.test.exp0.tif zwp.test.exp0 -l chi_sim -psm 7 batch.nochop makebox
(-psm is the Tesseract 3.x spelling. Newer 4.x releases expect --psm 7 instead.)
5. Use jTessBoxEditor to correct errors in the .box file:
The .box file records the position and recognized content of each character on the picture. Before training, use jTessBoxEditor to adjust the position and content of the characters.
Open jTessBoxEditor and click Box Editor -> Open to open the “zwp.test.exp0.tif” file generated in step 3. The tool automatically loads the matching “zwp.test.exp0.box” file. The two files must be in the same directory. You can merge, split, insert, and delete boxes, and adjust the box around each character. When the adjustment is done, click “Save” to save the changes.
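Each line of a .box file has six space-separated fields: the character, then the left, bottom, right, and top coordinates of its box, then the page number. After hand-editing, a quick sanity check can catch lines that were damaged by mistake. The sketch below is an assumption-laden helper (check_box is a hypothetical name); it only verifies the field count and that the five trailing fields are non-negative integers.

```shell
# Hypothetical check: report lines that are not
# "symbol left bottom right top page" with integer coordinates.
check_box() {
    awk 'NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ ||
         $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $6 !~ /^[0-9]+$/ {
             printf "line %d looks malformed: %s\n", NR, $0
             bad = 1
         }
         END { exit bad }' "$1"
}
```

Usage: check_box zwp.test.exp0.box returns a non-zero status and lists the suspicious lines if any are found.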
6. Generate the font_properties file (this file has no extension):
(1) Execute the following command. It creates the font_properties file in the current directory.
echo test 0 0 0 0 0 >font_properties
(2) Alternatively, create a new text file named font_properties and enter the content “test 0 0 0 0 0”, which sets the 5 attributes of the font test. The “test” here must match the “test” in the file name “zwp.test.exp0.box”.
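In the Tesseract training documentation, the five values after the font name are the flags italic, bold, fixed, serif, and fraktur, each 0 or 1. Because the font name must match the one in the file name, a small check can catch a mismatch before training. The helpers below are a sketch; font_of and has_font_entry are hypothetical names, and the font name is assumed to contain no regular-expression special characters.

```shell
# Hypothetical helper: extract the font name from "[lang].[fontname].exp[num].box".
font_of() {
    basename "$1" | cut -d. -f2
}

# Succeeds only if the font_properties file (second argument, default
# ./font_properties) has a line starting with that font name.
has_font_entry() {
    font=$(font_of "$1")
    grep -q "^$font " "${2:-font_properties}"
}
```

Usage: has_font_entry zwp.test.exp0.box || echo "font_properties has no entry for this font"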
7. Use Tesseract to generate the .tr training file:
Execute the following command. It generates the zwp.test.exp0.tr file in the current directory.
tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train
8. Generate the character set file:
Execute the following command. It generates a file named “unicharset” in the current directory.
unicharset_extractor zwp.test.exp0.box
9. Generate the shape file:
Execute the following command. It generates two files: shapetable and zwp.unicharset.
shapeclustering -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
10. Generate the clustered character feature files:
Execute the following command, which generates four files: inttemp, pffmtable, shapetable, and zwp.unicharset.
mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
11. Generate the character normalization feature file:
Execute the following command to generate the normproto file.
cntraining zwp.test.exp0.tr
12. Rename the files:
Rename the four files inttemp, pffmtable, shapetable, and normproto to the form [lang].xxx. Here they become zwp.inttemp, zwp.pffmtable, zwp.shapetable, and zwp.normproto.
Execute the following commands:
mv normproto zwp.normproto
mv inttemp zwp.inttemp
mv pffmtable zwp.pffmtable
mv shapetable zwp.shapetable
13. Merge the training files:
Execute the following command to generate the zwp.traineddata file.
combine_tessdata zwp.
In the log output, the entries at offsets 1, 3, 4, 5, and 13 must not be -1 (the rest do not matter). This indicates that the new language package was generated successfully.
Copy the generated “zwp.traineddata” language package file to the tessdata folder in the Tesseract-OCR installation directory. The trained language package can then be used for image text recognition.
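Steps 7 to 13 can be chained in one script so that a failed step stops the run. The following is a minimal sketch, not a definitive implementation. The names LANG_NAME, FONT, DRY_RUN, run, and train_one are illustrative and not Tesseract settings. With DRY_RUN=1 (the default here) the script only prints the commands, which are the same ones shown in the steps above.

```shell
#!/bin/sh
# Sketch of steps 7-13 for a single sample.
LANG_NAME=${LANG_NAME:-zwp}
FONT=${FONT:-test}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = run them

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

train_one() {
    base="$LANG_NAME.$FONT.exp0"
    run tesseract "$base.tif" "$base" nobatch box.train || return 1
    run unicharset_extractor "$base.box" || return 1
    run shapeclustering -F font_properties -U unicharset -O "$LANG_NAME.unicharset" "$base.tr" || return 1
    run mftraining -F font_properties -U unicharset -O "$LANG_NAME.unicharset" "$base.tr" || return 1
    run cntraining "$base.tr" || return 1
    # Step 12: give the four generated files the language prefix.
    for f in normproto inttemp pffmtable shapetable; do
        run mv "$f" "$LANG_NAME.$f" || return 1
    done
    run combine_tessdata "$LANG_NAME."
}

train_one
```

Set DRY_RUN=0 in the environment to actually execute the commands in the directory that holds the .tif, .box, and font_properties files.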
14. Test:
Enter the following command, where -l selects the language package generated by training. (The “zwp.traineddata” file must already be in the tessdata folder of the Tesseract-OCR installation directory.)
tesseract test.PNG test -l zwp
With the newly trained language package, text that could not be recognized before can now be recognized.
To train on several samples, you only need a TIF and BOX file pair for each sample. Steps 1 to 6 are the same as above. The only difference is the file names. The font name (test, chosen by you) can stay the same while the number changes, for example zwp.test.exp0.tif, zwp.test.exp1.tif, and so on.
From step 7 on, first generate the .tr file for every BOX file:
tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train
tesseract zwp.test.exp1.tif zwp.test.exp1 nobatch box.train
tesseract zwp.test.exp2.tif zwp.test.exp2 nobatch box.train
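With more than a few samples, typing one command per file becomes error-prone. A loop over the sample files is one way to do the same thing. This is a sketch under the assumption that all samples follow the zwp.test.exp[num].tif naming. PREFIX, DRY_RUN, and train_all_boxes are hypothetical names, and by default the loop only prints the commands it would run.

```shell
PREFIX=${PREFIX:-zwp.test}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands

# Run box.train for every PREFIX.exp*.tif in the current directory.
train_all_boxes() {
    for tif in "$PREFIX".exp*.tif; do
        [ -e "$tif" ] || continue          # the pattern matched no file
        base=${tif%.tif}
        echo "+ tesseract $tif $base nobatch box.train"
        [ "$DRY_RUN" = 1 ] || tesseract "$tif" "$base" nobatch box.train || return 1
    done
}

train_all_boxes
```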
8. Extract the character set:
unicharset_extractor zwp.test.exp0.box zwp.test.exp1.box zwp.test.exp2.box
9. Execute the following command. It generates two files: shapetable and unicharset.
shapeclustering -F font_properties -U unicharset -O unicharset zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr
10. Generate the clustered character feature files:
Execute the following command, which generates four files: inttemp, pffmtable, shapetable, and unicharset.
mftraining -F font_properties -U unicharset -O unicharset zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr
11. Generate the character normalization feature file:
Execute the following command, which generates the normproto file.
cntraining zwp.test.exp0.tr zwp.test.exp1.tr zwp.test.exp2.tr
12. Then rename all five files generated above so that they begin with zwp, for example zwp.inttemp.
Execute the following commands:
mv unicharset zwp.unicharset
mv normproto zwp.normproto
mv inttemp zwp.inttemp
mv pffmtable zwp.pffmtable
mv shapetable zwp.shapetable
13. Merge all the files into one language data file:
combine_tessdata zwp.
Original text:https://blog.csdn.net/a745233700/article/details/80175883