Recognize Vietnamese text using Tesseract OCR


After installing Tesseract, download the Vietnamese language data pack for Tesseract and uncompress it into the Tesseract installation folder; the vie.* files will be placed in the tessdata subdirectory. To perform OCR on an image using Tesseract:

tesseract vietsample.tif output -l vie
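
The same command can be scripted when many images need to be processed. The following is only a minimal Python sketch, assuming the tesseract executable is on the PATH and the vie data files are installed as described above; the function names build_tesseract_command and run_ocr are made up for this illustration. Tesseract writes its result to a text file named after the output base, which the sketch reads back as UTF-8.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="vie"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="vie"):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    # Assumption: the tesseract executable is reachable through PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Prints the command line used in this article.
    print(" ".join(build_tesseract_command("vietsample.tif", "output")))
```

Calling run_ocr("vietsample.tif", "output") would run the command and return the recognized text, provided the image file exists.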

The language data was generated specifically for the Times New Roman, Arial, Verdana, and Courier New fonts, so recognition is more successful on images whose glyphs resemble those fonts. Images whose glyphs look different from the supported fonts will generally require training Tesseract to create another language data pack specifically for those typefaces.

Update: More language data has been generated for legacy Vietnamese fonts — VNI and TCVN3 (ABC). (Read how to install.)

TIFF images to be OCRed should be scanned at a resolution between 200 and 400 DPI (dots per inch). Scanning at higher resolutions will not necessarily improve recognition accuracy, which can currently exceed 97% for Vietnamese (test image); the next release of Tesseract may improve it further. Even so, the actual rates still depend greatly on the quality of the scanned image. Typical scan settings are 300 DPI at 1 bpp (bit per pixel) black and white or 8 bpp grayscale, saved in uncompressed TIFF format.

When images are not of adequate quality for OCR, you can use standard image manipulation tools, such as GIMP, ImageMagick, or unpaper, to improve them. Some of these tools can even accept PDF files and export them to image formats suitable for OCR.
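
As an illustration of the PDF route, the sketch below builds an ImageMagick command that rasterizes a PDF at 300 DPI into an 8-bit grayscale, uncompressed TIFF, matching the scan settings mentioned earlier. It rests on several assumptions: ImageMagick is installed with PDF support (which normally relies on Ghostscript), the legacy convert command name is available (ImageMagick 7 uses magick instead), and the file names are placeholders. The function name build_convert_command is hypothetical.

```python
import subprocess


def build_convert_command(source, target, dpi=300):
    """Return an ImageMagick command that renders `source` as a grayscale TIFF."""
    return [
        "convert",                # assumption: legacy ImageMagick command name
        "-density", str(dpi),     # rasterization resolution, set before the input
        str(source),
        "-colorspace", "Gray",    # grayscale output
        "-depth", "8",            # 8 bits per pixel
        "-compress", "None",      # uncompressed TIFF
        str(target),
    ]


def convert_for_ocr(source, target, dpi=300):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(source, target, dpi), check=True)


if __name__ == "__main__":
    print(" ".join(build_convert_command("scan.pdf", "scan.tif")))
```

A multi-page PDF would typically produce a multi-page TIFF with this command, so the result may need to be split before OCR.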

A GUI front end for the Tesseract OCR engine is also available: VietOCR, an open-source Java/.NET application, provides document scanning and recognition support for PDF, TIFF, JPEG, GIF, PNG, and BMP image formats.

The recognition errors can be classified into three categories. Many are letter-case errors (for example, hOa, nhắC), which can be easily corrected in popular Unicode text editors. Many others result from the OCR process itself, such as missing diacritical marks or letters confused with similarly shaped ones: huu for hưu, marg for mang, h0a for hoa, 1a for la, uhìu for nhìn. These can also be easily fixed with Vietnamese spell-checking programs.

The last category of errors is the most difficult to detect because the errors are semantic: the words are valid dictionary entries but wrong in context, e.g., tinh vs. tình, vân vs. vấn. These errors require the editor to read through the text and correct them manually against the original image.
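
Some errors in the first two categories can be flagged mechanically before a spell checker is run. The sketch below is a rough heuristic, not what any particular editor or spell checker does: it marks tokens with an uppercase letter after the first character (hOa, nhắC) and tokens that mix digits with letters (h0a, 1a). All names are hypothetical, and errors such as huu or marg are not caught because they need a dictionary lookup.

```python
import re
import unicodedata

# Runs of letters and digits (Vietnamese letters included), without underscores.
WORD = re.compile(r"[^\W_]+")


def has_case_error(token):
    """True if an uppercase letter follows the first character (hOa, nhắC).

    Fully uppercase tokens are treated as acronyms or headings and skipped.
    """
    return not token.isupper() and any(ch.isupper() for ch in token[1:])


def has_digit_confusion(token):
    """True if the token mixes digits and letters (h0a, 1a)."""
    return any(ch.isdigit() for ch in token) and any(ch.isalpha() for ch in token)


def flag_suspects(text):
    """Return (token, reason) pairs for tokens that look like OCR errors."""
    # NFC keeps each accented Vietnamese letter as one precomposed character.
    text = unicodedata.normalize("NFC", text)
    suspects = []
    for token in WORD.findall(text):
        if has_case_error(token):
            suspects.append((token, "case"))
        elif has_digit_confusion(token):
            suspects.append((token, "digit"))
    return suspects


if __name__ == "__main__":
    for token, reason in flag_suspects("hOa nhắC h0a 1a hoa"):
        print(token, reason)
```

Semantic errors such as tinh vs. tình pass through unflagged, which is why the manual read-through remains necessary.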

The following instructions show how to correct the first two groups of errors quickly and effectively using the VietPad text editor. The process can be summarized as follows:

Group lines. After OCR, each line becomes a separate one-line paragraph, so the lines need to be regrouped into the paragraphs they belong to. Use the Remove Line Breaks function under the Format menu. Note that this operation may not be needed for poems.
Correct letter case. Select Change Case, also under the Format menu, and choose Sentence case to correct most of the letter-case errors. Then locate and fix the remaining letter-case errors.
Check spelling. Correct the misspellings using Spell Check under the Tool menu.
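
For readers who prefer to script the first two steps, the following Python sketch approximates them. It is not VietPad's implementation: it assumes paragraphs in the OCR output are separated by blank lines, and the function names remove_line_breaks and sentence_case are invented for this example. Sentence casing also lowercases proper nouns, so those still need to be fixed by hand.

```python
import re


def remove_line_breaks(text):
    """Join the lines of each paragraph; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def sentence_case(text):
    """Lowercase the text, then capitalize the first letter of each sentence."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)(\w)",  # start of a line, or end of the previous sentence
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    raw = "hOa lan\nthom. chim BAY xa.\n\nmua xuan"
    print(sentence_case(remove_line_breaks(raw)))
```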

After the above steps, most common errors should be eliminated. The remaining semantic errors are few, but they require a human editor to read through the text and make the edits needed to match the original scanned document and, if desired, to make it error-free.

If you have any questions, please post them in the VietUnicode Forum.