VietOCR.NET

DESCRIPTION

VietOCR.NET is a .NET GUI frontend for the Tesseract OCR engine, providing character recognition support for common image formats and multi-page images. The program includes postprocessing that helps correct errors commonly encountered in the OCR process, improving the accuracy of the result. The program can also run as a console application from the command line.

Batch processing is now supported. The program monitors a watch folder for new image files, automatically processes them through the OCR engine, and outputs recognition results to an output folder.
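
The watch-folder behaviour described above can be illustrated with a short Python sketch. This is not VietOCR.NET's implementation; it is a minimal polling approach under stated assumptions. The function names, the extension list, and the one-.txt-per-image naming rule are all hypothetical.

```python
from pathlib import Path

# Assumed set of extensions; the formats VietOCR.NET actually accepts may differ.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def find_new_images(watch_dir, seen):
    """Return image files in watch_dir whose names are not in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(watch_dir).iterdir()):
        is_image = path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def output_path_for(image_path, output_dir):
    """Map an input image to a .txt file with the same base name in output_dir (assumed rule)."""
    return Path(output_dir) / (Path(image_path).stem + ".txt")
```

A caller would invoke find_new_images in a loop with a short sleep between passes, run OCR on each returned file, and write the text to output_path_for(...).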

SYSTEM REQUIREMENTS

Microsoft .NET Framework 2.0 Redistributable.

If you encounter a FileLoadException with the message "Could not load file or assembly 'tesseract, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. This application has failed to start because the application configuration is incorrect. Reinstalling the application may fix this problem. (Exception from HRESULT: 0x800736B1)" while running VietOCR.NET, please install the Microsoft Visual C++ 2010 SP1 Redistributable Package.

INSTALLATION

If you do not have permission to install under the C:\Program Files folder, you can specify another folder in the installer's Installation Folder dialog.

Scanning support is provided via the Windows Image Acquisition (WIA) Library v2.0, which requires Windows XP Service Pack 1 (SP1) or later; the library is included in Windows Vista and Windows 7. To install the WIA Library, copy the wiaaut.dll file to your System32 directory (usually located at C:\Windows\System32) and run the following from the command line:

regsvr32 C:\Windows\System32\wiaaut.dll

PDF support is available via GPL Ghostscript. After installing the library, please ensure that the dynamic-link library gsdll32.dll is in the search path by setting the Path environment variable, which is accessible through Windows' Control Panel > System > Advanced tab > Environment Variables. For instance, append the following to the Path variable value for GS version 9.15:

;C:\Program Files\gs\gs9.15\bin
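
To confirm that the Path change took effect, one option is to search the Path directories for the DLL. The Python sketch below is a hypothetical helper, not part of VietOCR.NET or Ghostscript. It only inspects the Path variable, whereas Windows also searches other locations when loading a DLL.

```python
import os


def find_on_path(filename, path_value=None):
    """Return the first directory in the Path value that contains filename, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        # Entries are sometimes quoted or padded with spaces in the Path value.
        directory = directory.strip().strip('"')
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    location = find_on_path("gsdll32.dll")
    print(location or "gsdll32.dll was not found in any Path directory")
```

Run it from a newly opened command prompt, since windows that were already open keep the old Path value.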

Spellcheck functionality is available through Hunspell, whose dictionary files (.aff, .dic) should be placed in the dict folder of VietOCR.

INSTRUCTIONS

Language data packs for Tesseract should be decompressed into the Tesseract installation folder; the data files, whose names start with ISO 639-3 codes, will be placed in the tessdata subdirectory. VietOCR also supports downloading and installing selected language packs via the Download Language Data menu item. If the tessdata folder is inside a system folder, such as C:\Program Files, you may need to run the program as an administrator to install the downloaded data into it.

The Vietnamese language data were generated for the Times New Roman, Arial, Verdana, and Courier New fonts. Recognition is therefore more successful for images with similar font glyphs. OCRing images whose glyphs look different from the supported fonts will generally require training Tesseract to create another language data pack specifically for those typefaces. Language data for some VNI and TCVN3 (ABC) fonts have also been bundled in the latest versions.

Images to be OCRed should be scanned at a resolution between 200 and 400 DPI (dots per inch) in monochrome (black&white) or grayscale. Scanning at higher resolutions will not necessarily result in better recognition accuracy, which currently can be higher than 97% for Vietnamese, and the next release of Tesseract may improve it even further. Even so, the actual rates still depend greatly on the quality of the scanned image. The typical scan settings are 300 DPI at 1 bpp (bit per pixel) black&white or 8 bpp grayscale, saved in uncompressed TIFF or PNG format.
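
As a rough guide to what these settings mean in practice, the Python sketch below computes the pixel dimensions and approximate uncompressed data size of a scan. The function names are hypothetical. The size estimate assumes each pixel row is padded to a whole byte and ignores file headers, so real files will differ slightly.

```python
import math


def scan_dimensions(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Approximate raw image size, assuming byte-aligned rows and no header."""
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


# US Letter page (8.5 x 11 inches) at 300 DPI
width, height = scan_dimensions(8.5, 11, 300)   # 2550 x 3300 pixels
mono = uncompressed_bytes(width, height, 1)     # 1,052,700 bytes (about 1 MB)
gray = uncompressed_bytes(width, height, 8)     # 8,415,000 bytes (about 8 MB)
```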

The Screenshot Mode offers better recognition rates for low-resolution images, such as screen prints, by rescaling them to 300 DPI.
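
The rescaling arithmetic can be sketched as follows. This Python sketch only shows how the target dimensions might be derived. The 96 DPI source resolution is an assumption (a common screen setting), and the function name is hypothetical; how VietOCR.NET actually resamples the image is not described here.

```python
def rescale_for_ocr(width_px, height_px, source_dpi=96, target_dpi=300):
    """Return the pixel dimensions after scaling from source_dpi to target_dpi."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# A 1024 x 768 screen capture at an assumed 96 DPI becomes 3200 x 2400 pixels.
new_size = rescale_for_ocr(1024, 768)
```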

In addition to the built-in text postprocessing algorithm, you can add your own custom text replacement scheme via a text file named x.DangAmbigs.txt, where x is the ISO 639-3 language code. The UTF-8-encoded file should contain oldValue=newValue pairs, delimited by an equal sign.
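
To illustrate how such a file could be interpreted, here is a minimal Python sketch that parses oldValue=newValue lines and applies them as plain literal substitutions, in file order. It is an assumption about the semantics, not VietOCR.NET's own code. The function names are hypothetical, and any extra syntax the real file may support is not modelled.

```python
def parse_replacements(text):
    """Parse oldValue=newValue lines into an ordered list of (old, new) pairs."""
    pairs = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("=", 1)
        if old:
            pairs.append((old, new))
    return pairs


def load_replacements(path):
    """Read a UTF-8 file (with or without a byte order mark) of replacement pairs."""
    with open(path, encoding="utf-8-sig") as handle:
        return parse_replacements(handle.read())


def apply_replacements(text, pairs):
    """Apply each literal replacement in order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text
```

For example, a file containing the two lines h0a=hoa and uhìu=nhìn would turn the OCR output "h0a uhìu" into "hoa nhìn".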

You can put init-only and non-init control parameters in tessdata/configs/tess_configs and tess_configvars files, respectively, to modify Tesseract's behaviour.

Built-in tools are provided to merge several images or PDF files into a single file for convenient OCR operations, or to split a PDF file into smaller ones when it is large enough to cause out-of-memory exceptions.

POSTPROCESSING

Recognition errors generally fall into three categories. Many errors involve letter case (for example, hOa, nhắC) and can easily be corrected in popular Unicode text editors. Many others are artifacts of the OCR process, such as missing diacritical marks or letters confused with similarly shaped ones: huu for hưu, marg for mang, h0a for hoa, 1a for la, uhìu for nhìn. These can also be fixed easily with a spell checker. The built-in Postprocessing function can help correct many errors in these first two categories.

The last category is the most difficult to detect because the errors are semantic: the words are valid dictionary entries but wrong in context (e.g., tinh for tình, vân for vấn). These errors require the editor to read through the text and correct them manually against the original image.

Following are instructions on how to correct the first two categories of OCR errors using the built-in functionality:

  1. Group lines. Because each OCRed line becomes a separate one-line paragraph, the lines need to be regrouped into the paragraphs they belong to. Use the Remove Line Breaks function under the Format menu. Note that this operation may not be needed for poems.
  2. Select Change Case, also under the Format menu, and choose Sentence case to correct most of the letter case errors. Locate and fix any remaining letter case errors.
  3. Correct the spelling errors using the integrated Spell Check.
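
The first two steps can be approximated in a few lines of Python. This sketch is not the program's implementation. It assumes that blank lines mark paragraph boundaries and that sentences end with ".", "!" or "?", and the function names are hypothetical. Proper nouns lowercased by the case conversion would still need manual fixing.

```python
import re


def remove_line_breaks(text):
    """Join lines within a paragraph; blank lines are kept as paragraph breaks (assumed rule)."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


def sentence_case(text):
    """Lowercase the text, then capitalize the first letter of each sentence."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda match: match.group(1) + match.group(2).upper(),
        lowered,
    )
```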

Through the above process, most common errors can be eliminated. The remaining semantic errors are few, but they require a human editor to read through the text and make the edits needed to match the original scanned document.

If you have any questions, please post them in the VietOCR Forum.