Convert Vietnamese documents in legacy character encoding to Unicode using UnicodeConverter

This page explains how to use UnicodeConverter to convert Vietnamese text, RTF, HTML, and Word/Excel/PowerPoint files in legacy encodings (VNI, VPS, VISCII, TCVN3, VIQR/Vietnet), NCR (windows-1252, iso-8859-1), and Unicode Composite (NFD) to Unicode Precomposed (NFC). The program comes in three versions: Java, Windows, and .NET. They share a similar graphical user interface and can convert multiple files, a directory including its subdirectories, or an entire website.
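To illustrate the two simplest conversions listed above, the following Python sketch decodes numeric character references (NCR) and normalizes Unicode Composite (NFD) text to Precomposed (NFC) using only the standard library. It is not UnicodeConverter's own code: the function name to_precomposed is hypothetical, and the legacy byte encodings (VNI, VPS, VISCII, TCVN3) need dedicated mapping tables that are not shown here.

```python
import html
import unicodedata


def to_precomposed(text: str) -> str:
    """Decode NCRs such as &#7879; and normalize the result to NFC."""
    decoded = html.unescape(text)  # turns "&#7879;" into the character U+1EC7
    return unicodedata.normalize("NFC", decoded)


# "e" followed by combining dot below (U+0323) and combining circumflex (U+0302)
composite_sample = "Vie\u0323\u0302t"
# The same word written with a decimal numeric character reference
ncr_sample = "Vi&#7879;t"

precomposed_from_nfd = to_precomposed(composite_sample)
precomposed_from_ncr = to_precomposed(ncr_sample)
```

Both samples should come out as the same four-character string, with the single precomposed letter U+1EC7 in third position.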

The Java program requires Java Runtime Environment 6 or later. You can launch the program by double-clicking the UnicodeConverter.jar file. If that does not work, associate the .jar file extension with the Java interpreter so that the file can be launched with a double-click.

The .NET version requires Microsoft .NET Framework 4.0 Redistributable.

Preparations

To ensure successful conversion of HTML files in legacy formats and to minimize post-conversion editing, some pre-conversion cleanup of the source files may be needed. Removing obsolete dynamic font links (.pfr or .eot) and the associated ActiveX control scripts (e.g., tdserver.js) is recommended, because leaving them in needlessly slows down page download. The illustration below shows such a page; the lines to remove are the FONTDEF link and the tdserver.js script.

<html>
<head>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<title>HISTORY OF VIETNAM</title>
<link REL="FONTDEF" SRC="http://www.nonsong.org/ns.pfr">
<script LANGUAGE="JavaScript" SRC="http://www.nonsong.org/tdserver.js">
</script>
<link>
</head>

<body bgcolor="#FFFFFF" link="#FF0000" vlink="#FF0000">
<font FACE="VNI-Times">

<h1>HISTORY OF VIETNAM</h1>

You may also need to change the fonts used in the original document to the more common fonts for its encoding.

Encoding    Fonts for original HTML document
VNI         VNI-Times, VNI Times, VNI-Aptima, VNI Aptima, VNI-Helve, VNI Helve
VPS         VPS Times, VPS Helv
VISCII      VI Times, VI Arial, HoangYen, MinhQuân, PhuongThao, ThaHuong, UHoài
TCVN3       .VnTime, .VnArial
VIQR        No font formatting

These basic editing tasks should be done before the actual conversion. They can be performed quickly with MDI (multiple document interface) text editors, which can open multiple files and run a global find/replace on all open files at once. Text editors with these features include CuteHTML, TextPad, UltraEdit, EditPlus, and EditPad. They can be found and downloaded from http://www.download.com.
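If you prefer a script to a text editor, the cleanup described above can be sketched in Python. This is only a rough illustration, not part of UnicodeConverter: the function names strip_dynamic_fonts and clean_file are hypothetical, and the regular expressions assume the font link and script tags look like those in the illustration. Always work on a copy of the source files.

```python
import re
from pathlib import Path

# A <link ...> tag whose attributes mention a .pfr or .eot dynamic font.
FONT_LINK = re.compile(r'<link\b[^>]*\.(?:pfr|eot)\b[^>]*>\s*', re.IGNORECASE)
# A <script ...tdserver.js...> opening tag through its closing </script>.
FONT_SCRIPT = re.compile(
    r'<script\b[^>]*tdserver\.js[^>]*>.*?</script>\s*',
    re.IGNORECASE | re.DOTALL,
)


def strip_dynamic_fonts(markup: str) -> str:
    """Remove dynamic font links and the tdserver.js loader script."""
    markup = FONT_LINK.sub("", markup)
    return FONT_SCRIPT.sub("", markup)


def clean_file(path: Path) -> None:
    """Clean one legacy HTML file in place without altering other bytes."""
    # latin-1 maps every byte to exactly one code point, so text in any
    # legacy 8-bit encoding survives the decode/encode round trip unchanged.
    text = path.read_bytes().decode("latin-1")
    path.write_bytes(strip_dynamic_fonts(text).encode("latin-1"))
```

To process a whole tree, you could call clean_file on each path returned by Path("some_directory").rglob("*.htm*"), where the directory name is a placeholder for your own copy of the site.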

Running UnicodeConverter

  1. Java: Launch the conversion program by double-clicking on the UnicodeConverter.jar file or icon or by executing the following command at the command line:

        java -jar UnicodeConverter.jar
    or
        javaw -jar UnicodeConverter.jar

    Note: Be sure the directory that contains the UnicodeConverter.jar file is the current directory.

    .NET: Launch Uni.exe from the Windows desktop or Windows Explorer.

  2. Select the encoding of the source files. To convert individual files, click Select Files; to convert a directory, check the "Entire directory, including sub" checkbox to switch to directory selection mode.

    The Windows file dialog cannot select a directory, so to convert a directory and its subdirectories with the Windows/.NET program, select any file in that directory instead; this tells the program which directory to convert. If the directory contains no file that can be selected, create an empty file in it with the same file extension as the type of file you want to convert.

    [Figure: UnicodeConverter UI]

  3. Use the file filter to choose the type of files to work on. Select the files or directory to be converted in the file dialog box. To select multiple files, click them while holding down the Shift or Control key.

    [Figure: File Dialog]

  4. Click Convert. A message box will appear shortly indicating the directory where the output files are placed. During conversion, a status panel will pop up showing a list of the files that have been processed.

    [Figure: Output Directory]

  5. Done.
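The workaround in step 2 for the Windows/.NET file dialog (creating an empty file so that a directory can be selected) can be sketched in Python as follows. The function name ensure_selectable_file and the file name "placeholder" are hypothetical; any empty file with the right extension serves the same purpose.

```python
from pathlib import Path


def ensure_selectable_file(directory: Path, extension: str) -> Path:
    """Return a file in `directory` with the given extension, creating an
    empty placeholder if the directory has none at its top level."""
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and entry.suffix.lower() == extension.lower():
            return entry  # an existing file can already be selected
    placeholder = directory / ("placeholder" + extension)
    placeholder.touch()  # creates a zero-byte file
    return placeholder
```

For example, ensure_selectable_file(Path("site"), ".html") returns a file you can pick in the dialog, where "site" stands for your own source directory.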

The resulting Unicode output files are placed in an x_Unicode directory located at the same tree level as the source directory that contains the original files, which remain unchanged. Verify the UTF-8 encoded HTML files using any Unicode-enabled web browser, such as Firefox, Netscape, Internet Explorer, Mozilla, Opera, or Safari.
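Besides viewing the pages in a browser, you can run a quick automated check on the output. The Python sketch below is an assumption-laden illustration rather than a feature of UnicodeConverter: it only tests that a file decodes as UTF-8 and is already in precomposed (NFC) form, and the names is_utf8_nfc and find_suspect_files are hypothetical.

```python
import unicodedata
from pathlib import Path


def is_utf8_nfc(path: Path) -> bool:
    """True if the file is valid UTF-8 and its text is already NFC."""
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 at all
    # NFC text is unchanged by NFC normalization.
    return unicodedata.normalize("NFC", text) == text


def find_suspect_files(output_dir: Path, pattern: str = "*.htm*") -> list:
    """List files under output_dir that fail the UTF-8/NFC check."""
    return [p for p in sorted(output_dir.rglob(pattern)) if not is_utf8_nfc(p)]
```

An empty list from find_suspect_files means every matching file passed this check; it does not prove that the text itself was converted correctly, so a visual check is still worthwhile.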

The default fonts for the output files are Times New Roman and Arial. You can change to other Unicode-compliant fonts using Unicode-compatible editors or word processors such as FrontPage or Word. Do not use Unicode-incompatible editors (such as Notepad in Windows 9x/Me) to edit UTF-8 files. Doing so would corrupt the UTF-8 byte sequences, rendering the characters or the whole file unreadable.
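The kind of corruption described above can be reproduced with a short Python sketch. It models, as a simplifying assumption, an editor that reads and saves every file as windows-1252; real editors may behave differently in detail. The function name edit_as_cp1252 is hypothetical.

```python
def edit_as_cp1252(utf8_bytes: bytes) -> bytes:
    """Model an editor that treats every file as windows-1252 text."""
    # Each byte is shown as a separate character; bytes that windows-1252
    # leaves undefined (such as 0x90) cannot be shown and are replaced.
    shown = utf8_bytes.decode("cp1252", errors="replace")
    # On save, the replacement marker has no windows-1252 byte and
    # becomes "?", so the original byte is lost for good.
    return shown.encode("cp1252", errors="replace")


# U+0110 (capital D with stroke) is the two bytes C4 90 in UTF-8.
saved = edit_as_cp1252("\u0110".encode("utf-8"))

# Even when no byte is lost, the text is displayed as mojibake:
# the three UTF-8 bytes of U+1EC7 show up as three unrelated characters.
misread = "Vi\u1ec7t".encode("utf-8").decode("cp1252")
```

After the simulated save, the bytes in saved are no longer a valid UTF-8 sequence, which is why such a file can become unreadable in a Unicode-aware program.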

Note: Before converting Word/Excel/PowerPoint documents, close any files that are open in Microsoft Word/Excel/PowerPoint. Open files may cause errors or slow down the conversion process.

Tip: Keep the number of text boxes in Word documents to a minimum; too many will slow down conversion significantly.

