UnicodeConverter is a Windows (J++) program that converts text and HTML files in VNI, VISCII, VPS, TCVN3 (ABC), VIQR/Vietnet, NCR, and Unicode Composite (NFD) formats to Unicode Precomposed (NFC). The program is capable of converting multiple files in a directory, or an entire directory, including its subdirectories. In effect, this expanded capability enables conversion of an entire website to Unicode UTF-8 with a few mouse clicks.
Support for conversion of Word documents and Excel workbooks on the Windows platform is included. This feature is implemented using JACOB, a Java-COM Bridge that allows COM Automation components to be called from Java. JACOB uses the Java Native Interface (JNI) to make native calls into the COM and Win32 libraries; consequently, the added functionality is neither portable nor available on other platforms. Support for Rich Text Format files is also provided.
UnicodeConverter is released and distributed under the GNU General Public License. Its homepage is at http://unicodeconvert.sourceforge.net.
The program requires the Microsoft Java Virtual Machine for Windows 95/98/Me/NT/2000 (http://www.microsoft.com/java).
To convert Word or Excel documents, you need a Windows system with Microsoft Word or Excel installed. Place the file jacob.dll in a directory on your path, for example, the system32 folder.
UnicodeConverter can be launched from the Windows desktop or Explorer. If any errors occur while launching the program, installing the latest Microsoft VM will most likely resolve them.
Note: It is recommended that Microsoft Word/Excel have no files open while you convert Word/Excel documents. Open files may cause errors or slow down the conversion process.
You can select a single file or multiple files for conversion. There is a limit on the number of selected files per directory that can be converted at a time; this constraint (around 350) is due to the length limit on the File name property of the Microsoft File Dialog. When converting a directory, select any file in the directory to give the program a cue as to which directory is to be converted; if the directory contains no file to select, create in it an empty file with the same file extension as the type of file you want to convert.
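The directory cue described above can be sketched as follows. This is a minimal illustration in modern Java, not the program's actual code: the names DirectoryCueSketch, extensionOf, and filesToConvert are hypothetical, and the assumption that files are matched by the selected file's extension throughout the subdirectories is inferred from the description rather than taken from the source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DirectoryCueSketch {
    // Returns the lower-cased extension including the dot, or "" if there is none.
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot).toLowerCase(Locale.ROOT);
    }

    // Uses the selected file only as a cue: its parent directory is walked
    // recursively and every regular file with the same extension is returned.
    static List<Path> filesToConvert(Path selectedFile) {
        Path dir = selectedFile.toAbsolutePath().getParent();
        String ext = extensionOf(selectedFile);
        try (Stream<Path> tree = Files.walk(dir)) {
            return tree.filter(Files::isRegularFile)
                       .filter(p -> extensionOf(p).equals(ext))
                       .sorted()
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: DirectoryCueSketch <selected-file>");
            return;
        }
        for (Path p : filesToConvert(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Because the selected file serves only as a cue, an empty placeholder file works just as well as a real document in this sketch.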
Unicode Composite (NFD) source text files should be saved in UTF-8 format for correct conversion to Unicode Precomposed (NFC).
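To illustrate what the composite-to-precomposed step involves, here is a minimal sketch in modern Java. It relies on java.text.Normalizer (Java 6 and later), which is not part of the Microsoft VM this program runs on, so it should not be read as the program's own implementation; the class and method names (NfcSketch, toNfc) are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NfcSketch {
    // Decode the bytes as UTF-8, then compose base letters with their
    // combining marks into precomposed characters (NFC).
    static String toNfc(byte[] utf8Bytes) {
        String decomposed = new String(utf8Bytes, StandardCharsets.UTF_8);
        return Normalizer.normalize(decomposed, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Vie" + combining dot below (U+0323) + combining circumflex (U+0302) + "t"
        byte[] source = "Vie\u0323\u0302t".getBytes(StandardCharsets.UTF_8);
        String result = toNfc(source);
        System.out.println(result.equals("Vi\u1EC7t")); // true
        System.out.println(result.length());            // 4
    }
}
```

The sketch decodes the bytes as UTF-8 before normalizing; if the source were decoded with a different encoding, the combining marks would not be recognized and nothing would be composed.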
The resulting Unicode output files will be placed in an x_Unicode directory located at the same tree level as the source directory that contains the original files, which remain unchanged.
The default fonts for the output files are Times New Roman and Arial. Users can change to other Unicode-compliant fonts using Unicode-compatible HTML editors or word processors, such as FrontPage, Composer, or Microsoft Word. Do not use Unicode-incompatible editors (such as Notepad in Windows 9x/Me) to edit UTF-8 files. Doing so would corrupt the UTF-8 byte sequences, rendering the characters unreadable or incorrect.
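The following sketch shows why an editor that is not Unicode-aware garbles UTF-8 text. It decodes the UTF-8 bytes of one Vietnamese character as if each byte were a separate character. ISO-8859-1 is used here only because every Java runtime provides it; an actual Windows 9x editor would use the system's ANSI code page, so the exact garbage characters would differ. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8MisreadSketch {
    // Decode UTF-8 bytes as ISO-8859-1, the way a single-byte-only editor
    // would treat them: one character per byte.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u1EC7"; // one character, three UTF-8 bytes (E1 BB 87)
        String misread = misreadAsLatin1(original);
        System.out.println(original.length());             // 1
        System.out.println(misread.length());              // 3
        System.out.println(misread.charAt(0) == '\u00E1'); // true
    }
}
```

Once such an editor saves the file, or alters any one of the three bytes, the original character can no longer be recovered reliably.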
Use Firefox, Netscape, Internet Explorer (Windows), Opera, or Mozilla web browser to view UTF-8 HTML files. You will not need to change their default settings; the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag tells the browsers to use Unicode UTF-8 character encoding in displaying the page.
To ensure successful conversion of HTML files in legacy formats and to minimize post-conversion editing, some pre-conversion conditioning of the source files may be needed. It may be necessary to change the fonts in the original documents to the more common fonts for their encoding. Removing obsolete dynamic font links (.pfr or .eot) and associated ActiveX control scripts (e.g., tdserver.js) is also recommended, since leaving them in will needlessly slow down page download.
These basic editing tasks should be done prior to the actual conversion process and can be performed quickly with MDI (multiple document interface) text editors, which allow opening multiple files and performing global find/replace actions on all open files at once. CuteHTML, TextPad, UltraEdit, EditPlus, and EditPad are some text editors that offer such features. They can be found and downloaded at http://www.download.com.
Encoding | Fonts for original HTML documents |
VNI | VNI-Times, VNI Times, VNI-Aptima, VNI Aptima, VNI-Helve, VNI Helve |
VPS | VPS Times, VPS Helv |
VISCII | VI Times, VI Arial, HoangYen, MinhQuân, PhuongThao, ThaHuong, UHoài |
TCVN3 | .VnTime, .VnTimeH, .VnArial, .VnArialH |
VIQR/Vietnet | No font formatting |
Note: Due to the nature of the TCVN3 encoding, some Vietnamese capital vowels will be incorrectly converted to lower case. Some post-conversion editing may be necessary.
Windows 95/98/Me have only limited Unicode support, but they are still capable of displaying all Vietnamese characters using appropriate Unicode fonts. Full Unicode support is built into Windows NT/2000/XP. Linux and Mac OS 8.5 or later have begun to provide support for Unicode. Mac OS X and Palm OS provide full Unicode support.
The following Windows fonts, which come supplied with Windows 98SE/Me/2000/XP, contain many Unicode characters, including Vietnamese:
Times New Roman, Courier New, Arial, Tahoma, Verdana, Palatino Linotype
This list of Unicode fonts is by no means comprehensive, as more and more fonts are being commercially developed or expanded to include Unicode characters.
Note: Users of Windows 95/98/NT should download the latest versions of these fonts, as the older versions, which are not fully Unicode-compliant, would display question marks (?) or squares (◻) for unsupported characters. They can be downloaded from http://sourceforge.net/projects/corefonts or http://sourceforge.net/projects/vietunicode.