UnicodeConverter

Java version 2.1

DESCRIPTION

UnicodeConverter is a Java program that converts text and HTML files in ISC, TCVN3 (ABC), VISCII, VNI, and VPS formats to Unicode UTF-8. Conversion support for Unicode Composite, Numeric Character References (NCR), and VIQR (Vietnet) is also included. In all cases, the output is in Unicode Normalization Form C, better known as Unicode Precomposed format. The program runs in both graphical user interface (GUI) and command-line modes and can convert multiple files in a directory, or an entire directory including its subdirectories. This makes it possible to convert an entire website to Unicode UTF-8 with a single command or a few mouse clicks. Drag-and-drop support is also included.
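To illustrate what converting Numeric Character References involves, here is a minimal Java sketch that replaces decimal and hexadecimal NCRs with the characters they denote. It is not taken from the UnicodeConverter source; the class name NcrSketch and the method decode are hypothetical, and the program's own handling may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of UnicodeConverter.
class NcrSketch {
    // Matches &#NNNN; (decimal) and &#xHHHH; (hexadecimal) references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    /** Replaces each valid NCR with its character; invalid ones are left as-is. */
    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // 7879 decimal is U+1EC7, LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW.
        System.out.println(decode("Vi&#7879;t Nam"));
    }
}
```

The result still has to be written out with an explicit UTF-8 encoding for the file to be valid UTF-8.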

Support for conversion of Word documents, Excel workbooks, and PowerPoint presentations on the Windows platform is included. This feature is implemented using JACOB, a Java-COM Bridge that allows clients to call COM Automation components from Java. JACOB uses the Java Native Interface (JNI) to make native calls into the COM and Win32 libraries; consequently, the added functionality is neither portable nor available on other platforms. Conversion support for Rich Text Format files is also provided.

UnicodeConverter is released and distributed under the GNU General Public License. Its homepage is at http://unicodeconvert.sourceforge.net.

SYSTEM REQUIREMENTS

You need the Java Runtime Environment, Standard Edition (JRE) 6 or later installed on your machine to run UnicodeConverter. The JRE consists of the Java virtual machine, the Java platform core classes, and the supporting files needed to run applications written in the Java programming language. It can be downloaded free of charge from http://www.oracle.com/technetwork/java/javase/downloads.

For Mac OS X 10.5.2 or later, Java Standard Edition 6 can be installed via Software Update in System Preferences.

To convert Word/Excel/PowerPoint documents, you need a Windows system with Microsoft Word/Excel/PowerPoint installed. Place the file jacob.dll in a directory on your path, for example the system32 or jre/bin folder.

HOW TO RUN UnicodeConverter

UnicodeConverter is written in the Java language and packaged as an executable Java archive. Download and unzip UnicodeConverter-2.1.zip. UnicodeConverter.jar is the executable archive to run. To launch the program in GUI mode, either double-click the UnicodeConverter.jar file or execute the command uni at the command line. Alternatively, the longer commands

java -jar UnicodeConverter.jar

or (on Windows)

javaw -jar UnicodeConverter.jar

will work, too. The filename is case-sensitive on some operating systems. Be sure the directory that contains the UnicodeConverter.jar file is the current directory.

Note: It is recommended that no files be open in Microsoft Word/Excel/PowerPoint while you convert Word/Excel/PowerPoint documents. Open files may cause errors or slow down the conversion process.

Tip: Minimize the number of text boxes within Word documents to a few; having too many will slow down conversion significantly.

You can select a single file, multiple files, or a directory d for conversion. The resulting Unicode output files will be placed in a d_Unicode directory located at the same tree level as the source directory that contains the original files, which remain unchanged. You can also drag files or a directory from the native file manager and drop them onto the application window to start the conversion.
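The output location rule can be sketched in Java as follows. This is an illustration of the naming convention described above for the case where a directory is selected, not code from UnicodeConverter; the class name OutputDirSketch and the method outputDirFor are hypothetical.

```java
import java.io.File;

// Hypothetical illustration; not part of UnicodeConverter.
class OutputDirSketch {
    /** Returns the sibling directory "<name>_Unicode" next to the given source directory. */
    static File outputDirFor(File sourceDir) {
        File absolute = sourceDir.getAbsoluteFile();
        // The output directory sits at the same tree level as the source directory.
        return new File(absolute.getParentFile(), absolute.getName() + "_Unicode");
    }

    public static void main(String[] args) {
        // "website" is a placeholder directory name.
        System.out.println(outputDirFor(new File("website")));
    }
}
```

For a source directory named website, the sketch yields a directory named website_Unicode in the same parent folder.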

The program can also function as a command-line program, which is frequently used in batch file processing:

java -jar UnicodeConverter.jar <SourceEncoding> <SourceFile/Dir> <TargetFile/Dir>

where possible options for source encoding are VNI, VISCII, VPS, VIQR, TCVN3, and UNI-COMP. This functionality works for text-based files only, not Word/Excel/PowerPoint documents.
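For batch processing from another Java program, the command line above can be launched with ProcessBuilder. The following is a minimal sketch under stated assumptions: the class BatchConvertSketch, the method buildCommand, and the directory names are hypothetical, and UnicodeConverter.jar is assumed to be in the current directory with java on the path.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of UnicodeConverter.
class BatchConvertSketch {
    /** Builds the documented command: java -jar UnicodeConverter.jar <enc> <src> <target>. */
    static List<String> buildCommand(String encoding, String source, String target) {
        return Arrays.asList("java", "-jar", "UnicodeConverter.jar", encoding, source, target);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder jobs: {source encoding, source directory, target directory}.
        String[][] jobs = {
            {"VNI", "site_vni", "site_vni_out"},
            {"TCVN3", "site_tcvn3", "site_tcvn3_out"},
        };
        for (String[] job : jobs) {
            Process p = new ProcessBuilder(buildCommand(job[0], job[1], job[2]))
                    .inheritIO()   // show the converter's output in this console
                    .start();
            System.out.println(job[1] + " finished with exit code " + p.waitFor());
        }
    }
}
```

Running the jobs one after another keeps the console output of each conversion separate.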

Unicode composite (UNI-COMP) source text files should be saved in UTF-8 format for correct conversion to Unicode precomposed.
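The composite-to-precomposed step corresponds to Unicode Normalization Form C, which the standard Java class java.text.Normalizer provides. The sketch below shows the idea and why the input bytes must be decoded as UTF-8 first; the class name NfcSketch and its methods are hypothetical, and this is not necessarily how UnicodeConverter implements the conversion.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical illustration; not part of UnicodeConverter.
class NfcSketch {
    /** Returns the text in Normalization Form C (precomposed). */
    static String toPrecomposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Decodes UTF-8 bytes, then normalizes; a wrong charset here would garble the text. */
    static String toPrecomposed(byte[] utf8Bytes) {
        return toPrecomposed(new String(utf8Bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "e" + COMBINING CIRCUMFLEX ACCENT (U+0302) + COMBINING ACUTE ACCENT (U+0301)
        String composite = "e\u0302\u0301";
        String precomposed = toPrecomposed(composite);
        // The three code units combine into the single character U+1EBF.
        System.out.println(composite.length() + " -> " + precomposed.length());
    }
}
```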

The default fonts for the output UTF-8 HTML files are Times New Roman and Arial. Users can change to other Unicode-compliant fonts using Unicode-compatible HTML editors such as FrontPage or Dreamweaver. Do not use Unicode-incompatible editors (such as Notepad on Windows 9x/Me) to edit UTF-8 files. Doing so would corrupt the UTF-8 byte sequences, rendering the characters or the file unreadable.

FILE PREPARATIONS FOR CONVERSION

To ensure successful conversion of HTML files in legacy formats and to minimize post-conversion editing, some pre-conversion conditioning may need to be performed on the source files. The fonts in the original documents may need to be changed to the more common fonts for their encoding (see table below). Removing obsolete dynamic font links (.pfr or .eot) and associated ActiveX control scripts (e.g., tdserver.js) is also recommended, since leaving them in needlessly slows down page download.

These basic editing tasks should be done before the actual conversion. They can be performed quickly with MDI (multiple document interface) text editors, which can open multiple files and run global find/replace actions on all open files at once. CuteHTML, TextPad, UltraEdit, EditPlus, and EditPad are some text editors with these features. They can be found and downloaded from http://www.download.com.

Source Encoding    Fonts for original HTML documents
VNI                VNI-Times, VNI Times, VNI-Aptima, VNI Aptima, VNI-Helve, VNI Helve
VPS                VPS Times, VPS Helv
VISCII             VI Times, VI Arial, HoangYen, MinhQuân, PhuongThao, ThaHuong, UHoài
TCVN3              .VnTime, .VnTimeH, .VnArial, .VnArialH
VIQR               No font formatting

Note: Due to the nature of the TCVN3 encoding, some Vietnamese capital vowels will be converted incorrectly to lower case. Some post-conversion editing may be necessary.

UNICODE-COMPLIANT FONTS

Unicode has only limited support in Windows 95/98/Me, but these systems are still capable of displaying all Vietnamese characters using appropriate Unicode fonts. Full Unicode support is built into Windows NT/2000/XP. Linux and Mac OS 8.5 or greater have begun to support Unicode. Mac OS X and Palm OS provide full Unicode support.

The following TrueType fonts, which come supplied with Windows 98SE/Me/2000/XP, contain many Unicode characters, including Vietnamese:

Times New Roman, Courier New, Arial, Tahoma, Verdana, Palatino Linotype

This list of Unicode fonts is by no means comprehensive, as more and more fonts are being commercially developed or expanded to include Unicode characters.

Note: Users of Windows 95/98/NT or Mac OS X should download the latest versions of these fonts, as the older versions, which are not fully Unicode-compliant, would display question marks (?), squares (◻), or glyphs from substitute fonts for unsupported characters. They can be downloaded from http://sourceforge.net/projects/corefonts or http://sourceforge.net/projects/vietunicode.