Vietnamese Unicode FAQs

WHAT IS UNICODE?

Unicode (ISO/IEC 10646) was designed as a 16-bit character encoding (UCS-2) that contains all of the characters (2^16 = 65,536 different characters in total) in common use in the world's major languages, including Vietnamese. Later versions of the standard extend the code space beyond 16 bits, but every Vietnamese character falls within the original 16-bit range. The Universal Character Set provides an unambiguous representation of text across a range of scripts, languages and platforms. It provides a unique number, called a code point (or scalar value), for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode standard is modeled on the ASCII character set. Since ASCII's 7-bit character size is inadequate to handle multilingual text, the Unicode Consortium adopted a 16-bit architecture which extends the benefits of ASCII to multilingual text.

In the 16-bit form (UCS-2), Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility. Computer programs that use Unicode character encoding to represent characters but do not display or print text can (for the most part) remain unaltered when new scripts or characters are introduced.

The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and many others. Unicode is required by modern standards such as XML, Java, .NET, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, offers significant cost savings over the use of legacy character sets. It allows data to be transported through many different systems without corruption.

UNICODE AS A NATIONAL STANDARD?

At present, a number of countries, like China, Korea, and Japan, have adopted Unicode as their national standards, sometimes after adding additional annexes with cross-references to older national standards and specifications of various national implementation subsets.

In September 2001, Vietnam's Ministry of Science, Technology and Environment (MOSTE) issued the TCVN 6909:2001 standard, which is based on ISO/IEC 10646 and Unicode 3.1, as the new national standard for Vietnamese 16-bit character encoding.

WHAT IS UTF-8?

The Unicode Standard (ISO 10646) defines a 16-bit universal character set which encompasses most of the world's writing systems. 16-bit characters, however, are not compatible with many current applications and protocols that assume 8-bit characters (such as the Web) or even 7-bit characters (such as mail), and this has led to the development of a few so-called UCS transformation formats (UTF), each with different characteristics. Unicode provides for a byte-oriented encoding called UTF-8 that has been designed for ease of use with existing ASCII-based systems. UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a unique sequence of one to four bytes. The UTF-8 encoding allows Unicode to be used in a convenient and backwards compatible way in environments that, like Unix, were designed entirely around ASCII. It was introduced to provide an ASCII backwards compatible multi-byte encoding.

The Unicode UTF-8 format of ISO 10646 is the preferred default character encoding for internationalization of Internet application protocols. It is expected to be the most common encoding on the World Wide Web. Being a multi-byte format built from 8-bit units, it is a natural fit for the web, which is itself based on 8-bit protocols. UTF-8, in fact, is the only Unicode format that is commonly supported by web browsers. It is being adopted and deployed by many major Vietnamese online media and publications.

A Vietnamese-language file in UTF-8 encoding is roughly 1.2 times larger than a file with the same content encoded in a legacy format (e.g., VPS, VISCII, or TCVN3), because Vietnamese characters (mostly vowels) usually require two to three bytes each in UTF-8. The following are some examples of Vietnamese characters in UTF-8 format.

Vietnamese Character    16-bit Unicode    UTF-8 Bytes    Bytes viewed as Windows-1252

Ồ                       U+1ED2            E1 BB 92       á»’
ồ                       U+1ED3            E1 BB 93       á»“
Ờ                       U+1EDC            E1 BB 9C       á»œ
ơ                       U+01A1            C6 A1          Æ¡
ư                       U+01B0            C6 B0          Æ°
ứ                       U+1EE9            E1 BB A9       á»©
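
The code point and byte values in the table above can be reproduced in a few lines of code. The following is a minimal Python sketch (Python is chosen here only for illustration, and the helper name utf8_hex is hypothetical). It encodes each sample character as UTF-8 and prints the code point next to the resulting bytes.

```python
# Sketch: reproduce the code point and UTF-8 byte columns of the table.
# Characters are written as escapes so the source file's own encoding
# cannot affect the result.
SAMPLES = ["\u1ED2", "\u1ED3", "\u1EDC", "\u01A1", "\u01B0", "\u1EE9"]


def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for ch in SAMPLES:
    # Only ASCII is printed, so this works on any console.
    print(f"U+{ord(ch):04X}  {utf8_hex(ch)}")
```

Characters below U+0800, such as ơ and ư, take two bytes, while the precomposed letters in the U+1Exx range take three.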

UNICODE NORMALIZATION FORMS

In addition to the encoding alternatives, the Unicode standard also specifies various Normalization Forms to remove encoding ambiguities caused by the presence of precomposed and compatibility characters:

Normalization Form C (NFC): The normalization form that results from the canonical decomposition of a Unicode string, followed by the replacement of all decomposed sequences by single-codepoint precomposed characters where possible, e.g., U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE) normalized from U+0041 U+0301 (LATIN CAPITAL LETTER A, COMBINING ACUTE).
Normalization Form D (NFD): The normalization form that results from the canonical decomposition of a Unicode string. Precomposed characters are decomposed into combining character sequences where possible, e.g., U+0041 U+0301 (LATIN CAPITAL LETTER A, COMBINING ACUTE) normalized from U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE). If a character is not decomposable, then its canonical decomposition is equal to itself.
Normalization Form KC (NFKC): Practically the same as NFC for Vietnamese characters
Normalization Form KD (NFKD): Practically the same as NFD for Vietnamese characters
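
The conversions described above can be tried out with the unicodedata module of Python's standard library. This is a small sketch rather than a complete treatment; the variable names and the code_points helper are illustrative only.

```python
import unicodedata

composed = "\u00C1"          # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "\u0041\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT


def code_points(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose where possible
print(code_points(nfc))  # U+00C1
print(code_points(nfd))  # U+0041 U+0301

# For a Vietnamese letter such as U+1EC7, the compatibility forms
# give the same result as the canonical forms.
viet = "\u1EC7"
same_composed = unicodedata.normalize("NFKC", viet) == unicodedata.normalize("NFC", viet)
same_decomposed = unicodedata.normalize("NFKD", viet) == unicodedata.normalize("NFD", viet)
print(same_composed, same_decomposed)
```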

Several combining characters can be applied to one base character when it is necessary to stack multiple accents or to add combining marks both above and below it. NFD combining character sequences may not display well if the diacritical marks are not properly aligned with the base character. This display problem exists in many applications and platforms because the required font rendering mechanism can be highly complex and may not be supported. With NFD there is also a risk during editing that a combining diacritical mark becomes separated from its intended base character when other characters are mistakenly typed between them, e.g., tháng may become thańg.

Precomposed characters are easier to handle and look better on displays and in print. They should be preferred over combining character sequences where available. NFC is the preferred way of encoding text in Unicode under Linux. The W3C Character Model for the World Wide Web also uses NFC for XML and related standards.

In a programming context, the string length function in many modern programming languages, such as Java or C#, can return an unexpected number of characters for non-NFC strings. For instance, the length of "ệ" is 2 (if stored as "ê"+"." or "ẹ"+"^") or 3 (if fully decomposed as "e"+"^"+"." or "e"+"."+"^"). Each result is correct for its representation, but it is inconsistent with the appearance of the string. When the string "ệ" is in NFC, the length is consistently 1.
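
The length behavior can be checked with a short Python sketch. Note the assumption: Python's len counts code points, whereas Java and C# count 16-bit code units, but the two counts agree for these characters because they all lie within the 16-bit range. The dictionary name and its labels are illustrative.

```python
import unicodedata

# Five ways of storing the same visible letter (U+1EC7).
FORMS = {
    "precomposed": "\u1EC7",
    "e-circumflex + dot below": "\u00EA\u0323",
    "e-dot-below + circumflex": "\u1EB9\u0302",
    "e + circumflex + dot below": "e\u0302\u0323",
    "e + dot below + circumflex": "e\u0323\u0302",
}

for label, s in FORMS.items():
    normalized = unicodedata.normalize("NFC", s)
    # Raw length varies with the representation; NFC length does not.
    print(f"{label}: length {len(s)}, after NFC {len(normalized)}")
```

Normalizing to NFC before measuring or comparing strings is one way to get consistent results regardless of how the text was entered.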

UNICODE & VIETNAMESE CHARACTER ENCODINGS

All legacy Vietnamese character encodings were based on an 8-bit character set similar to the Latin-1 ANSI character set. Most popular among them were VNI, VPS, VISCII, and TCVN3 (ABC). A comparison study of Unicode and these legacy character sets is available separately.

UNICODE-SUPPORTED WEB BROWSERS

Netscape, Mozilla, Internet Explorer, Opera, and Safari web browsers provide support for displaying Unicode UTF-8-encoded HTML files. Users will not need to change the default settings of their browsers in order to view UTF-8 pages that are coded using the appropriate HTML tags.

Internet web browsers use the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, they use the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, they use the character set specified by the meta element in the document. They use the user's preferences if no meta element is specified.
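
The order of precedence described above can be sketched as a small function. This is a simplified illustration in Python, not how any particular browser is implemented: the function names are hypothetical, and the regular expressions stand in for a real HTTP header and HTML parser.

```python
import re

# Simplified patterns; a real browser uses a full parser.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def charset_in(value):
    """Return the charset named in a header value or tag, or None."""
    match = CHARSET_RE.search(value or "")
    return match.group(1).lower() if match else None


def resolve_charset(http_content_type, html_head, user_default):
    """Pick a charset: HTTP header first, then meta element, then user preference."""
    charset = charset_in(http_content_type)
    if charset:
        return charset
    for tag in META_RE.findall(html_head or ""):
        charset = charset_in(tag)
        if charset:
            return charset
    return user_default.lower()


head = '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>'
print(resolve_charset("text/html; charset=ISO-8859-1", head, "windows-1252"))
print(resolve_charset("text/html", head, "windows-1252"))
print(resolve_charset(None, "<head></head>", "windows-1252"))
```

The three calls exercise the three cases in turn: the server's charset wins, then the meta element, then the user's preference.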

You can use the meta element to explicitly set the character set for a document. In this case, set the HTTP-EQUIV attribute to Content-Type and specify a character set identifier in the CONTENT attribute. For example, the following meta element identifies Unicode UTF-8 as the character set for the document.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames.

UNICODE-COMPLIANT FONTS

Windows 95/98/Me have only limited support for Unicode, yet they are still capable of displaying all Vietnamese characters using appropriate Unicode fonts. Full Unicode support is built into Windows NT/2000/XP. Linux and Mac OS 8.5 or greater have begun to support Unicode. Mac OS X and Palm OS provide full Unicode support.

The following Windows fonts, which come supplied with Windows 98SE/Me/2000/XP, contain many Unicode characters, including Vietnamese:

Times New Roman, Courier New, Arial, Tahoma, Verdana, Palatino Linotype

Note: Users of Windows 95/98/NT should download the latest versions of these fonts, as the older versions, which are not fully Unicode-compliant, would display question marks (?) or squares (◻) for unsupported characters. They can be downloaded from the Internet. These fonts are also included in WinNT Service Pack 4, in Internet Explorer 5.5 or later, and in Microsoft Office 2000.

This list of Unicode fonts is by no means comprehensive, as more and more fonts are being commercially developed or expanded to include Unicode characters.

UNICODE-ENABLED KEYBOARD DRIVERS

There are many open-source or free Unicode-compatible Vietnamese keyboard drivers available on the Internet. They typically support the three most common Vietnamese input methods: Telex, VNI, and VIQR. For them to work, the applications that they are to be used with must also support entry and display of Unicode characters.

For typing Vietnamese Unicode characters on Windows, you can use VPSKeys 4.3, WinVNKey 4.0, or UniKey 3.6 to produce Unicode-encoded HTML pages and text documents using popular editor or word processor applications. Be sure to specify Unicode-compliant fonts and select appropriate Unicode character encoding for your documents.

On Linux/Unix systems, you can use xvnkb or X-Unikey for Vietnamese input in X-Window.

On Mac OS X from version 10.2, you can use Telex, VNI, and VIQR keyboard layouts that emit Vietnamese text in Unicode Normalization Form C (NFC), or the built-in layout.

On Java 2 platforms, you can use VietIME to input Vietnamese Unicode text in Java's AWT and Swing text components.

Note: Do not use a Unicode-incompatible text editor (such as Notepad of Windows 98/Me) to modify a text document or an HTML source file encoded in UTF-8 format. Doing so would corrupt the UTF-8 byte sequence, rendering the characters unreadable. Examples of Unicode-compatible text editors are Notepad of Windows NT/2000/XP and VietPad.
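
One plausible way such corruption arises can be simulated in Python. The sketch below assumes that the editor treats the file as Windows-1252 and replaces any byte it cannot map when loading and saving; the actual behavior of a given editor may differ, and the function name is hypothetical.

```python
# U+1EDD (o with horn and grave) is E1 BB 9D in UTF-8.
# Byte 9D has no character assigned in Python's Windows-1252 codec.
original = "\u1EDD".encode("utf-8")


def legacy_round_trip(data, legacy="cp1252"):
    """Simulate opening and re-saving bytes in a single-byte editor."""
    text = data.decode(legacy, errors="replace")   # unmapped bytes become U+FFFD
    return text.encode(legacy, errors="replace")   # U+FFFD is written back as "?"


saved = legacy_round_trip(original)
print(original.hex(), "->", saved.hex())

try:
    saved.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # The three-byte sequence was broken, so it is no longer valid UTF-8.
    still_valid = False
print("valid UTF-8 after round trip:", still_valid)
```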

UNICODE-COMPATIBLE VIETNAMESE DOCUMENTS

"How to create Vietnamese Unicode documents" is an essential guide to creating Vietnamese-language HTML and text documents that comply with the Unicode standard. It covers the topic with practical examples using popular word processors and HTML editors.

UNICODE CONVERSION UTILITIES

Vietnamese HTML documents on many Vietnamese-language web sites and in archives around the world are currently still in various legacy encoding formats. There are a few utility programs available to convert these legacy documents to Unicode-standard formats. They can convert text, HTML, and Word files encoded in VNI, VISCII, VPS, TCVN3, or VIQR/Vietnet format to Unicode formats, and are capable of converting multiple files, a directory, including subdirectories, or an entire website.
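
To illustrate what such a conversion involves, here is a simplified Python sketch for the VIQR case only. It is not the code of any of the utilities mentioned; the function name is hypothetical, and the sketch ignores real-world complications such as sentence-ending punctuation directly after a vowel, HTML markup, and non-Vietnamese words containing "dd".

```python
import unicodedata

# VIQR mnemonic character -> Unicode combining mark
VIQR_MARKS = {
    "^": "\u0302",  # circumflex
    "(": "\u0306",  # breve
    "+": "\u031B",  # horn
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}
VOWELS = set("aeiouyAEIOUY")


def viqr_to_unicode(text):
    """Convert simple VIQR text to NFC Unicode (simplified sketch)."""
    out = []
    after_vowel = False  # True while marks may still attach to the last vowel
    i = 0
    while i < len(text):
        ch = text[i]
        pair = text[i:i + 2]
        if ch == "\\" and i + 1 < len(text):
            # Backslash escape: keep the next character literally.
            out.append(text[i + 1])
            after_vowel = False
            i += 2
        elif pair in ("dd", "DD", "Dd"):
            # Letter d with stroke, lowercase or uppercase.
            out.append("\u0111" if pair == "dd" else "\u0110")
            after_vowel = False
            i += 2
        elif after_vowel and ch in VIQR_MARKS:
            out.append(VIQR_MARKS[ch])
            i += 1
        else:
            out.append(ch)
            after_vowel = ch in VOWELS
            i += 1
    # Compose base letters and combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


converted = viqr_to_unicode("Vie^.t Nam")
print([f"U+{ord(c):04X}" for c in converted[:4]])
```

The design choice here is to emit combining marks and let NFC normalization do the composition, which avoids a large hand-written table of precomposed letters.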

To ensure successful conversion of files in legacy formats and to minimize post-conversion editing, some pre-conversion conditioning needs to be performed on the source files. Changing the original document fonts to the more common ones with respect to its original encoding may be needed. Removing obsolete dynamic font links (.pfr or .eot) and associated ActiveX control scripts (e.g., tdserver.js) is also recommended, for leaving them in will needlessly slow down page download.

These basic editing tasks should be done prior to the actual conversion process and can be performed quickly using MDI (multiple document interface) text editors, which allow opening multiple files and performing global find/replace actions on all open files at once. CuteHTML, TextPad, UltraEdit, EditPlus, and EditPad are some text editors that offer such features. They can be found and downloaded from http://www.download.com.

UNICODE EMAILS

Many current mail gateways were designed around the time when Internet messages were originally defined to be 7-bit ASCII only. As a result, UTF-8 HTML and text files and messages, which use 8-bit characters, are still being stripped by these email gateways during their transmission, handling, or delivery, rendering the files unreadable. The file corruption is usually evidenced by the appearance of inverted question marks (¿) in place of unsupported characters. (The 8-bit problem has led to the invention of UTF-7.) One way to get around this problem is to "wrap" the UTF-8 files in zip format before sending them as an email attachment. Users on the receiving end only need to unzip the attached zip file to recover the original UTF-8 files.
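
The zip workaround can also be scripted. The following Python sketch uses the standard zipfile module to wrap UTF-8 text in an archive and to recover it again; the function names and the file name are illustrative, and attaching the archive to a message is left to the mail program.

```python
import io
import zipfile


def zip_utf8_text(filename, text):
    """Return the bytes of a zip archive holding `text` encoded as UTF-8."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(filename, text.encode("utf-8"))
    return buffer.getvalue()


def unzip_utf8_text(archive_bytes, filename):
    """Recover the original text from an archive made by zip_utf8_text."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(filename).decode("utf-8")


message = "Ti\u1EBFng Vi\u1EC7t"
packed = zip_utf8_text("thu.txt", message)
restored = unzip_utf8_text(packed, "thu.txt")
print(restored == message)
```

Because mail programs send a zip archive as a binary attachment, the UTF-8 bytes inside it are not exposed to the gateway as text.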

Note: Microsoft Office documents, however, seem immune to this problem. They are able to retain their file encoding information when sent as email attachments.

The 7-bit mail gateways are being replaced by more modern 8-bit programs which can handle UTF-8 files without modifications. BasiliX and NeoMail are some examples of email gateways compatible with UTF-8. Popular Hotmail and Yahoo currently offer Unicode-compatible email services.

UNICODE PRINTING

Printing Unicode documents is sometimes problematic in Windows 9x/Me due to their partial support of Unicode; nevertheless, most problems can be resolved by updating the printer driver to the latest version or by setting the appropriate printer options. This usually involves selecting the option to send fonts (or TrueType fonts) as bitmaps. Another solution is to use commercial printer driver software, such as FinePrint.