Re: Unicode problem ???

Björn Lundin <bnl@tiscali.se> · Wed, 21 Apr 2004 21:35:19 +0200

Stijn Vanroye wrote:

> From what I hear, UNICODE indeed seems the best option. But then again, that
> encoding stuff is still a bit of a mystery to me. What I personally don't

The following is cut from the documentation of XML Ada
(xmlada-1.0/docs/xml_2.html#SEC6),
which is available at http://libre.act-europe.fr/xmlada/

<quote>
We now know how each encoded character can be represented by an integer
value (code point) depending on the character set. 

Character encoding schemes deal with the mapping of a sequence of
integers to a sequence of code units. A code unit is a fixed-size group of
bytes (one byte for Utf8, two for Utf16, four for Utf32).
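
(An aside that is not part of the quoted documentation: a small Python
sketch of the code point / code unit distinction. The character chosen is
arbitrary, and the "-be" codecs are used only so that no byte order mark is
added to the output.)

```python
# Illustration only: one code point, three Unicode encoding schemes.
ch = "\u00e9"                   # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)            # the integer assigned by the character set

utf8 = ch.encode("utf-8")       # variable width, 1-byte code units
utf16 = ch.encode("utf-16-be")  # variable width, 2-byte code units
utf32 = ch.encode("utf-32-be")  # fixed width, 4-byte code units

print(hex(code_point), utf8.hex(), utf16.hex(), utf32.hex())
# prints: 0xe9 c3a9 00e9 000000e9
```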

There exist a number of possible encoding schemes. Some of them encode all
integers on the same number of bytes. They are called fixed-width encoding
forms, and include the standard encoding for Internet emails (7-bit, which
can't encode all characters), as well as the simple 8-bit schemes and the
EBCDIC scheme. Among them is also the UTF-32 scheme, which is defined in the
Unicode standard.

Another set of encoding schemes encodes integers on a variable number of
bytes. These include two schemes that are also defined in the Unicode
standard, namely Utf-8 and Utf-16.

Unicode doesn't impose any specific encoding. However, it is most often
associated with one of the Utf encodings. They each have their own
properties and advantages: 

Utf32 
This is the simplest of all these encodings. It simply encodes every
character on 32 bits (4 bytes). This covers all the possible characters
in Unicode, and is obviously straightforward to manipulate. However, given
that the first 65536 code points in Unicode (0 to 65535) are enough for
nearly all characters of the languages currently in use, Utf32 is also a
waste of space in most cases.
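
(Again not from the quoted documentation: a rough Python sketch of the
trade-off. The sample string is made up; it only shows that Utf32 makes
indexing trivial at the cost of four bytes per character.)

```python
# Illustration only: size of the same ASCII text in three encodings.
text = "Unicode"
sizes = {name: len(text.encode(name))
         for name in ("utf-8", "utf-16-be", "utf-32-be")}
print(sizes)  # {'utf-8': 7, 'utf-16-be': 14, 'utf-32-be': 28}

# Fixed width means character i always starts at byte offset 4 * i.
raw = text.encode("utf-32-be")
i = 3
fourth_char = raw[4 * i:4 * i + 4].decode("utf-32-be")
print(fourth_char)  # c
```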

Utf16 
For the above reason, Utf16 was defined. Most characters are encoded on
only two bytes (which is enough for the first 65536 code points, and so for
most characters in current use). In addition, a range of special code
points has been set aside, known as surrogates, which are used in pairs to
encode integers greater than 65535. Those integers are then encoded on four
bytes. As a result, Utf16 requires less space than Utf32 for most text.
However, it is also more complex to decode.
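
(Not from the quoted documentation: a minimal Python sketch of how a code
point above 65535 is split into a surrogate pair. The function name
to_surrogates is made up for this example; the arithmetic is the standard
Utf16 rule, and the result is checked against Python's own codec.)

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)       # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))               # 0xd834 0xdd1e

# Cross-check against the built-in codec: two 16-bit units, four bytes.
encoded = chr(0x1D11E).encode("utf-16-be")
print(encoded.hex())                     # d834dd1e
```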

Utf8 
This is an even more space-efficient encoding for text that is mostly
ASCII, but is also more complex to decode. More importantly, it is
compatible with 7-bit ASCII, the common base of the most widely used simple
8-bit encodings.

Utf8 has the following properties: 

Characters 0 to 127 (ASCII) are encoded simply as a single byte. This means
that files and strings which contain only 7-bit ASCII characters have the
same encoding under both ASCII and UTF-8. 

Characters greater than 127 are encoded as a sequence of several bytes, each
of which has the most significant bit set. Therefore, no ASCII byte can
appear as part of any other character. 

The first byte of a multibyte sequence that represents a non-ASCII character
is always in the range 0xC0 to 0xFD (0xC2 to 0xF4 under the current
definition, RFC 3629) and it indicates how many bytes follow for this
character. All further bytes in a multibyte sequence are in the range 0x80
to 0xBF. This allows easy resynchronization and makes the encoding
stateless and robust against missing bytes.

UTF-8 encoded characters could theoretically be up to six bytes long in the
original definition; RFC 3629 limits them to four bytes (code points up to
0x10FFFF). Characters in the first 16-bit range are at most three bytes
long.
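
(Not from the quoted documentation: a hand-written Python sketch of the
Utf8 rules listed above, compared with the built-in codec. The name
utf8_encode is made up here. The sketch stops at 0x10FFFF, i.e. four bytes,
and so ignores the longer five- and six-byte forms; it also does not reject
surrogate code points, which a real encoder must.)

```python
def utf8_encode(code_point):
    """Encode one code point (0 .. 0x10FFFF) as UTF-8 bytes."""
    if code_point < 0x80:                 # ASCII: a single byte, high bit clear
        return bytes([code_point])
    if code_point < 0x800:                # lead byte 110xxxxx + 1 continuation
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:              # lead byte 1110xxxx + 2 continuations
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:             # lead byte 11110xxx + 3 continuations
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")

for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(hex(cp), encoded.hex())
# prints:
# 0x41 41
# 0xe9 c3a9
# 0x20ac e282ac
# 0x1d11e f09d849e
```

Every byte of a multibyte sequence has its top bit set, so a plain ASCII
byte such as "/" or a NUL can never turn up inside another character.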

Note that the Utf16 and Utf32 encodings above, unlike Utf8, each have two
versions (big-endian and little-endian), depending on the chosen byte order
on the machine.
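
(Not from the quoted documentation: a short Python sketch of the byte order
issue, using the standard codec names. The plain "utf-16" codec uses the
machine's native order and prepends a byte order mark, so its exact output
depends on the platform.)

```python
import codecs

s = "A"                               # code point 0x41

big = s.encode("utf-16-be")           # most significant byte first
little = s.encode("utf-16-le")        # least significant byte first
print(big.hex(), little.hex())        # 0041 4100

# Without an explicit order, a byte order mark tells the reader which
# variant was used.
marked = s.encode("utf-16")
has_bom = marked[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)                        # True

# Utf8 is a sequence of single bytes, so there is nothing to swap.
print(s.encode("utf-8").hex())        # 41
```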

</quote>

So yes, Unicode in Utf8 is tricky to handle
/Björn
