Re: UTF-8, UTF-16 and UTF-32

me22 <me22.ca@xxxxxxxxx> · Tue, 26 Aug 2008 14:50:08 -0400

On Tue, Aug 26, 2008 at 14:32, Andrew Haley <aph@xxxxxxxxxx> wrote:
>
> Just in case anyone thinks that UTF-16 might be a good format for saving
> data in files or for data to be sent over a network, here's a gem from
> Microsoft:
>
> 'The example in the documentation didn't specify Little Endian, so
> the Unicode string that the code generates is Big Endian.  The
> SQL Server Driver for PHP expected Big Endian, so the data
> written to SQL Server is not what was expected.  However, because
> the code to retrieve the data converts the string from Big Endian
> back to UTF-8, the resulting string in the example matches the
> original string.
>
> 'If you change the Unicode charset in the example from "UTF-16"
> to "UCS-2LE" or "UTF-16LE" in both calls to iconv, you'll still
> see the original and resulting strings match but now you'll also
> see that the code sends the expected data to the database.'
>
> http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=3644735&SiteID=1
>

Absolutely.  UTF-8 is the only one of the three without possible
byte-ordering issues, so it (or UTF-7, if the channel is only 7-bit
clean) is the only reasonable option for interchange.  For text, the
size overhead isn't that high anyway, and with compression it's not
bad at all.  (Within a given language, all the bytes in the UTF-8
representation of a codepoint are the same except the last and maybe
the second-last, so even a naive Huffman coder can pretty much
eliminate the cost in size over UTF-16, since UTF-16 has the same
kind of shared leading bytes for specific languages.)
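
To make the two points above concrete, here is a minimal Python
sketch (not from the original thread; the sample strings are
arbitrary).  It shows that UTF-16 has two byte serializations where
UTF-8 has one, and that text in a single script shares its leading
bytes in both encodings.

```python
# Hypothetical sample strings, chosen only for illustration.
ascii_text = "abc"
cyrillic_text = "привет"  # six Cyrillic letters, two UTF-8 bytes each

# UTF-16 has two possible byte orders for the same text; UTF-8 has one.
utf16_be = ascii_text.encode("utf-16-be")  # b'\x00a\x00b\x00c'
utf16_le = ascii_text.encode("utf-16-le")  # b'a\x00b\x00c\x00'
utf8 = ascii_text.encode("utf-8")          # b'abc'

# Reading big-endian bytes as little-endian raises no error here;
# it silently produces different characters.
misread = utf16_be.decode("utf-16-le")

# Every letter in this sample takes two bytes in both encodings, so
# every second byte starting at offset 0 is the leading/high byte.
utf8_lead_bytes = set(cyrillic_text.encode("utf-8")[0::2])
utf16_high_bytes = set(cyrillic_text.encode("utf-16-be")[0::2])

print(utf16_be != utf16_le)      # the two UTF-16 serializations differ
print(misread == ascii_text)     # the misread text no longer matches
print(sorted(hex(b) for b in utf8_lead_bytes))
print(sorted(hex(b) for b in utf16_high_bytes))
```

For this sample the UTF-8 leading bytes take only two distinct values
and the UTF-16 high bytes only one, which is the kind of redundancy
an entropy coder can squeeze out.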

And really, since at the glyph level even UTF-32 is a variable-width
encoding, you have to deal with variable width anyway, so I don't see
why it's worth not just using UTF-8 everywhere.  (For example, suppose
you have an s codepoint followed by a combining accent codepoint.
Pressing "backspace" with the cursor after it should probably erase
both codepoints.  At the same time, if it's an ffi ligature, then
backspace should probably replace it with an ff ligature.  So since
you can't just do --size on your string anyway...)
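
The backspace case can be sketched in Python as follows.  The
backspace() helper is hypothetical and deliberately naive: it only
handles trailing combining marks, whereas real grapheme segmentation
follows the rules in Unicode Standard Annex #29.  The ligature case is
only shown as a decomposition, not handled by the helper.

```python
import unicodedata


def backspace(text: str) -> str:
    """Hypothetical helper: drop the last base character together with
    any combining marks that follow it (naive, not full UAX #29)."""
    end = len(text)
    # Step back over trailing combining marks (nonzero combining class).
    while end > 0 and unicodedata.combining(text[end - 1]):
        end -= 1
    # Then drop the base character itself, if there is one.
    return text[:max(end - 1, 0)]


accented = "s\u0301"  # 's' followed by U+0301 COMBINING ACUTE ACCENT

# One user-visible character, but two codepoints, even in UTF-32.
codepoints = len(accented)
utf32_bytes = len(accented.encode("utf-32-le"))

# Dropping just one codepoint would leave a bare 's' behind;
# the helper removes the accent and its base together.
after = backspace("a" + accented)

# U+FB03 is the single-codepoint ffi ligature; compatibility
# decomposition shows the three letters it stands for.
ligature_letters = unicodedata.normalize("NFKD", "\ufb03")

print(codepoints, utf32_bytes, repr(after), ligature_letters)
```

So a fixed four bytes per codepoint still doesn't give you a fixed
width per thing-the-user-sees.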

~ Scott

P.S.  Are there any architectures around using middle-endian UTF-32? ;)