Re: Concerning about Unicode-aware string handling

Craig Ringer <ringerc@xxxxxxxxxxxxx> · Tue, 22 May 2012 12:31:50 +0800

On 05/21/2012 06:59 PM, Andrew Sullivan wrote:
> On Mon, May 21, 2012 at 02:44:45AM -0700, John R Pierce wrote:
>> support the bastardized UTF-16 'unicode' implemented by Windows NT
>
> To be fair to Microsoft, while the BOM might be an irritant, they do
> use a perfectly legitimate encoding of Unicode.  There is no Unicode
> requirement that code points be stored as UTF-8, and there is a strong
> argument to be made that, for some languages, UTF-8 is extremely
> inefficient and therefore the least preferred encoding.  (Microsoft's
> dependence on the BOM with UTF-16 -- really UCS2 -- is problematic, of
> course, and appears to be adjusted in funny ways in Win 7.)

In fact, until it became clear that UCS-2 (now UTF-16) wasn't enough and
we'd need up to 4 bytes to represent some characters, Microsoft's choice
of UCS-2 with a BOM looked really good. They just didn't realise that
UCS-2 would turn into UTF-16 when UCS-4 came on the scene, so they'd be
left holding a bastardised half-way mess that's usually-but-not-always 2
bytes per character: anything outside the Basic Multilingual Plane needs
a surrogate pair, i.e. 4 bytes.
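
As a rough illustration of the "usually-but-not-always 2 bytes" point, here
is a minimal Python 3 sketch. It is not from the original thread, and
`encoded_sizes` is a made-up helper name. It compares how many bytes a single
character takes in UTF-8 and in UTF-16 (without a BOM):

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character.

    'utf-16-le' is used so that no BOM is prepended to the output.
    """
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# ASCII letter, accented Latin letter, CJK ideograph, non-BMP emoji
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: utf-8={u8} bytes, utf-16={u16} bytes")
```

The CJK ideograph takes 3 bytes in UTF-8 but 2 in UTF-16, which is the
efficiency argument quoted above; the last character lies outside the Basic
Multilingual Plane and takes 4 bytes in both encodings.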

MS's choice allowed programs to work with the safe (at the time)
assumption that each char was exactly 2 bytes, which made things like
indexing, length and substring operations way simpler than they are in
UTF-8, and was well and truly worth the storage bloat IMO. Pity Unicode
had to grow again and break the assumption.
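
To show how that assumption breaks, here is a small sketch, assuming
Python 3.3 or later. Both function names are hypothetical, chosen only for
illustration. It counts "characters" in UTF-16 data the old fixed-width way
and then by actually decoding:

```python
def naive_char_count(utf16_bytes):
    """Character count under the old UCS-2 assumption: 2 bytes per char."""
    return len(utf16_bytes) // 2


def code_point_count(utf16_bytes):
    """Count code points by decoding, so surrogate pairs combine into one."""
    return len(utf16_bytes.decode("utf-16-le"))


text = "a\U0001F600b"            # 'a', one non-BMP character, 'b'
raw = text.encode("utf-16-le")   # little-endian UTF-16, no BOM

print(naive_char_count(raw))     # 4: counts 16-bit code units
print(code_point_count(raw))     # 3: counts code points
```

For text that stays inside the Basic Multilingual Plane the two counts agree,
which is why the fixed-width assumption was safe for as long as it was.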

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)