Re: Encoding, Unicode, locales, etc.

Karsten Hilbert <Karsten.Hilbert@xxxxxxx> · Wed, 1 Nov 2006 11:41:43 +0100

On Tue, Oct 31, 2006 at 11:47:56PM -0500, Tom Lane wrote:

> Because we depend on libc's locale support, which (on many platforms)
> isn't designed to switch between locales cheaply.  The fact that we
> allow a per-database encoding spec at all was probably a bad idea in
> hindsight --- it's out front of what the code can really deal with.
> My recollection is that the Japanese contingent argued for it on the
> grounds that they needed to deal with multiple encodings and didn't
> care about encoding/locale mismatch because they were going to use
> C locale anyway.  For everybody else though, it's a gotcha waiting
> to happen.

Could this paragraph be put into the docs and/or the FAQ,
please? Along with the recommendation that if you require
multiple encodings for your databases, you had better have your
OS locale configured properly for UTF8 and use UNICODE
databases, or do initdb with the C locale.
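
To illustrate the kind of gotcha meant here, a rough sketch in Python
(not PostgreSQL code; Python's codecs merely stand in for what a libc
locale assumes about the bytes it is handed, and the sample name is made
up):

```python
# Illustration only: Python codecs stand in for the encoding a libc
# locale expects. Nothing here is PostgreSQL code.

text = "Müller"  # hypothetical sample value

# The bytes as they would be stored in a LATIN1-encoded database.
latin1_bytes = text.encode("latin-1")


def decode_as(data: bytes, encoding: str) -> str:
    """Decode data, substituting U+FFFD for invalid byte sequences."""
    return data.decode(encoding, errors="replace")


# Locale encoding matches the database encoding: the text round-trips.
matching = decode_as(latin1_bytes, "latin-1")

# Locale expects UTF-8 but the bytes are LATIN1: the u-umlaut byte is
# not valid UTF-8, so the string is no longer what was stored.
mismatched = decode_as(latin1_bytes, "utf-8")

# Plain byte-wise comparison never interprets the bytes as characters,
# so it gives a well-defined order whatever the encoding is.
bytewise_sorted = sorted([b"Zeta", latin1_bytes, b"Alpha"])

print(matching == text)    # True
print(mismatched == text)  # False
```

Byte-wise comparison is roughly what the C locale gives you, which is
why it sidesteps the mismatch, at the price of not sorting according to
any language's rules.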

> This stuff is certainly far from ideal, but the amount of work involved
> to fix it is daunting; see many past pg-hackers discussions.

Here are a few data points from my Debian/Testing system in
favour of not worrying too much about the installed size of ICU,
since it is already being used by other packages anyway:

libicu36
Reverse Depends:
  openoffice.org-writer				* OOo
  openoffice.org-filter-so52
  openoffice.org-core
  libxerces27						* Xerces XML parser (Apache camp)
  libboost-regex1.33.1
  libboost-dbg

icu
Reverse Depends:
  libicu36
  libicu36
  libxercesicu26					* Xerces, again
  libxercesicu25
  libicu28-dev
  libicu28
  libicu21c102
  icu-i18ndata
  icu-data
  libwine							* Wine

This, of course, does not decrease the work required to get
this going in PostgreSQL.

Thanks for the great work,
Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346