Re: Mixing different LC_COLLATE and database encodings

Greg Stark <gsstark@xxxxxxx> · 20 Feb 2006 17:30:06 -0500

Martijn van Oosterhout <kleptog@xxxxxxxxx> writes:

> On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> > Is the Holy Grail encoding and lc_collate settings per column?
> 
> By way of example, see ICU which is an internationalisation library
> we're considering to get consistent locale support over all platforms.
> It supports one encoding, namely UTF-16. It has various functions to
> convert other encodings to or from that, but internally it's all
> UTF-16. So if we do use that, then all encodings (except native UTF-16)
> will need conversion all the time, so you don't buy anything by
> having the server in some random encoding.

Ugh. At least from my perspective that makes it a non-starter. As I'm sure you
realize, storage density is a major factor, often the dominant factor, in
database performance. Anything that would double the storage size for ASCII
foreign keys is going to be a terrible hit.
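
To put a number on the doubling, here is a minimal sketch of the arithmetic.
The function names are made up for illustration and it only looks at the
character payload, ignoring any per-value header the server adds:

```c
#include <stddef.h>
#include <string.h>

/* Bytes needed to store an ASCII-only key in a single-byte encoding
 * (ASCII, Latin-1, or UTF-8 restricted to ASCII): one byte per character. */
size_t ascii_key_bytes(const char *key)
{
    return strlen(key);
}

/* Bytes needed for the same key in UTF-16: every ASCII character becomes
 * one 16-bit code unit, i.e. two bytes. */
size_t utf16_key_bytes(const char *key)
{
    return 2 * strlen(key);
}
```

A ten-character key such as "ORD-000123" takes 10 bytes in the first case
and 20 in the second, and that ratio holds for every ASCII key and every
index entry built on it.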

And having to do an ASCII->UTF-16 conversion for every foreign key constraint
check would be nearly as bad. I know it's a simple conversion, but compared to
a simple strcmp in a critical code path it's going to increase CPU usage
significantly.
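
A rough sketch of the two code paths, to show where the extra work comes
from. This is not Postgres or ICU code; ascii_to_utf16, keys_equal_bytes,
keys_equal_utf16 and the 64-unit buffer are all hypothetical, chosen only to
keep the example small:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_CAP 64  /* arbitrary buffer size for this sketch */

/* Widen an ASCII string into UTF-16 code units.
 * Returns the number of code units written, or (size_t)-1 if the input
 * contains a non-ASCII byte or does not fit in the output buffer. */
size_t ascii_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    size_t n = 0;
    for (; src[n] != '\0'; n++) {
        if ((unsigned char)src[n] > 0x7F || n >= dst_cap)
            return (size_t)-1;
        dst[n] = (uint16_t)(unsigned char)src[n];
    }
    return n;
}

/* Direct path: compare the stored bytes as they are. */
int keys_equal_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Converted path: widen both keys first, then compare the code units. */
int keys_equal_utf16(const char *a, const char *b)
{
    uint16_t wa[KEY_CAP], wb[KEY_CAP];
    size_t na = ascii_to_utf16(a, wa, KEY_CAP);
    size_t nb = ascii_to_utf16(b, wb, KEY_CAP);

    if (na == (size_t)-1 || nb == (size_t)-1)
        return 0;
    return na == nb && memcmp(wa, wb, na * sizeof(uint16_t)) == 0;
}
```

Both functions give the same answer for ASCII keys, but the second one has to
touch every byte of both keys and write twice as much data before the actual
comparison even starts.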

I'm still unclear what advantage adding yet another external library
dependency gains Postgres in this area. The bulk of the difficulties seem to
be on the user interface side where it's unclear how to let users control this
functionality. It seems like the actual mechanics of sorting in various
locales can be handled using standard libc i18n functions.
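
For instance, a minimal sketch of locale-aware sorting with nothing but
standard C (setlocale, strcoll, qsort). The function name sort_in_locale is
made up for this example, and which locale names are accepted depends on what
is installed on the machine:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order two strings by the current LC_COLLATE rules. */
static int cmp_collate(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Sort an array of strings using the collation rules of the named locale.
 * Returns 0 on success, -1 if the C library rejects the locale name.
 * Note that this changes the process-wide LC_COLLATE setting and does not
 * restore it. */
int sort_in_locale(const char *locale_name, const char **items, size_t n)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(items, n, sizeof items[0], cmp_collate);
    return 0;
}
```

In the "C" locale strcoll behaves like strcmp, so uppercase letters sort
before lowercase ones; with a language-specific locale name the same call
sorts by that locale's rules instead.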

The one issue people have raised is that traditional libc functions require
switching a global state between locales and not all implementations support
that well. But depending on a single non-standard extension seems better than
depending on a huge external library, especially when the only consequence of
that extension being missing is that performance will suffer in a case
Postgres currently doesn't handle at all.
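
Roughly what I have in mind, as a sketch only. I'm assuming the extension is
something along the lines of glibc's newlocale/strcoll_l; HAVE_STRCOLL_L is a
hypothetical configure-time macro and compare_in_locale is a made-up name:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare a and b under the collation rules of locale_name.
 * Stores the strcoll-style result in *result.
 * Returns 0 on success, -1 if the locale cannot be used. */
int compare_in_locale(const char *locale_name, const char *a, const char *b,
                      int *result)
{
#ifdef HAVE_STRCOLL_L
    /* Fast path: a per-call locale object, no global state touched.
     * A real implementation would cache the locale object rather than
     * creating one for every comparison. */
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    *result = strcoll_l(a, b, loc);
    freelocale(loc);
    return 0;
#else
    /* Slow path: save the global LC_COLLATE setting, switch, compare,
     * and switch back. Same answer, just more expensive. */
    const char *cur = setlocale(LC_COLLATE, NULL);
    char *saved;

    if (cur == NULL)
        return -1;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);  /* copy: later setlocale calls may overwrite cur */

    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        free(saved);
        return -1;
    }
    *result = strcoll(a, b);
    setlocale(LC_COLLATE, saved);
    free(saved);
    return 0;
#endif
}
```

Callers see the same interface either way; a platform without the extension
just pays for two setlocale calls per comparison.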

-- 
greg