Re: Mixing different LC_COLLATE and database encodings

Martijn van Oosterhout <kleptog@xxxxxxxxx> · Tue, 21 Feb 2006 00:11:33 +0100

On Mon, Feb 20, 2006 at 05:30:06PM -0500, Greg Stark wrote:
> Martijn van Oosterhout <kleptog@xxxxxxxxx> writes:
> > By way of example, see ICU which is an internationalisation library
> > we're considering to get consistent locale support over all platforms.
> > It supports one encoding, namely UTF-16. It has various functions to
> > convert other encodings to or from that, but internally it's all
> > UTF-16. So if we do use that, then all encodings (except native UTF-16)
> > will need conversion all the time, so you don't buy anything by
> > having the server in some random encoding.
> 
> Ugh. At least from my perspective that makes it a non-starter. As I'm sure you
> realize storage density is a major factor, often the dominant factor, in
> database performance. Anything that would double the storage size for ASCII
> foreign keys is going to be a terrible hit.
> 
> And having to do an ASCII->UTF-16 conversion for every foreign key constraint
> check would be nearly as bad. I know it's a simple conversion but compared to
> a simple strcmp in a critical code path it's going to increase cpu usage
> significantly.

I'm not sure why you're singling out foreign keys here, but one of the
motivations for this COLLATE stuff I'm working on is so you can declare
all the system catalogs as COLLATE 'C' and thus always use strcmp and
*only* in the case where the user explicitly says "I want this column
sorted using French rules" do we incur the overhead. So your example
would be fine.
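
As a rough sketch of that fast path (illustrative only, not the actual
patch; compare_with_collation is a made-up name, and plain strcoll()
stands in for whatever locale-aware comparison ends up being used):

```c
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 * Columns declared COLLATE 'C' (or 'POSIX') take the cheap bytewise path;
 * anything else pays for a locale-aware comparison.
 */
static int
compare_with_collation(const char *collation, const char *a, const char *b)
{
    if (collation == NULL ||
        strcmp(collation, "C") == 0 ||
        strcmp(collation, "POSIX") == 0)
        return strcmp(a, b);    /* fast path: no locale machinery at all */

    /*
     * Stand-in for the slow path: strcoll() uses the process-wide
     * LC_COLLATE setting and ignores the requested collation name.
     */
    return strcoll(a, b);
}
```

With this shape, a foreign key check on a 'C' column costs exactly one
strcmp, same as now.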

If we switched to ICU now the overhead could be nasty. We need COLLATE
first.
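
To make the cost concrete, here is a minimal sketch of what an
ASCII->UTF-16 widening amounts to. This is a hand-rolled illustration
under my own assumptions (ascii_to_utf16 is a made-up name), not ICU's
converter API; in practice you would call ICU's own conversion functions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen an ASCII string to UTF-16 code units.
 * Returns the number of code units written, or (size_t) -1 if the input
 * contains a non-ASCII byte or does not fit in the destination.
 */
static size_t
ascii_to_utf16(const char *src, size_t len, uint16_t *dst, size_t dst_cap)
{
    if (len > dst_cap)
        return (size_t) -1;

    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];

        if (c > 0x7F)
            return (size_t) -1;     /* not plain ASCII */
        dst[i] = c;                 /* one byte in, two bytes out */
    }
    return len;
}
```

Each input byte becomes one 16-bit unit, so the converted copy is twice
the size, and the loop runs on every comparison unless the result is
cached somewhere.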

> I'm still unclear what advantage adding yet another external library
> dependency gains Postgres in this area. The bulk of the difficulties seem to
> be on the user interface side where it's unclear how to let users control this
> functionality. It seems like the actual mechanics of sorting in various
> locales can be handled using standard libc i18n functions.

How about consistency across platforms? Isn't that the reason we went
for an external timezone library rather than using the system one? How
about not knowing what encoding libc actually expects for strcoll? How
about supporting multiple collations within a single database (say
French and Russian)? For example, none of the BSDs or Mac OS X support
collations for UTF-8 locales. They're not complaining now but this
seems untenable for the future.

> The one issue people have raised is that traditional libc functions require
> switching a global state between locales and not all implementations support
> that well. But depending on a single non-standard extension seems better than
> depending on a huge external library. Especially when the consequences of that
> non-standard extension being missing is only that performance will suffer in a
> case Postgres currently doesn't handle at all.

The way I'm going at the moment is that ICU would be optional. Without
it *BSD would be limited to what we do now: one locale per DB, no
changes. Linux, Mac OS X and Win32 would be able to support multiple
locales, whatever their system supports. With ICU all platforms support
the entire range supported by it. If you don't like ICU, don't use it.

I'm not going to play games with calling setlocale() to keep changing
state. You saw how Perl reacted to us playing with it. Better we stop
using setlocale() altogether and go with newlocale() wherever possible.
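
For those who haven't used it, the newlocale() approach looks roughly
like this. It is a sketch assuming a POSIX.1-2008 libc that provides
newlocale(), strcoll_l() and freelocale(); compare_in_locale is a made-up
name, and real code would cache the locale_t rather than create it per
call.

```c
#ifndef __APPLE__
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <string.h>
#ifdef __APPLE__
#include <xlocale.h>            /* Mac OS X declares the *_l functions here */
#endif

/*
 * Hypothetical helper: compare a and b under the named locale without
 * touching the process-wide locale set by setlocale().
 * Returns 0 and stores the comparison in *result on success,
 * or -1 if the locale cannot be created on this system.
 */
static int
compare_in_locale(const char *locale_name, const char *a, const char *b,
                  int *result)
{
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t) 0);

    if (loc == (locale_t) 0)
        return -1;              /* locale not available */

    *result = strcoll_l(a, b, loc);
    freelocale(loc);
    return 0;
}
```

Two such locale objects (say one French, one Russian) can coexist in the
same backend, and nothing else in the process, embedded Perl included,
sees a locale change.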

The chance that ICU will be installed on your system grows by the day.
The facilities provided by ICU are so far ahead of what libc provides
I'm not sure it's sensible to compare them.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@xxxxxxxxx>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.