Ben wrote:
On Fri, 16 May 2008, Scott Marlowe wrote:
Well, I'm guessing that you aren't in locale=C and that the text
Correct, I am not. And my understanding is that by moving to the C
locale, I would loose utf8 validation, so I don't want to go there.
Though, it's news to me that I would get any kind of select performance
boost with locale=C. Why would it help?
As far as I know the difference is that in the "C" locale PostgreSQL can
use simple byte-ordinal-oriented rules for sorting, character access,
etc. It can ignore the possibility of a character being more than one
byte in size. It can also avoid having to consider pairs of characters
where the ordinality of the numeric byte value of the characters is not
the same as the ordinality of the characters in the locale (ie they
don't sort in byte-value order).
If I've understood it correctly ( I don't use "C" locale databases
myself and I have not tested any of this ) that means that two UTF-8
encoded strings stored in a "C" locale database might not compare how
you expect. They might sort in a different order to what you expect,
especially if one is a 2-byte or more char and the other is only 1 byte.
They might compare non-equal even though they contain the same sequence
of Unicode characters because one is in a decomposed form and one is in
a precomposed form. The database neither knows the encoding of the
strings nor cares about it; it's just treating them as byte sequences
without any interest in their meaning.
If you only ever work with 7-bit ASCII, that might be OK. Ditto if you
never rely on the database for text sorting and comparison.
Someone please yell at me if I've mistaken something here.
--
Craig Ringer