Re: very slow left join

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Sat, 17 May 2008 03:25:17 +0800

Ben wrote:
On Fri, 16 May 2008, Scott Marlowe wrote:

Well, I'm guessing that you aren't in locale=C and that the text

Correct, I am not. And my understanding is that by moving to the C 
locale, I would loose utf8 validation, so I don't want to go there. 
Though, it's news to me that I would get any kind of select performance 
boost with locale=C. Why would it help?

As far as I know the difference is that in the "C" locale PostgreSQL can 
use simple byte-ordinal-oriented rules for sorting, character access, 
etc. It can ignore the possibility of a character being more than one 
byte in size. It can also avoid having to consider pairs of characters 
where the ordinality of the numeric byte value of the characters is not 
the same as the ordinality of the characters in the locale (ie they 
don't sort in byte-value order).

If I've understood it correctly ( I don't use "C" locale databases 
myself and I have not tested any of this ) that means that two UTF-8 
encoded strings stored in a "C" locale database might not compare how 
you expect. They might sort in a different order to what you expect, 
especially if one is a 2-byte or more char and the other is only 1 byte. 
They might compare non-equal even though they contain the same sequence 
of Unicode characters because one is in a decomposed form and one is in 
a precomposed form. The database neither knows the encoding of the 
strings nor cares about it; it's just treating them as byte sequences 
without any interest in their meaning.

If you only ever work with 7-bit ASCII, that might be OK. Ditto if you 
never rely on the database for text sorting and comparison.

Someone please yell at me if I've mistaken something here.

--
Craig Ringer