Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

Dmitriy Igrishin <dmitigr@xxxxxxxxx> · Wed, 29 Aug 2012 23:14:04 +0400

2012/8/29 Merlin Moncure <mmoncure@xxxxxxxxx>

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@xxxxxxxxxx> wrote:
> On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
>> > citext unfortunately doesn't allow for index optimization of LIKE
>> > queries, which IMNSHO defeats the whole purpose.  So the best way
>> > remains to use lower() ...
>> > this will be index optimized and fast as long as you specified C
>> > locale for your database.
>>
>> What is the difference between C and en_US.UTF8, please?  We see that
>> the same query (that invokes a sort) runs 15% faster under the C
>> locale.  The output between C and en_US.UTF8 is identical.  We're
>> considering moving our database from en_US.UTF8 to C, but we do deal
>> with internationalized text.
>
> Well, C has reduced overhead for string comparisons, but obviously
> doesn't work well for international characters.  The single-byte
> encodings have somewhat less overhead than UTF8.  You can try using C
> locales for databases that don't require non-ASCII characters.
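To make the "doesn't work well for international characters" point
concrete, here is a minimal sketch of what byte-wise ordering does to
mixed-case and accented text. It only models the C locale's behaviour
by sorting on raw UTF-8 bytes; the helper name c_locale_sort and the
sample words are made up for illustration, and this is not how
PostgreSQL itself is implemented.

```python
# Sample data: mixed case plus one accented word (illustrative only).
words = ["zebra", "Apple", "éclair", "apple", "Zoo", "eclair"]


def c_locale_sort(items):
    """Sort strings by their raw UTF-8 bytes.

    This mimics a plain byte-by-byte comparison, which is what the
    C (POSIX) locale amounts to: no case folding, no accent handling.
    """
    return sorted(items, key=lambda s: s.encode("utf-8"))


# Uppercase ASCII sorts before lowercase ASCII, and every non-ASCII
# character sorts after all ASCII characters.
print(c_locale_sort(words))
```

A linguistic collation such as en_US.UTF8 would typically keep the two
"apple" spellings together and place "éclair" next to "eclair". If the
data happens to be same-case ASCII, the two orderings can coincide,
which may be why the output looked identical in the test above.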

To add:

The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale.  This gives you the ability to
store any Unicode, but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column.  This is
an even more attractive option in 9.1 with the ability to specify
collations at runtime.

Good point! Thanks!
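
For anyone wondering why byte-wise ordering helps partial string
searches: with keys sorted by raw bytes, a prefix match such as
lower(name) LIKE 'smi%' can be answered as a range scan between 'smi'
and 'smj'. Below is a rough sketch of that idea using a sorted list in
place of a B-tree. The function names and the sample data are
hypothetical, and this is only a conceptual model, not PostgreSQL code.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    """Return an exclusive upper bound for all byte strings that start
    with `prefix`, or None if no finite bound exists (empty or all 0xff)."""
    trimmed = prefix.rstrip(b"\xff")  # a trailing 0xff cannot be incremented
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_search(sorted_keys, prefix):
    """Find all keys starting with `prefix` using two binary searches
    over a list sorted in byte order (a stand-in for an index scan)."""
    lo = bisect_left(sorted_keys, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_keys) if upper is None else bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


# Stand-in for an index on lower(name): lowercased keys in byte order.
names = ["Smith", "smart", "smith", "Smythe", "smithson", "snow"]
index = sorted(n.lower().encode("utf-8") for n in names)

print(prefix_search(index, b"smi"))
```

The range trick relies on the keys being ordered exactly like their
bytes. Under a linguistic collation, strings sharing a prefix are not
guaranteed to be contiguous in the index, so the same shortcut does not
apply directly.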

-- 
// Dmitriy.