DB encoding, locale and indexes

Sterfield <sterfield@xxxxxxxxx> · Thu, 5 Feb 2015 09:54:36 +0100

Hi everyone,

I'm a sysadmin working on an application that stores all its data in a PostgreSQL database.
Currently, the cluster has its encoding set to UTF-8, and the locale (both LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.

I recently discovered that the indexes created on varchar columns are not used for the LIKE operator, because they were created without the appropriate operator class (as described in http://www.postgresql.org/docs/9.2/static/indexes-types.html).

The most straightforward solution would be to create a second index on the same column, this time with the operator class, so that one index serves the =, >, < operators and the other serves the LIKE / regexp operators. A few additional indexes, some disk space consumed, problem solved.

However, some people say that the indexes do not need to change at all, and that the only thing we have to do is set the LC_COLLATE of each database to 'C', so that the existing indexes work for LIKE without the operator class.

So we would end up with a database with UTF-8 encoding, LC_CTYPE set to 'en_US.UTF-8' and LC_COLLATE set to 'C'.

This configuration seems really odd to me, which is why I'm asking for your help here.
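To illustrate what LC_COLLATE = 'C' would mean for sort order, here is a minimal Python sketch. It assumes that the 'C' collation compares strings byte by byte on their UTF-8 encoding; the sample words and the helper name c_collation_key are made up for the example and do not come from our database.

```python
# Hypothetical sample values, not taken from the real database.
words = ["apple", "Banana", "cherry", "école", "zebra"]


def c_collation_key(s: str) -> bytes:
    # Assumption: LC_COLLATE = 'C' orders strings by their raw bytes,
    # which for a UTF-8 database means the UTF-8 encoding.
    return s.encode("utf-8")


c_sorted = sorted(words, key=c_collation_key)
print(c_sorted)
# ['Banana', 'apple', 'cherry', 'zebra', 'école']
```

Under this assumption, uppercase ASCII letters sort before lowercase ones and accented letters sort after all ASCII letters, which is not the order 'en_US.UTF-8' would give for ORDER BY.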

A few questions:
1. Is this even possible? (The documentation seems to answer 'yes', according to http://www.postgresql.org/docs/current/static/multibyte.html)
2. If I have Unicode characters stored in my database (for example 'é') and LC_COLLATE is set to 'C', how will the index behave if I run a query with LIKE '<something>é%'?
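To make the second question concrete, here is a minimal Python sketch of how I picture a prefix LIKE being answered from a byte-ordered index. It is only a model under stated assumptions, not PostgreSQL's actual implementation: the index is a sorted list of UTF-8 byte strings, and LIKE 'prefix%' is treated as the range prefix <= value < next byte string after the prefix. The function names and sample values are hypothetical.

```python
import bisect
from typing import List, Optional


def prefix_upper_bound(prefix: bytes) -> Optional[bytes]:
    # Smallest byte string that sorts after every string starting with
    # `prefix`: bump the last byte that is not 0xFF and drop what follows.
    p = bytearray(prefix)
    while p:
        if p[-1] < 0xFF:
            p[-1] += 1
            return bytes(p)
        p.pop()
    return None  # prefix was empty or all 0xFF: no upper bound


def like_prefix(sorted_keys: List[bytes], prefix: str) -> List[str]:
    # Model of LIKE 'prefix%' as a range scan over a byte-ordered index.
    lo_key = prefix.encode("utf-8")
    hi_key = prefix_upper_bound(lo_key)
    lo = bisect.bisect_left(sorted_keys, lo_key)
    hi = len(sorted_keys) if hi_key is None else bisect.bisect_left(sorted_keys, hi_key)
    return [k.decode("utf-8") for k in sorted_keys[lo:hi]]


# Hypothetical column values; 'é' is two bytes (0xC3 0xA9) in UTF-8.
values = ["cafe", "café", "cafés", "caféine", "cafz", "cage"]
index = sorted(v.encode("utf-8") for v in values)

matches = like_prefix(index, "café")
print(matches)
# ['café', 'caféine', 'cafés']
```

In this model the multi-byte 'é' causes no trouble, because every value starting with the prefix shares the same leading bytes and therefore sits in one contiguous range of the byte-ordered list. Whether a real index under LC_COLLATE = 'C' behaves the same way is exactly what I would like confirmed.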
Many thanks for your help.
Guillaume.