Search Postgresql Archives

Re: DB encoding, locale and indexes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



2015-02-05 15:56 GMT+01:00 Tom Lane <tgl@xxxxxxxxxxxxx>:
Sterfield <sterfield@xxxxxxxxx> writes:
> I'm a sysadmin working for an application that stores all its data in a PG
> database.
> Currently, the cluster has its encoding set to UTF-8, and the locale (both
> LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.

> I discovered recently that the indexes created on varchar fields are not
> working for LIKE operator, as they are created without the correct class
> operator (as stated in
> http://www.postgresql.org/docs/9.2/static/indexes-types.html).

Right, because en_US.UTF-8 uses dictionary sort order rather than plain
byte-by-byte sort.

> The most straight-forward solution would be to create a second index on the
> same field but with the class operator, in order to have indexes both for
> =, >, < operators AND LIKE / regexp operators. Few additional indexes, some
> diskspace eaten, problem solved.

Yup.

> However, some people are saying that nothing has to change on the index,
> and that the only thing we have to do is to change the LC_COLLATE of each
> databases to 'C', in order for the indexes to work without the class
> operator.

Yes, that is another possible solution, and it's documented.  Keep in mind
though that you can *not* just reach into pg_database and tweak those
fields; if you did, all your indexes would be corrupt, because they'd no
longer match the sort order the system is expecting.  The only safe way to
get there would be to dump and reload into a new database set up this way.
(If you wanted to live dangerously, I guess you could manually tweak the
pg_database fields and then REINDEX every affected index ... but this
still involves substantial downtime, and I would not recommend doing it
without practicing on a test installation.)

Yeah, I'll not take the risk. For current databases, I'll probably create manually new indexes. For new environment, I'll change the LC_COLLATE to 'C'. I've spent some time re-creating a test environment, using encoding to UTF8, locale to 'en_US.UTF-8' except LC_COLLATE set to 'C'. Nothing special to report, the index is working as expected for LIKE operators, and I have correct answers if I'm doing a LIKE '<something>é%'.


You also have to ask whether any of your applications are expecting ORDER
BY some-text-field to produce dictionary order rather than ASCII order.

Indeed, the order of the results is not the same with a LC_COLLATE to 'en_US.UTF-8' or LC_COLLATE to 'C', but I highly doubt that the application is taking advantage of having an index already sorted.


>    - If I have unicode character stored in my database (for example 'é'),
>    and the LC_COLLATE set to 'C', how the index will behave if I do a query
>    with LIKE '<something>é%' ?

It's still the same character, but it will sort in a possibly unexpected
way.

                        regards, tom lane

Many thanks for your help, guys, especially on this non-trivial subject (at least, for me).

Cheers,

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux