Re: Ways to change a database collation with removing duplicates

Erik Wienhold <ewie@xxxxxxxxx> · Sat, 11 Mar 2023 20:20:12 +0100 (CET)

> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@xxxxxxxxx> wrote:
>
> Could anyone suggest to me the ways to change a database collation with
> removing all the duplicates, caused by this change?

Collations can only affect uniqueness if they are nondeterministic or if you
have functional indexes, e.g. using lower(text) for a case-insensitive unique
index.  Otherwise the collations only affect text ordering.

You need to find all duplicates in the original database (or a copy) by using
the target collation for comparisons[0].  Delete all duplicates, pg_dump that
database and restore in database with new collation.

> I have a pretty large database (around 500 Gb) that was created with
> `en_US.UTF-8` collation, but the new version of the application requires
> that the collation should be strictly `C`.

Both collations should be deterministic[1] which means they consider byte
sequences when comparing strings for equality.  What does pg_collation say?

	select * from pg_collation where collname in ('C', 'en_US.utf8')

> I can successfully create a dump of the old database using `pgdump`.
>
> But when I'm importing the dump to the new database with `COLLATE=C`, I see
> a lot of errors on ALTER TABLE when creating primary keys, and the same -
> for `CREATE INDEX` commands:

Restore the dump in a new database with same collation en_US.UTF-8 to rule out
that the duplicates come from collation C and do not already exist in the
original database.

> ALTER TABLE
> ERROR:  could not create unique index "access_tokens_pkey"
> DETAIL:  Key (id)=(16734) is duplicated.

Is column "id" an integer?  If so then this looks like a corrupt index in the
original database because collations should not affect indexes on non-text
columns.

> ERROR:  could not create unique index "access_tokens_token_key"
> DETAIL:  Key (token)=(XXX) is duplicated.
> CONTEXT:  parallel worker
> ERROR:  could not create unique index "account_data_uniqueness"
> DETAIL:  Key (user_id, account_data_type)=(@username:XXX, im.vector.setting.breadcrumbs) is duplicated.
> CREATE INDEX
> ERROR:  could not create unique index "e2e_cross_signing_keys_idx"
> DETAIL:  Key (user_id, keytype, stream_id)=(@-----------------:matrix.org (http://matrix.org), master, 1606172) is duplicated.
> ERROR:  could not create unique index "e2e_cross_signing_keys_stream_idx"
> DETAIL:  Key (stream_id)=(1779009) is duplicated.
>
> So, could anyone give some advice on how to perform the collation change
> with cleaning out all the duplicates? Thanks!

[0] https://www.postgresql.org/docs/current/indexes-collations.html
[1] https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC

--
Erik