Re: Mixed Locales and Upgrading

Don Seiler <don@xxxxxxxxx> · Tue, 17 Mar 2020 08:06:52 -0500

On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

Egad.

My thoughts exactly.

Well, in principle you could likewise manually update pg_database's

datcollate and datctype columns to say "en_US.utf8".  However, there's

a much bigger problem here --- what steps if any did this cowboy take

to ensure that the data inside the database was valid UTF8?

No steps that I've seen from the chat history I've been able to search. I'm not sure if there was an (invalid) assumption that LATIN1 is a subset of UTF-8 or if it was done in a panic to get the import/update working years ago.

I don't think you should use pg_upgrade here at all.  A dump/restore

is really the only way to make sure that you have validly encoded data.

That is what I thought, and probably not what they'll want to hear given the downtime involved. Even with parallel dump/restore jobs, I imagine it will take quite a while (this first DB is almost 900GB).

However, if it's only one database out of a bunch, you could do something

like

* pg_dump that one database;

* drop said database;

* pg_upgrade everything else;

* restore that one database from dump.

In the case of this busy cluster, the layout is like this:

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 foooo_all | postgres | UTF8     | en_US      | en_US      |
 postgres  | postgres | LATIN1   | en_US      | en_US      | =Tc/postgres         +
           |          |          |            |            | postgres=CTc/postgres
 template0 | postgres | LATIN1   | en_US      | en_US      | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
(4 rows)

So the template1 DB was dropped and recreated with the collate and ctype as well and then set to be a template again. But I believe that was well after foooo_all was changed (and so probably no need for the template1 change). In this case, if this is the only DB in the cluster, would it make sense to just create a new one as en_US.UTF-8 and then restore a dump of foooo_all into a pre-created en_US.UTF-8 DB?

We have a few other mixed environments similar to this as well. Some have postgres and both template DBs with this same UTF8/en_US/en_US configuration.
Is logical replication an option here? If the target DB were setup as en_US.UTF-8 across the board, would logical replication safely replicate and convert the data until we could then cut over?

Thanks,
Don.
-- 
Don Seiler
www.seiler.us