Hello everyone. I am working on a database migration from PostgreSQL 10, hosted in a Stolon cluster on Kubernetes (yes, I know, that's asking for trouble), to a managed AWS RDS PostgreSQL 15 database. The only viable approach I found for this migration is logical replication, as we can't afford much downtime and the database is quite big (around 2 TB).

I ran many tests on multiple non-production environments and got some strange results. On our staging platform, for example, I have so far been unable to complete the migration: I hit many EOF errors during WAL streaming. I then raised wal_sender_timeout and wal_receiver_timeout when I tested the migration against the production database, and that part of the problem now looks solved. (I will have to retest it on staging; I tested against production in case the issue was specific to staging.) Two other environments completed the migration correctly, with the same schema and without the WAL timeout tweaks: one around 800 GB, like staging, and the other much smaller, around 80 GB.

Apart from the timeout issue, which looks solved so far, I have a problem with our biggest table (a link table for a many-to-many relation). It is the one that usually failed on staging and is now also failing in the production test migration.
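For reference, the timeout changes were along these lines (the values here are illustrative, not necessarily the exact ones I ended up using):

```sql
-- On the publisher (Stolon / PostgreSQL 10): give the walsender more slack
-- before it considers the connection dead and drops it.
ALTER SYSTEM SET wal_sender_timeout = '5min';
SELECT pg_reload_conf();

-- On the subscriber (RDS PostgreSQL 15): the matching receiver-side timeout.
-- On RDS this has to go through the DB parameter group, not ALTER SYSTEM:
--   wal_receiver_timeout = 300000   (milliseconds)
```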
Here is the information about the table in question:

back=> \d+ diagnostics_episodes
                                    Table "public.diagnostics_episodes"
    Column     |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
---------------+---------+-----------+----------+---------+---------+--------------+-------------
 diagnostic_id | bigint  |           |          |         | plain   |              |
 episode_id    | bigint  |           |          |         | plain   |              |
 is_deleted    | boolean |           | not null | false   | plain   |              |
Indexes:
    "diagnostics_episodes_unique" UNIQUE, btree (diagnostic_id, episode_id)
    "index_diagnostics_episodes_on_episode_id" btree (episode_id)
Check constraints:
    "diagnostics_episodes_diagnostic_id_null" CHECK (diagnostic_id IS NOT NULL)
    "diagnostics_episodes_episode_id_null" CHECK (episode_id IS NOT NULL)
Foreign-key constraints:
    "fk_rails_11c88d28cf" FOREIGN KEY (episode_id) REFERENCES episodes(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fk_rails_5857fe47a7" FOREIGN KEY (diagnostic_id) REFERENCES diagnostics(id) ON UPDATE CASCADE ON DELETE CASCADE
Replica Identity: FULL
Access method: heap

The table size is around 279 GB (from \d). We had to use REPLICA IDENTITY FULL on this table because setting the replica identity to the unique index would require downtime to add NOT NULL constraints on the two FK columns.

And this is the current state of the replication: all tables are in 'r' (ready) state except this single table, which is still stuck in 'f' (finished copy) state. It also looks like it is blocking the main replication slot of the logical replication?
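In case anyone wants to reproduce the check: the per-table sync states come from pg_subscription_rel on the subscriber, with something like:

```sql
-- srsubstate: 'i' = initializing, 'd' = data copy in progress,
-- 'f' = finished table copy, 's' = synchronized, 'r' = ready
SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
```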
(I restarted PostgreSQL not long ago to check whether that would unblock it; no luck.)

SUBSCRIBER:

back=> SELECT pss.relid, pss.relid::regclass AS obj, pss.pid, pss.latest_end_time,
              pg_size_pretty(pspc.bytes_processed) AS data_copied, pspc.tuples_processed,
              NOW() - pss.last_msg_receipt_time AS age
       FROM pg_stat_subscription AS pss
       LEFT JOIN pg_stat_progress_copy AS pspc ON pss.pid = pspc.pid;

 relid |         obj          | pid  |        latest_end_time        | data_copied | tuples_processed |       age
-------+----------------------+------+-------------------------------+-------------+------------------+-----------------
       |                      | 6346 | 2023-06-09 08:39:23.523942+00 |             |                  | 00:09:57.941278
 16923 | diagnostics_episodes | 6347 | 2023-06-09 08:39:23.87963+00  |             |                  | 00:08:31.553164
(2 rows)

PUBLISHER:

production=# SELECT prs.slot_name, prs.slot_type, prs.database, psa.wait_event, psr.state,
                    prs.temporary, prs.active, prs.active_pid, prs.restart_lsn, prs.confirmed_flush_lsn
             FROM pg_replication_slots AS prs
             LEFT JOIN pg_stat_activity AS psa ON prs.active_pid = psa.pid
             LEFT JOIN pg_stat_replication AS psr ON psa.pid = psr.pid
             WHERE prs.slot_type = 'logical';

                slot_name                | slot_type |  database  |     wait_event     |  state  | temporary | active | active_pid | restart_lsn  | confirmed_flush_lsn
-----------------------------------------+-----------+------------+--------------------+---------+-----------+--------+------------+--------------+---------------------
 azure_to_aws                            | logical   | production | WalSenderWriteData | catchup | f         | t      |      21534 | A6F/1EF5C210 | A6F/1F48F728
 pg_17845_sync_16923_7225826535892257639 | logical   | production | WalSenderWriteData | catchup | f         | t      |      21537 | A6D/22E49068 | A6D/22EDF188
(2 rows)

(The restart_lsn and confirmed_flush_lsn are not changing over time.)

production=# SELECT prs.slot_name, prs.active, psr.state, prs.active_pid,
                    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), prs.confirmed_flush_lsn)) AS diff_size
             FROM pg_replication_slots AS prs
             LEFT JOIN pg_stat_replication AS psr ON prs.slot_name = psr.application_name
             WHERE prs.slot_type = 'logical';

                slot_name                | active |  state  | active_pid | diff_size
-----------------------------------------+--------+---------+------------+-----------
 azure_to_aws                            | t      | catchup |      21534 | 2855 MB
 pg_17845_sync_16923_7225826535892257639 | t      | catchup |      21537 | 11 GB
(2 rows)

Something I can see in the subscriber's monitoring is a lot of read activity (I can't get more detail since it's RDS, so no strace, etc.): it is reading at around 400 MB/s non-stop (and it stops if I disable the subscription and terminate the replication process), but I don't see any writes going through. I don't see any errors in the logs on either side; on the subscriber I have debug5-level logging enabled and I don't see any transaction being applied by the replication worker either. In pg_stat_activity I see the logical replication worker for this table always active, sometimes in IO:DataFileRead; the main replication worker, however, is always "idle", waiting on IPC:LogicalSyncStateChange.

I am not sure what to do next. I have been trying to troubleshoot this for a week with no luck. I also got some help on the PostgreSQL Slack server, but we are not 100% sure what the issue could be either ( https://postgresteam.slack.com/archives/C0FS3UTAP/p1686041198195869 ).

On the loadtest environment I migrated first, the NOT NULL constraints were set on the columns of this table (but still with REPLICA IDENTITY FULL) and the migration worked. I tried removing the NOT NULL constraints and redoing the migration, and it still worked, so I am not sure the problem is related to that on the other environments. What else could cause this?

Thank you.
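In case it helps anyone dig into the same numbers: pg_wal_lsn_diff() can also be pointed at the individual slot LSNs to confirm whether the table-sync slot advances at all between samples, e.g.:

```sql
-- Run on the publisher a few minutes apart; if restart_lsn and
-- confirmed_flush_lsn of the pg_%_sync_% slot never advance, the sync
-- worker is not confirming anything back to the walsender.
SELECT slot_name,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS restart_lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```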