On 5/31/22 11:46 AM, Jeff Ross wrote:
Hello,
We have a logically replicated table on RDS that is 39 G in size on
both the publisher (10.21) and the subscriber (12.8).
The replication slots on the publisher are all marked as active and
the lsns are current so no lag.
Other tables on the subscriber side are also identical in size and
have no problem with queries against them.
We did a vacuum full on the subscriber table and the size dropped to a
couple hundred MBs and is growing but still has no visible rows.
Now on the publisher I see a temporary replication slot so I'm
guessing that in reality replication had not finished and that's why
the load_events table was full sized but had no visible rows?
At this point I guess my questions are does my hunch that replication
hadn't completely finished is the root cause of what we saw make sense?
I've always used both the lsns and the lack of temporary replication
slots to tell when logical replication had finished--is there a more
authoritative way to do so?
Thanks,
Jeff
As a follow up for the archives...
It became apparent the longer we looked that logical replication in fact
had not finished and indeed never did finish. On both an EC2 server and
an RDS server we were missing at least one table that by table size
appeared to be identical to the publisher's table and yet had no visible
rows. On the publisher I would occasionally see temporary replication
slots but they would go away in a few minutes.
I then dropped the subscription on the RDS server and immediately
restarted it with (copy_data = False). As expected we immediately saw
rows being inserted into that massive table but it still showed only
those rows. I have a "backfill" process to sync up logically replicated
tables and that ran but very, very slowly. Finally while syncing that
big table it stalled completely. Network monitoring showed about
100Kb/s traffic to the RDS server. That is not a typo.
Our client spun up another RDS server, imported all the settings from
the previous one and I setup the database and imported the schemas. As
soon as we started replication we were seeing network transfers in the
45Mb/s range. Replication finished with all tables intact in under 5
hours for 420G.
We changed nothing on our side so whatever was causing the glacial data
transfer was on AWS side.
Jeff