On Sun, Feb 25, 2024 at 1:11 PM Mike Lissner <mlissner@xxxxxxxxxxxxxxxxxxxxx> wrote:
Sorry, two more little things here. The publisher logs add much, but here's what we see:STATEMENT: START_REPLICATION SLOT "pg_20031_sync_17418_7324846428853951375" LOGICAL F1D0/346C6508 (proto_version '2', publication_names '"compass_publication2"')
ERROR: replication slot "pg_20031_sync_17402_7324846428853951375" does not existAnd I thought that maybe there'd be some magic in the REFRESH command on the subscriber, so I tried that:alter subscription xyz refresh publication;
To nobody's surprise, that didn't help. :)On Sun, Feb 25, 2024 at 10:00 AM Mike Lissner <mlissner@xxxxxxxxxxxxxxxxxxxxx> wrote:Hi, I set up logical replication a few days ago, but it's throwing some weird log lines that have me worried. Does anybody have experience with lines like the following on a subscriber:LOG: logical replication table synchronization worker for subscription "compass_subscription", table "search_opinionscitedbyrecapdocument" has startedERROR: could not start WAL streaming: ERROR: replication slot "pg_20031_sync_17418_7324846428853951375" does not existLOG: background worker "logical replication worker" (PID 1014) exited with exit code 1Slots with this kind of name (pg_xyz_sync_*) are created during the initial sync, but it seems like the subscription is working based on a quick look in a few tables.I thought this might be related to running out of slots on the publisher, so I increased both max_replication_slots and max_wal_senders to 50 and rebooted so those would take effect. No luck.I thought rebooting the subscriber might help. No luck.When I look in the publisher to see the slots we have...SELECT * FROM pg_replication_slots;...I do not see the one that's missing according to the log lines.So it seems like the initial sync might have worked properly (tables have content), but that I have an errant process on the subscriber that might be stuck in a retry loop.I haven't been able to fix this, and I think my last attempt might be a new subscription with copy_data=false, but I'd rather avoid that if I can.Is there a way to fix or understand this so that I don't get the log lines forever and so that I can be confident the replication is in good shape?Thank you!Mike
Hi Mike
This happens when there is an error with another sync worker, PG has issues when this happens, the Logical Replication Main Worker attempts to sync another table, creates a slots, then tries to sync that table, then errors out as it is out of sync workers, then attempts to sync another table. the Subscriber is then stuck in a loop using up all the slots on the publisher,. The subscriber continues to cycle through all the tables that are not status r in the pg_subscription_rel table. Once there no slots available on the publisher the subscriber attempts to create slots, fails and returns the above errors.
These are not critical errors however it is difficult to clean up as every time starting and stopping the subscriber it immediately retries to initial sync the tables it has slots for. So it keeps using up all the publisher slots.
You need to look through the logs looking for the initial error that kicked this off, it can be a duplicate key or the subscriber or publisher disconnected due to wal_reciever_timeout or wal_sender_timerout.
I typical disable all the timeouts as it does not take much to cause the subscriber and publisher to think the connection has timed out.
A tell tale sign of this is the pg_subscription_rel status will be d for a bunch of tables. There should only be few tables that are status d,
I only attempt to clean this up when working on big databases and a lot of tables have already been sync.
Need to figure out the initial cause of the problem, correct that error and attempt an initial sync again.
.