On Thu, 2023-02-09 at 13:41 -0500, Tim wrote:
> We have a cascading replication setup with multiple nodes, 2 of which are in a different
> cloud region for DR purposes.
>
> DRnode01
> DRnode02
>
> They are both standby nodes; DRnode01 only recovers from the archive and does not connect
> to any upstream node for streaming.  DRnode02 has the same restore_command, which points to
> that WAL archive, but it is also configured with primary_conninfo to stream from DRnode01.
> I'd like it to just WAL stream instead of using the archive, but it just ends up recovering
> slightly faster from the archive than DRnode01 can send WALs, so the logs end up being
> spammed with
>
> > 2023-02-09 13:32:00 EST [1277714]: [1-1] user=,db=,app=,client= LOG: started streaming WAL from primary at 74CC/B5000000 on timeline 33
> > 2023-02-09 13:32:00 EST [1277714]: [2-1] user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested starting point 74CC/B5000000 is ahead of the WAL flush position of
> > this server 74CC/B4FFFA08.
>
> pg_wal_replay_pause() does not work; it ends up in the same situation after resuming.
> Changing restore_command requires a restart, and turning it off altogether is not good
> for DR.
>
> I cannot get it out of this loop, and this has been a recurring issue for a while.
>
> Is there anything I can do to force it to WAL stream instead of recovering from the archive
> without removing the restore_command setting?

My idea:

- On DRnode01, set "archive_mode = always" and configure an "archive_command" that copies
  WAL segments to a directory shared with DRnode02 (e.g. via NFS); see the sketch in the
  P.S. below.

- DRnode02 has "primary_conninfo" that connects to DRnode01 and a "restore_command" that
  copies WAL segments from the shared directory.

- DRnode02 has an "archive_cleanup_command" that removes WAL segments that it no longer
  needs from the shared directory.

Yours,
Laurenz Albe
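
P.S.: A rough, untested sketch of these settings, assuming the shared directory is
mounted at /walshare on both nodes (the path, the connection string and the plain
"cp" commands are just placeholders for whatever tooling you actually use):

On DRnode01, in postgresql.conf:

    # archive on the standby as well, not only after promotion to primary
    archive_mode = always
    # copy completed WAL segments to the shared directory, skipping ones already there
    archive_command = 'test ! -f /walshare/%f && cp %p /walshare/%f'

On DRnode02, in postgresql.conf:

    # stream from DRnode01 (connection parameters are placeholders)
    primary_conninfo = 'host=DRnode01 port=5432 user=replicator'
    # restore from the shared directory instead of the central WAL archive
    restore_command = 'cp /walshare/%f %p'
    # remove segments from the shared directory once they are no longer needed
    archive_cleanup_command = 'pg_archivecleanup /walshare %r'

That way DRnode02 can only restore WAL that DRnode01 has already received, so it
should no longer be able to run ahead of DRnode01, and streaming can take over.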