Hi,

On 2017-07-27 13:00:17 +1000, James Sewell wrote:
> Hi all,
>
> I've got two servers (A,B) which are part of a streaming replication pair.
> A is the master, B is a hot standby. I'm sending archived WAL to a
> directory on A, B is reading it via SCP.
>
> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
> currently working to reproduce with standard 9.6)
>
> We have recently seen a situation where B does not catch up when taken
> offline for maintenance.
>
> When B is started we see the following in the logs:
>
> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=
> (0:00000)LOG: restored log file "0000000C0000005A000000B5" from
> archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=
> (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=
> (0:XX000)FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=
> (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=
> (0:XX000)FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> This will loop indefinitely. At this stage the master reports no connected
> standbys in pg_stat_replication, and the standby has no running WAL
> receiver process.
>
> This can be 'fixed' by running pg_switch_xlog() on the master, at which
> time a connection is seen from the standby and the logs show the following:
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B5" from archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client= (0:00000)LOG:
> started streaming WAL from primary at 5A/B5000000 on timeline 12
> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client= (0:XX000)FATAL:
> could not receive data from WAL stream: ERROR: requested WAL segment
> 0000000C0000005A000000B5 has already been removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B5" from archive
> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B6" from archive

FWIW, I don't see a bug here. Archiving on its own doesn't guarantee
that replication progresses in increments smaller than 16MB, unless you
use archive_timeout (or, as you do, manually switch segments). Streaming
replication doesn't guarantee that WAL is retained unless you use
replication slots - which you don't appear to be using. You can make SR
retain more with approximate methods like wal_keep_segments too, but
that's not a guarantee.

From what I can see you're just seeing the combination of these two
limitations, because you don't use the methods to address them
(archive_timeout, replication slots and/or wal_keep_segments).

Greetings,

Andres Freund

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
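
To illustrate the 16MB granularity mentioned above, here is a rough
sketch of how a WAL segment file name maps to a timeline and a range of
WAL positions. It assumes the default 16MB segment size; the helper
names (format_lsn, decode_segment) are made up for this example and are
not part of PostgreSQL.

```python
# Sketch only: decode a WAL segment file name into timeline and LSN range.
# Assumption: the server uses the default 16MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16MB segments


def format_lsn(pos):
    """Render a 64-bit WAL position in the hi/lo hex notation used in logs."""
    return "%X/%X" % (pos >> 32, pos & 0xFFFFFFFF)


def decode_segment(name):
    """Return (timeline, first LSN, last LSN) covered by a segment file name."""
    if len(name) != 24:
        raise ValueError("expected a 24-character segment name: %r" % name)
    timeline = int(name[0:8], 16)   # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)    # middle 8 hex digits: high part of the LSN
    seg_id = int(name[16:24], 16)   # last 8 hex digits: segment within log_id
    start = (log_id * SEGMENTS_PER_XLOGID + seg_id) * WAL_SEGMENT_SIZE
    end = start + WAL_SEGMENT_SIZE - 1
    return timeline, format_lsn(start), format_lsn(end)


if __name__ == "__main__":
    print(decode_segment("0000000C0000005A000000B5"))
```

For the segment named in the logs this gives timeline 12 and a start
position of 5A/B5000000, matching the "started streaming WAL from
primary at 5A/B5000000 on timeline 12" line. The archive only ever
delivers whole files like this one, so without archive_timeout or a
manual switch the standby cannot obtain a segment from the archive until
the master has finished it.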