Re: Interesting streaming replication issue

Andres Freund <andres@xxxxxxxxxxx> · Wed, 9 Aug 2017 15:08:11 -0700

Hi,

On 2017-07-27 13:00:17 +1000, James Sewell wrote:
> Hi all,
> 
> I've got two servers (A,B) which are part of a streaming replication pair.
> A is the master, B is a hot standby. I'm sending archived WAL to a
> directory on A, B is reading it via SCP.
> 
> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
> currently working to reproduce with standard 9.6)
> 
> We have recently seen a situation where B does not catch up when taken
> offline for maintenance.
> 
> When B is started we see the following in the logs:
> 
> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=
> (0:00000)LOG:  restored log file "0000000C0000005A000000B5" from
> archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=
> (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
> 
> scp: /archive/xlog//0000000D.history: No such file or directory
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=
> (0:00000)LOG:  started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
> 
> This will loop indefinitely. At this stage the master reports no connected
> standbys in pg_stat_replication, and the standby has no running WAL
> receiver process.
> 
> This can be 'fixed' by running pg_switch_xlog() on the master, at which
> time a connection is seen from the standby and the logs show the following:
> 
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B5" from archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client=  (0:00000)LOG:
>  started streaming WAL from primary at 5A/B5000000 on timeline 12
> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client=  (0:XX000)FATAL:
>  could not receive data from WAL stream: ERROR:  requested WAL segment
> 0000000C0000005A000000B5 has already been removed
> 
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B5" from archive
> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client=  (0:00000)LOG:
>  restored log file "0000000C0000005A000000B6" from archive
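
For reference, the segment file names in the log follow directly from the
timeline and the streaming start position. A minimal sketch in Python,
assuming the default 16MB segment size; the function name is hypothetical
and not part of any PostgreSQL tooling:

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size (assumed here)


def wal_segment_name(timeline: int, lsn: str) -> str:
    """Map a timeline ID and an LSN such as '5A/B5000000' to a WAL file name."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    # Segment number counted from the start of the WAL stream.
    segno = position // WAL_SEGMENT_SIZE
    # Number of segments covered by one value of the high 32 bits of the LSN.
    segments_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


# Timeline 12 and start position 5A/B5000000, as in the log above.
print(wal_segment_name(12, "5A/B5000000"))
```

With the values from the log, this yields the segment that the primary
reports as already removed.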

FWIW, I don't see a bug here. Archiving on its own doesn't guarantee
that replication progresses in increments smaller than 16MB (one WAL
segment), unless you use archive_timeout (or, as you do, switch segments
manually). Streaming replication doesn't guarantee that WAL is retained
unless you use replication slots, which you don't appear to be using.
You can make SR retain more with approximate methods like
wal_keep_segments too, but that's not a guarantee. From what I can see
you're just seeing the combination of these two limitations, because you
don't use the methods that address them (archive_timeout, replication
slots and/or wal_keep_segments).
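
As a rough sketch of what those settings could look like on 9.6 (the
numeric values and the slot name standby_b are assumptions chosen for
illustration, not recommendations):

```
# postgresql.conf on the master (A)
archive_timeout = 60          # switch to a new segment at least every 60s (assumed value)
wal_keep_segments = 64        # keep extra segments in pg_xlog; approximate, not a guarantee
max_replication_slots = 1     # must be > 0 before a slot can be created

# On the master, create a physical slot once (hypothetical slot name):
#   SELECT pg_create_physical_replication_slot('standby_b');

# recovery.conf on the standby (B)
primary_slot_name = 'standby_b'
```

Note that a slot makes the master retain WAL until the standby has
consumed it, so pg_xlog on the master can grow while the standby is
offline.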

Greetings,

Andres Freund
