Re: Dealing with latency to replication slave; what to do?

Jeff Janes <jeff.janes@xxxxxxxxx> · Tue, 24 Jul 2018 22:13:31 -0400

Please don't top-post, it is not the custom on this list.

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon <rfalloon@xxxxxxxxx> wrote:
On Tue, Jul 24, 2018 at 4:02 PM Andres Freund <andres@xxxxxxxxxxx> wrote:
Hi,

On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:

> Looking for any tips here on how to best maintain a replication slave which
> is operating under some latency between networks - around 230ms. On a good
> day/week, replication will keep up for a number of days, but however, when
> the link is under higher than average usage, keeping replication active can
> last merely minutes before falling behind again.
>
> 2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read only connections
> 2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at 2B/93000000 on timeline 1
> 2018-07-24 18:59:28 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:37 GMTLOG:  incomplete startup packet
>
> As you can see above, it lasted about half an hour before falling out of
> sync.

How can we see that from the above? The "incomplete startup messages"
are independent of streaming rep? I think you need to show us more logs.

regarding your first reply, I was inferring that from the fact I saw those messages at the same time the replication stream fell behind. What other logs would be more pertinent to this situation?

This is circular.  You think it lost sync because you saw some message you didn't recognize, and then you think the error message was related to it losing sync because they occurred at the same time.  What evidence do you have that it has lost sync at all? From the log file you posted, it seems the server is running fine and is just getting probed by a port scanner, or perhaps by a monitoring tool.

If it had lost sync, you would be getting log messages about "requested WAL segment has already been removed"
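
One way to check a standby log for this, as a rough sketch rather than a definitive tool: the Python below counts lines carrying the "requested WAL segment ... has already been removed" message separately from the "incomplete startup packet" noise. The function name classify_log and the loose regex are assumptions made for illustration; the exact wording around the segment name can differ between the primary's and the standby's logs.

```python
import re

# Message that signals the standby needs WAL the primary no longer has.
# The segment name in the middle varies, so it is matched loosely.
WAL_REMOVED = re.compile(r"requested WAL segment \S+ has already been removed")
# Message produced by clients that connect and disconnect without
# completing the startup handshake (port scanners, monitoring probes).
STARTUP_NOISE = "incomplete startup packet"


def classify_log(lines):
    """Count lost-sync messages separately from unrelated connection noise."""
    counts = {"wal_removed": 0, "startup_noise": 0, "other": 0}
    for line in lines:
        if WAL_REMOVED.search(line):
            counts["wal_removed"] += 1
        elif STARTUP_NOISE in line:
            counts["startup_noise"] += 1
        else:
            counts["other"] += 1
    return counts


# The log excerpt posted in this thread, with wrapped lines rejoined.
posted = [
    "2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read only connections",
    "2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at 2B/93000000 on timeline 1",
    "2018-07-24 18:59:28 GMTLOG:  incomplete startup packet",
    "2018-07-24 19:15:36 GMTLOG:  incomplete startup packet",
    "2018-07-24 19:15:36 GMTLOG:  incomplete startup packet",
    "2018-07-24 19:15:37 GMTLOG:  incomplete startup packet",
]
result = classify_log(posted)
print(result)
```

Run against the posted excerpt, the "wal_removed" count stays at zero, which matches the point above: nothing in that log shows the standby losing sync.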

Cheers,

Jeff

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon <rfalloon@xxxxxxxxx> wrote:
Hi Andres,
regarding your first reply, I was inferring that from the fact I saw those messages at the same time the replication stream fell behind. What other logs would be more pertinent to this situation?

On Tue, Jul 24, 2018 at 4:02 PM Andres Freund <andres@xxxxxxxxxxx> wrote:
Hi,

On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:

> Looking for any tips here on how to best maintain a replication slave which
> is operating under some latency between networks - around 230ms. On a good
> day/week, replication will keep up for a number of days, but however, when
> the link is under higher than average usage, keeping replication active can
> last merely minutes before falling behind again.
>
> 2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read only connections
> 2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at 2B/93000000 on timeline 1
> 2018-07-24 18:59:28 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:37 GMTLOG:  incomplete startup packet
>
> As you can see above, it lasted about half an hour before falling out of
> sync.

How can we see that from the above? The "incomplete startup messages"
are independent of streaming rep? I think you need to show us more logs.

> On the master, I have wal_keep_segments=128. What is happening when I see
> "incomplete startup packet" - is it simply the slave has fallen behind,
> and cannot 'catch up' using the wal segments quick enough? I assume the
> slave is using the wal segments to replay changes and assuming there are
> enough wal segments to cover the period it cannot stream properly, it will
> eventually recover?

You might want to look into replication slots to ensure the primary
keeps the necessary segments, but not more, around.  You might also want
to look at wal_compression, to reduce the bandwidth usage.

Greetings,

Andres Freund
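
For a sense of how much slack wal_keep_segments=128 gives, here is a back-of-the-envelope sketch in Python. It assumes the default 16 MB WAL segment size; the WAL generation rate and link throughput in the example are hypothetical numbers, not measurements from this thread, and the helper names are made up for illustration. The primary may keep more WAL than this minimum, so the result is a lower bound.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (16 MB)


def retained_wal_bytes(wal_keep_segments, segment_bytes=SEGMENT_BYTES):
    """Minimum amount of WAL the primary keeps for standbys, in bytes."""
    return wal_keep_segments * segment_bytes


def seconds_until_gap(retained_bytes, wal_rate, link_rate):
    """Seconds until the standby needs WAL that may already be recycled.

    wal_rate  -- bytes of WAL the primary generates per second
    link_rate -- bytes of WAL the link can deliver per second
    Returns None when the link keeps up and the backlog never grows.
    """
    deficit = wal_rate - link_rate
    if deficit <= 0:
        return None
    return retained_bytes / deficit


retained = retained_wal_bytes(128)
print(retained / 1024**3, "GiB retained at minimum")

# Hypothetical rates: 4 MiB/s of WAL generated, 2 MiB/s delivered.
MIB = 1024 * 1024
print(seconds_until_gap(retained, 4 * MIB, 2 * MIB), "seconds of slack")
```

With 128 segments the minimum retained WAL is 2 GiB. A replication slot removes this fixed ceiling by making the primary hold whatever the standby has not yet received, and wal_compression reduces the WAL volume that has to cross the link in the first place.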