Re: terminating walsender process due to replication timeout

AYahorau@xxxxxxxxxxx · Wed, 15 May 2019 10:04:12 +0300

Hello,

Thank you for the response.

Yes, it is possible to monitor replication delay, but my questions were
not about monitoring network issues.

I deliberately use wal_sender_timeout=1s because it allows me to detect
replication problems quickly.

So I need clarification on the following questions:

Is it possible to use exactly this configuration and be sure that
it will work properly?

What did I do wrong? Should I correct my configuration somehow?

Is this the same issue as the one mentioned here: https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx?
If so, why do I face this problem again?

Thank you in advance.

Best regards,

Andrei

From: Rene Romero Benavides <rene.romero.b@xxxxxxxxx>
To: AYahorau@xxxxxxxxxxx
Cc: Postgres General <pgsql-general@xxxxxxxxxxxxxx>
Date: 14/05/2019 20:12
Subject: Re: terminating walsender process due to replication timeout

To detect network issues, maybe you could monitor replication delay.
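One way to act on that suggestion, sketched under assumptions: on the primary, the pg_stat_replication view (PostgreSQL 10 and later) reports sent_lsn and replay_lsn for each standby, and the SQL function pg_wal_lsn_diff() returns the byte difference between two LSNs. The hypothetical Python helpers below (lsn_to_int, replication_lag_bytes, lag_exceeds) only reproduce that arithmetic on the textual LSN format "X/Y", where X and Y are the high and low 32 bits in hex. Fetching the values and choosing an alert threshold are left open.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes sent but not yet replayed (same arithmetic as pg_wal_lsn_diff)."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


def lag_exceeds(sent_lsn: str, replay_lsn: str, threshold_bytes: int) -> bool:
    """Hypothetical alert rule: True when the lag is larger than threshold_bytes."""
    return replication_lag_bytes(sent_lsn, replay_lsn) > threshold_bytes


if __name__ == "__main__":
    # Hypothetical values, as they might be read from pg_stat_replication.
    print(replication_lag_bytes("1/0", "0/FF000000"))  # 16777216 (16 MiB)
    print(lag_exceeds("1/0", "0/FF000000", 64 * 1024 * 1024))  # False
```

A byte-based lag check like this measures how far behind the standby is, which is a different signal from the connection-level timeout discussed in this thread.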

On Mon, May 13, 2019 at 6:42 AM <AYahorau@xxxxxxxxxxx> wrote:

Hello PostgreSQL Community!

I faced an issue on my Linux machine using PostgreSQL 11.3.

I have two nodes in the database cluster: a master and a standby.

I ran a large number of long-running queries, which caused the databases
to become desynchronized:

terminating walsender process due to replication timeout

Here is the output in debug mode: 

2019-05-13 13:21:33 FET 00000 DEBUG:  sending replication keepalive
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 LOG:  terminating walsender process due to replication timeout

The issue is reproducible. I configure a two-node cluster, download demo_small.zip
from https://edu.postgrespro.ru/, and run the following command:

psql -U user1 -f demo_small.sql db1 

and I get the behaviour described above.

I know that I can increase the wal_sender_timeout value to avoid this behaviour
(currently wal_sender_timeout is set to 1 second).
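A simplified model of why a 1-second timeout can trip under load. It assumes the walsender gives up once wal_sender_timeout passes with no reply from the standby, and that a value of 0 disables the check. This is not PostgreSQL's code: would_terminate and the reply timelines are hypothetical. In the debug output above, the keepalive was sent at 13:21:33 and the walsender terminated at 13:21:34, which fits a standby that stayed silent for about a second.

```python
from typing import Optional, Sequence


def would_terminate(timeout_ms: int, reply_times_ms: Sequence[int], end_ms: int) -> Optional[int]:
    """Return the time (ms) at which the modelled walsender gives up, or None.

    Simplified, hypothetical model: the clock starts at 0 with a fresh reply,
    and the walsender terminates as soon as timeout_ms passes without a reply.
    """
    if timeout_ms <= 0:  # a value of 0 disables the timeout check
        return None
    last_reply = 0
    # Treat end_ms as a final check point so trailing silence is also detected.
    for t in sorted(reply_times_ms) + [end_ms]:
        if t - last_reply >= timeout_ms:
            return last_reply + timeout_ms
        last_reply = t
    return None


if __name__ == "__main__":
    # Hypothetical timeline: the standby replies at 0.4 s and 0.8 s,
    # then is busy and does not reply again until 2.5 s.
    replies = [400, 800, 2500]
    print(would_terminate(1000, replies, 3000))   # 1800
    print(would_terminate(60000, replies, 3000))  # None (60 s is the default)
```

In this model, a single 1.7-second gap between standby replies is enough to exceed a 1000 ms timeout, while the 60-second default tolerates it. Whether a real standby goes quiet that long depends on its workload, so this only illustrates the trade-off between fast detection and false alarms.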

To be honest, I don't want to increase wal_sender_timeout, because I would
like to detect network issues quickly.

After some searching, I found that someone had faced a similar issue, https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx,
which was fixed in PostgreSQL 9.4.16.

Is my issue the same as the one described here: https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx?

Is there any other way to avoid it without increasing wal_sender_timeout?

Thank you in advance. 

Regards, 

Andrei

-- 

Genius is 1% inspiration and 99% perspiration.

Thomas Alva Edison

http://pglearn.blogspot.mx/