terminating walsender process due to replication timeout

AYahorau@xxxxxxxxxxx · Mon, 13 May 2019 14:36:06 +0300

Hello PostgreSQL Community!

I have run into an issue on my Linux machine
using PostgreSQL 11.3.

I have a two-node database cluster: a master
and a standby.

When I run a large number of long-running
queries, replication between the nodes breaks and the following message is logged:

terminating walsender process due to replication timeout

Here is the output in debug mode:

2019-05-13 13:21:33 FET 00000 DEBUG:  sending replication keepalive
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 DEBUG:  CommitTransaction(1) name: unnamed; blockState: END; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-05-13 13:21:34 FET 00000 LOG:  terminating walsender process due to replication timeout

The issue is reproducible: I set up a
two-node cluster, download demo_small.zip from https://edu.postgrespro.ru/,
and run the following command:

psql -U user1 -f demo_small.sql db1

and the same replication timeout occurs.

I know I could increase wal_sender_timeout
to avoid this (it is currently set to
1 second).
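For context, here is a simplified model of the timeout rule, assuming the behaviour of PostgreSQL's walsender as I understand it: once half of wal_sender_timeout has passed without a reply from the standby, the sender sends a keepalive requesting a reply, and once a full wal_sender_timeout has passed without a reply, it terminates. The function simulate_walsender and its fixed-tick loop are hypothetical illustrations, not PostgreSQL code.

```python
from typing import List, Optional, Tuple


def simulate_walsender(
    timeout_ms: int,
    reply_times_ms: List[int],
    end_ms: int,
    tick_ms: int = 100,
) -> Tuple[List[int], Optional[int]]:
    """Hypothetical, simplified model of the walsender timeout check.

    Returns (keepalive_times, termination_time); termination_time is None
    if the sender survives until end_ms.
    """
    replies = sorted(reply_times_ms)
    last_reply = 0            # assumption: a reply was received at t=0
    waiting_for_ping = False  # a keepalive reply has been requested
    keepalives: List[int] = []
    i = 0
    for now in range(0, end_ms + 1, tick_ms):
        # Consume every standby reply that has arrived by 'now'.
        while i < len(replies) and replies[i] <= now:
            last_reply = replies[i]
            waiting_for_ping = False
            i += 1
        # Full timeout without a reply: the sender gives up.
        if now >= last_reply + timeout_ms:
            return keepalives, now
        # Half the timeout without a reply: request a keepalive reply once.
        if not waiting_for_ping and now >= last_reply + timeout_ms // 2:
            keepalives.append(now)
            waiting_for_ping = True
    return keepalives, None


# With a 1 s timeout, a standby that stops replying after t=800 ms gets a
# keepalive request at 1300 ms and the sender terminates at 1800 ms.
print(simulate_walsender(1000, [400, 800], 3000))
```

In this model, a 1-second timeout leaves the standby only about half a second to answer the keepalive; any stall longer than that (for example, while it is busy under heavy load) ends the connection.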

To be honest, I would rather not increase
wal_sender_timeout, because I want network problems to be detected quickly.

After some searching, I found that someone
had reported a similar issue, https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx,
which was fixed in PostgreSQL 9.4.16.

Is my issue the same as the one described at
https://www.postgresql.org/message-id/e082a56a-fd95-a250-3bae-0fff93832510@xxxxxxxxxxxxxxx?

Is there any other way to avoid
it without increasing wal_sender_timeout?

Thank you in advance.

Regards, 

Andrei