Hello PostgreSQL Community!
Not long ago I opened an issue regarding
the problem about:
"terminating walsender process
due to replication timeout" problem:
https://www.postgresql.org/message-id/OF85C33E30.171C1C23-ON432583F9.003F5B16-432583F9.003FBAD7%40iba.by
Thanks Everyone for your comprehensive explanations.
Basically I faced this problem because of small wal_sender_timeout
value (1 second) and quite large volume of data to be replicated.
There were some assumptions about some problems with my network. But
during my investigation I assured that it was not network problem.
I took a look in postgresql source
code. As far as I understood walsender can send some data to walreceiver
regarding some changes and so-called keepalive messages.
Exactly these keepalive messages walsender sends periodically once per
wal_sender_timeout seconds (once per 1 second in my case) and expects
to get responses from the walreceiver that everything goes ok.
I switched on trace and repeated
my test. I found out that walreceiver starts processing of the "keepalive"
message in 3 seconds after its sending.
As far as I understand it happened because walreceiver was busy by accepting
all the changes from the master and writing it to its standby WAL logs.
So the standby postgres log was overloaded overfilled by the entries regarding
writing its own WALs:
18642 2019-05-28 05:08:31
EDT 00000 DEBUG: record known xact 559 latestObservedXid 559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9C0 for
Heap/LOCK: off 53: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CE9F8 for
Heap/UPDATE: off 53 xmax 559 ; new off 129 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA48 for
Heap/LOCK: off 54: xid 559: flags 0 LOCK_ONLY EXCL_LOCK
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559
18642 2019-05-28 05:08:31 EDT 00000 CONTEXT: WAL redo at 0/112CEA80 for
Heap/UPDATE: off 54 xmax 559 ; new off 130 xmax 0
18642 2019-05-28 05:08:31 EDT 00000 DEBUG: record known xact 559 latestObservedXid
559
So because of writing large volume of
data it was not able to handle the next messages quickly. It seems not
to be related to network bandwidth or CPU saturation.
Thereby, I see some kind of a contradiction with the official description
of wal_sender_timeout parameter:
Terminate replication
connections that are inactive longer than the specified number of milliseconds.
This is useful for the sending server to detect a standby crash or network
outage.
During my test the connection between master and standby was active and
there was no any network outage. So according to the description there
was no need to terminate replication connection.
So, I have some questions:
Is there any way (e. g. via configuration of other) to make the response
time to keepalive messages independent of the amount of data that the walreceiver
has to process?
If there is no such a way is it possible to update wal_sender_timeout
documentation so it reflects reality?
Best regards,
Andrei Yahorau