I have found some questions about the same error, but none of them answers my problem.
The setup is two PostgreSQL 11 clusters (A and B) that use the publication and subscription features (logical replication) to copy data from A to B.
A (source DB- publication) --------------> B (target DB - subscription)
This works fine, but often (not always), when the volume of data being inserted into a table on node A increases, it gives the following error:
"terminating walsender process due to replication timeout"
The data volume being inserted at the moment is about 30K rows per second, continuously for hours, through the COPY command.
Earlier, wal_sender_timeout was set to 5 sec and I would see this error much more often. I then increased it to 1 min and the frequency of the error dropped. But I don't want to keep increasing it without understanding what is causing it. I looked at the code of walsender.c and know the exact lines where the error comes from.
But I am still not clear which parameter makes the sender assume that the receiver node is inactive and that it should therefore terminate the walsender process.
Can anyone please suggest what changes I should make to remove this error?
sourcedb=# show wal_sender_timeout;
wal_sender_timeout
--------------------
1min
(1 row)
sourcedb=# select * from pg_replication_slots;
   slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn   | confirmed_flush_lsn
---------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+----------------+---------------------
 sub_target_DB | pgoutput | logical   |  16501 | sourcedb | f         | t      |      68229 |      |     98839088 | 116D0/C36886F8 | 116D0/C3E5D370
targetdb=# show wal_receiver_timeout;
wal_receiver_timeout
----------------------
1min
(1 row)
targetdb=# show wal_retrieve_retry_interval ;
wal_retrieve_retry_interval
-----------------------------
5s
(1 row)
targetdb=# show wal_receiver_status_interval;
wal_receiver_status_interval
------------------------------
2s
(1 row)
targetdb=# select * from pg_stat_subscription;
   subid    |    subname    |  pid  | relid |  received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
------------+---------------+-------+-------+----------------+-------------------------------+-------------------------------+----------------+-------------------------------
 2378695757 | sub_target_DB | 62371 |       | 116D1/2BA8F170 | 2021-08-20 09:05:15.398423+09 | 2021-08-20 09:05:15.398471+09 | 116D1/2BA8F170 | 2021-08-20 09:05:15.398423+09
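The outputs above are single snapshots. To see whether the subscriber is actually falling behind while the COPY load is running, the sender-side view can be sampled on the source with queries along these lines. This is only a minimal sketch: it assumes the PostgreSQL 11 column names of pg_stat_replication and pg_replication_slots, and the aliases replay_lag_bytes and unconfirmed_bytes are illustrative names, not built-in columns.

```sql
-- Run on the source (publisher) database while the load is active.

-- Per-walsender view: how far the positions reported back by the
-- subscriber trail the current WAL write position on the source.
SELECT pid,
       application_name,
       state,
       sent_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

-- Per-slot view: WAL bytes generated on the source that the subscriber
-- has not yet confirmed as flushed for this logical slot.
SELECT slot_name,
       active,
       active_pid,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS unconfirmed_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

If these byte counts keep growing during the load, that may indicate the subscriber is not keeping up, rather than the connection being idle.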
I then increased wal_sender_timeout to 5 min, and the error started appearing more frequently instead. Not only that, it even killed the active subscription and stopped replication; I had to restart it. So clearly, just increasing wal_sender_timeout hasn't helped.