On 13/11/18 5:35 p.m., Rui DeSousa wrote:
Is there a way for the WAL receiver to not have detected the
termination of the replication stream? The teardown of the
network socket on the upstream server should send a reset
packet to the downstream server, and at that point the WAL
receiver would close its connection. Are there any firewalls,
routers, rules, etc. between the nodes that could have
dropped the packet?
No
If both nodes are up and able to communicate with each
other, then the OS should tear down the connection; why is that
not happening? You stated that you're in the cloud; cloud
providers use a software network and all endpoints have ACLs,
a.k.a. a firewall. I would also check to see if iptables is
running… it's on by default.
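For example, on a Linux host you could check along these lines
(the firewalld service name is distribution-specific):

    # list all active iptables rules with packet counters
    sudo iptables -L -n -v
    # on systemd-based distributions, check whether a firewall service is running
    systemctl status firewalld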
Have there been changes to how you deploy new ACLs, etc.?
The fact that the WAL sender terminated and the WAL receiver
did not points to the WAL receiver not being able to communicate
with the upstream server.
Shouldn't the WAL receiver normally detect this and
retry after wal_retrieve_retry_interval?
Not really… if the connection has already been torn down,
the upstream server would send another reset packet on the
next request, and in this case it would. However, if request
packets are not reaching the upstream server, i.e. due to a
firewall silently dropping the packets (personally, I believe
firewalls should always send reset packets to friendly hosts),
then what happens is that the TCP/IP send queue builds up
with the request packets instead; at that point you are
waiting on the OS to terminate the connection, which can
take a day or two depending on your TCP/IP settings.
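As a rough illustration on Linux (5432 is just the default
Postgres port), the kernel knob that bounds how long
unacknowledged data is retransmitted is tcp_retries2, and a
backed-up send queue shows in the Send-Q column:

    # how many times the kernel retransmits unacknowledged data
    # before giving up and dropping the connection
    sysctl net.ipv4.tcp_retries2
    # a growing Send-Q on the replication connection means packets are not being ACKed
    ss -tn | grep 5432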
Again no dropping, no firewall.
Again, if both nodes are up and are able to communicate,
then this should get resolved on its own.
What you want to use instead is wal_receiver_timeout to
detect the case where the upstream server either no longer
exists or a firewall, etc. is silently dropping packets.
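For reference, wal_receiver_timeout is set on the standby and
is reloadable without a restart; a sketch, with an
illustrative value:

    # terminate the replication connection if nothing is received
    # from the primary for 30 seconds
    psql -c "ALTER SYSTEM SET wal_receiver_timeout = '30s'"
    psql -c "SELECT pg_reload_conf()"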
Once again, from my original message:
"while setting up logical replication since August we had
seen early on the need to increase wal_receiver_timeout
and wal_sender_timeout from 60sec to 5mins"
So with wal_receiver_timeout='5 min', the receiver never
detected any timeout.
It should have reached the timeout, and the connection
should have been torn down. It could be that the send queue is
backed up and Postgres is hung trying to tear down the
connection. Again, a network-related issue.
I've actually run into this issue before, where I was
unable to terminate an idle session from an application node
that was holding an open transaction. The TCP/IP send queue was
backlogged, thus it would not terminate; one might consider
this a bug or not. The application server was not reachable
because a firewall was dropping the connection without sending
a reset packet on further attempts to communicate with the
application server; I had to get a network admin to drop the
connection, as bouncing Postgres was not an option on a
production system and kill -9 is a bad idea too.
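For completeness: the supported way to drop such a session is
pg_terminate_backend(), though in an incident like that one the
blocked socket can keep even a clean terminate from taking
effect (the PID here is hypothetical):

    # ask backend 12345 to exit cleanly instead of using kill -9
    psql -c "SELECT pg_terminate_backend(12345)"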
Assuming this is going to happen again, I would advise
you to capture the connection state and TCP/IP queue
information from netstat on both nodes. If you can also do a
tcpdump on the connection to see what each of the nodes is
doing, that would give you more insight.
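Something along these lines on each node (eth0 and port 5432
are placeholders):

    # connection state plus Recv-Q/Send-Q for the replication connection
    netstat -ant | grep 5432
    # watch the conversation on the wire; run on both sides and compare
    sudo tcpdump -i eth0 -nn 'tcp port 5432'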
I would also advise looking into TCP/IP keepalive settings.
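On the standby side, the keepalive probes for the walreceiver's
outbound connection come from libpq, so they can go straight
into primary_conninfo (the host and values here are only an
example):

    # recovery.conf on the standby: probe an idle connection after 60s,
    # every 10s, and give up after 5 unanswered probes
    primary_conninfo = 'host=primary.example.com user=replicator keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5'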