Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Rui DeSousa <rui@xxxxxxxxxxxxx> · Tue, 13 Nov 2018 20:52:10 -0500

On Nov 13, 2018, at 11:06 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:

    On 13/11/18 5:35 PM, Rui DeSousa
      wrote:

          On Nov 13, 2018, at 7:00 AM, Achilleas Mantzios
            <achill@xxxxxxxxxxxxxxxxxxxxx>
            wrote:

            Is there a way for the WAL receiver
              to not have detected the termination of the replication
              stream?

      The teardown of the network socket on the upstream
        server should send a reset packet to the downstream server, and
        at that point the WAL receiver would close its connection.  Are
        there any firewalls, routers, rules, etc. between the nodes that
        could have dropped the packet?

    No

If both nodes are up and able to communicate with each other, then the OS should tear down the connection; why is that not happening?  You stated that you’re in the cloud; cloud providers use a software network and all endpoints have ACLs, a.k.a. a firewall. I would also check to see if iptables is running; it is on by default.

Have there been changes to how you deploy new ACLs, etc? 

The fact that the WAL sender terminated and the WAL receiver did not points to the WAL receiver not being able to communicate with the upstream server.
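
One quick way to tell a reset from a silent drop is to attempt a fresh TCP connection from the downstream node and see how it fails. The following is only a rough sketch in Python; the function name `probe` and the host in the usage comment are made up for illustration, and a successful connect on a new socket does not prove the existing replication connection is healthy.

```python
import socket

def probe(host, port, timeout=3.0):
    """Classify a TCP connect attempt from this node to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # Something answered with a reset: the path works, nothing is listening.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets may be dropped silently.
        return "timed out"
    except OSError as exc:
        # Anything else (unreachable network, name resolution failure, ...).
        return f"error: {exc}"

# Hypothetical usage on the subscriber, pointing at the publisher:
#   print(probe("upstream.example", 5432))
```

A result of "timed out" rather than "refused" would be consistent with something between the nodes discarding packets without replying.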

          Shouldn't the WAL receiver normally detect this
            and try again after wal_retrieve_retry_interval?

      Not really… if the connection has already been torn
        down, the upstream server would send another reset packet on the
        next request, and in this case it would.  However, if request
        packets are not reaching the upstream server, e.g. due to a
        firewall silently dropping the packets (personally I believe a
        firewall should always send reset packets to friendly hosts), then
        what happens is the TCP/IP send queue builds up with the
        request packets instead. At this point you are waiting on the OS to
        terminate the connection, which can take a day or two depending on
        your TCP/IP settings.

    Again no dropping, no firewall.

Again, if both nodes are up and able to communicate, then this should get resolved on its own.

      What you want to use instead is wal_receiver_timeout
        to detect the case where the upstream server either no longer
        exists or a firewall, etc. is silently dropping packets.

    Once again from my original message :

    "while setting up logical replication since August we had seen early
    on the need to increase max_receiver_timeout and max_sender_timeout
    from 60sec to 5mins"

    So with wal_receiver_timeout='5 min', the receiver never detected
    any timeout.

It should have reached the timeout and the connection should have been torn down.  It could be that the send queue is backed up and Postgres is hung trying to tear down the connection.  Again, a network-related issue.

I’ve actually run into this issue before, where I was unable to terminate an idle session from an application node that had an open transaction.  The TCP/IP send queue was backlogged, so the session would not terminate; one might consider this a bug or not.  The application server was not reachable because a firewall was dropping the connection and not sending a reset packet on further attempts to communicate with the application server. I had to get a network admin to drop the connection, as bouncing Postgres was not an option on a production system and kill -9 is a bad idea too.

Assuming this is going to happen again, I would advise you to get the state of the connection and the TCP/IP queue information from netstat on both nodes.  If you can also do a tcpdump on the connection to see what each of the nodes is doing, that would give you more insight.
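
To make the netstat check concrete, here is a small Python sketch that picks out connections with a non-empty send queue. It assumes the column layout of Linux `netstat -tn` (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the sample output, the addresses, and port 5432 are made up for illustration and are not taken from the systems in this thread.

```python
from typing import NamedTuple, Optional

class TcpConn(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str

def parse_netstat_line(line: str) -> Optional[TcpConn]:
    """Parse one row of Linux `netstat -tn` output; return None for non-data rows."""
    fields = line.split()
    if len(fields) < 6 or not fields[0].startswith("tcp"):
        return None
    if not (fields[1].isdigit() and fields[2].isdigit()):
        return None
    return TcpConn(int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])

def backlogged(conns, port, threshold=0):
    """Connections involving `port` whose send queue exceeds `threshold` bytes."""
    suffix = f":{port}"
    return [c for c in conns
            if c.send_q > threshold
            and (c.local.endswith(suffix) or c.remote.endswith(suffix))]

# Made-up sample, shaped like `netstat -tn` output on the downstream node.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.9:48712          10.0.0.5:5432           ESTABLISHED
tcp        0  17376 10.0.0.9:48990          10.0.0.5:5432           ESTABLISHED
tcp        0      0 10.0.0.9:22             10.0.0.77:51514         ESTABLISHED
"""

conns = [c for c in map(parse_netstat_line, SAMPLE.splitlines()) if c]
for c in backlogged(conns, 5432):
    print(c.local, "->", c.remote, c.state, "send_q =", c.send_q)
```

A connection that stays ESTABLISHED on one node with a Send-Q that never drains, while the other node no longer lists it at all, would match the scenario described above.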

I would also advise looking into TCP/IP keepalive.
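
For reference, PostgreSQL exposes keepalive tuning on the server side as tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count, and libpq connection strings accept keepalives_idle, keepalives_interval and keepalives_count. To illustrate what those knobs control at the socket level, here is a minimal Python sketch; the helper names and the values 60/10/6 are only examples, not recommendations, and the per-socket option names are the Linux ones, so they are applied only if the platform provides them.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on TCP keepalive and, where supported, set the per-socket timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names exist on Linux; skip any the platform does not define.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

def approx_detection_seconds(idle, interval, count):
    """Rough time for an idle connection to a dead peer to be declared dead:
    the idle period, then `count` unanswered probes `interval` seconds apart."""
    return idle + interval * count

s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
print("approx. detection time:", approx_detection_seconds(60, 10, 6), "seconds")
s.close()
```

With settings like these, an idle connection to a dead peer is noticed in roughly two minutes instead of waiting on the OS defaults. Note that keepalive probes are only sent on an otherwise idle connection; unacknowledged data sitting in the send queue is governed by the retransmission timers instead.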