RE: An I/O error occurred while sending to the backend (PG 13.4)

"ldh@xxxxxxxxxxxxxxxxxx" <ldh@xxxxxxxxxxxxxxxxxx> · Tue, 1 Mar 2022 16:28:31 +0000

   >  -----Original Message-----
   >  From: Justin Pryzby <pryzby@xxxxxxxxxxxxx>
   >  Sent: Monday, February 28, 2022 17:05
   >  To: ldh@xxxxxxxxxxxxxxxxxx
   >  Cc: pgsql-performance@xxxxxxxxxxxxxx
   >  Subject: Re: An I/O error occurred while sending to the backend (PG 13.4)
   >  
   >  On Mon, Feb 28, 2022 at 09:43:09PM +0000, ldh@xxxxxxxxxxxxxxxxxx wrote:
   >  >    On Wed, Feb 23, 2022 at 07:04:15PM -0600, Justin Pryzby wrote:
   >  >    >  > And the aforementioned network trace.  You could set a capture filter on TCP
   >  >    >  > SYN|RST so it's not absurdly large.  From my notes, it might look like this:
   >  >    >  > (tcp[tcpflags]&(tcp-rst|tcp-syn|tcp-fin)!=0)
   >  >    >
   >  >    >  I'd also add '|| icmp'.  My hunch is that you'll see some ICMP (not "ping")
   >  >    >  being sent by an intermediate gateway, resulting in the connection being
   >  >    >  reset.
   >  >
   >  > I am so sorry but I do not understand what you are asking me to do. I am
   >  > unfamiliar with these commands. Is this a postgres configuration file? Is this
   >  > something I just do once or something I leave on to hopefully catch it when
   >  > the issue occurs? Is this something to do on the DB machine or the ETL
   >  > machine? FYI:
   >  
   >  It's no problem.
   >  
   >  I suggest that you run wireshark with a capture filter to try to show *why*
   >  the connections are failing.  I think the capture filter might look like:
   >  
   >  (icmp || (tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin)!=0)) && host 10.64.17.211
   >  
   >  With the "host" filtering for the IP address of the *remote* machine.
   >  
   >  You could run that on whichever machine is more convenient and leave it
   >  running for however long it takes for that error to happen.  You'll be able to
   >  save a .pcap file for inspection.  I suppose it'll show either a TCP RST or an
   >  ICMP.  Whichever side sent that is where the problem is.  I still suspect the
   >  issue isn't in postgres.
   >  
   >  >   - My ETL machine is on 10.64.17.211
   >  >   - My DB machine is on 10.64.17.210
   >  >   - Both on Windows Server 2012 R2, x64
   >  
   >  These network details make my theory unlikely.
   >  
   >  They're on the same subnet with no intermediate gateways, and
   >  communicate directly via a hub/switch/crossover cable.  If that's true, then
   >  both will have each other's hardware address in ARP after pinging from one
   >  to the other.
   >  
   >  --
   >  Justin
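
To make sure I understand what the capture filter quoted above keeps, here is a minimal Python sketch of the flag test it performs. The bit values are the standard TCP header flag bits; the function names matches_filter and build_capture_filter are hypothetical helpers made up for this illustration and are not part of Wireshark or any library.

```python
# Standard TCP header flag bits (low byte of the flags field).
TCP_FIN = 0x01
TCP_SYN = 0x02
TCP_RST = 0x04
TCP_PSH = 0x08
TCP_ACK = 0x10

# Same mask as (tcp-rst|tcp-syn|tcp-fin) in the capture filter.
FLAG_MASK = TCP_RST | TCP_SYN | TCP_FIN


def matches_filter(flags: int) -> bool:
    """Return True if the TCP clause of the filter would keep this segment."""
    return (flags & FLAG_MASK) != 0


def build_capture_filter(remote_host: str) -> str:
    """Hypothetical helper: fill the remote machine's IP into the filter text."""
    return (
        "(icmp || (tcp[tcpflags] & (tcp-rst|tcp-syn|tcp-fin) != 0)) "
        f"&& host {remote_host}"
    )


if __name__ == "__main__":
    # Ordinary data segments (PSH+ACK) are dropped; connection setup,
    # teardown and resets are kept, which keeps the capture small.
    print(matches_filter(TCP_PSH | TCP_ACK))
    print(matches_filter(TCP_RST | TCP_ACK))
    print(build_capture_filter("10.64.17.211"))
```

If I read it correctly, only connection open, close, and reset packets (plus any ICMP) to or from the given host end up in the .pcap file.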

Yes, the machines ARE on the same subnet. I have been told they are even in the same physical rack. When I run a tracert from the ETL machine, I get this:

Tracing route to PRODDB.xxx.int [10.64.17.210] over a maximum of 30 hops:
  1     1 ms    <1 ms    <1 ms  PRODDB.xxx.int [10.64.17.210]
Trace complete.
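
The single hop is consistent with both addresses being on one subnet. As a rough cross-check, here is a small Python sketch using the standard ipaddress module. I do not know the actual netmask, so the /24 prefix length is an assumption, and same_subnet is a hypothetical helper name.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int = 24) -> bool:
    """Return True if both addresses fall in the same network.

    prefix_len is an assumption (/24); replace it with the real netmask.
    """
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


if __name__ == "__main__":
    # ETL machine and DB machine from this thread.
    print(same_subnet("10.64.17.211", "10.64.17.210"))
```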

There is one additional component, I think: storage is on an array, and I am not getting a clear answer as to where it is 😊 Is it possible that something is happening at the storage layer? Could a storage issue end up being reported by Postgres as a network issue?

Also, both machines are actually VMs. I forgot to mention that, and I am not sure whether it is relevant.

Thank you,
Laurent.