Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'

Tom Lane <tgl@xxxxxxxxxxxxx> · Mon, 27 Apr 2020 09:40:39 -0400

Matthias Apitz <guru@xxxxxxxxxxx> writes:
> El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió:
>> Can you get a stack trace from the connected backend?

> (gdb) bt
> #0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
> #1  0x000000000084476c in WaitEventSetWaitBlock ()
> #2  0x0000000000844647 in WaitEventSetWait ()
> #3  0x00000000006f89d2 in secure_read ()
> #4  0x0000000000707425 in pq_recvbuf ()
> #5  0x0000000000707709 in pq_discardbytes ()
> #6  0x0000000000707aba in pq_getmessage ()
> #7  0x000000000086b478 in SocketBackend ()
> #8  0x000000000086b4c4 in ReadCommand ()
> #9  0x000000000086fda9 in PostgresMain ()

Oh, that is *very* interesting, because there is only one caller of
pq_discardbytes:

        /*
         * Allocate space for message.  If we run out of room (ridiculously
         * large message), we will elog(ERROR), but we want to discard the
         * message body so as not to lose communication sync.
         */
        PG_TRY();
        {
            enlargeStringInfo(s, len);
        }
        PG_CATCH();
        {
            if (pq_discardbytes(len) == EOF)
                ereport(COMMERROR,
                        (errcode(ERRCODE_PROTOCOL_VIOLATION),
                         errmsg("incomplete message from client")));
            ...

What this code intends to handle is the case where the client has sent a
message that is so long that the backend hasn't enough memory to buffer
it.  What's actually happened, more likely, is that the received message
length is corrupt and just appears to be large, since the client-side
trace shows that libpq has sent what it has to send and is now waiting for
a reply.  If the received length were correct then the pq_discardbytes
call would have completed after eating the message.

So what it looks like is that something is corrupting data on its
way from the client to the server.  Flaky firewall maybe?  If you're
using SSL, maybe an SSL library bug?  I'm reduced to speculation at
this point.  It's hard even to say what to do to gather more info.
If you could reproduce it then I'd suggest watching the connection
with wireshark or the like to see what data is actually going across
the wire ... but IIUC it's pretty random, so that approach seems
unlikely to work.

If you're in a position to run a modified server, you could try
inserting a debug log message:

        }
        PG_CATCH();
        {
+           elog(COMMERROR, "bogus received message length: %d", len);
            if (pq_discardbytes(len) == EOF)
                ereport(COMMERROR,

(This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.)
While this seems unlikely to teach us a huge amount, perhaps the
value of the incorrect length would be informative.

Are you always seeing this error at the exact same place so far as
the client side is concerned?  It's hard to see why a transport-level
problem would preferentially affect one particular query ...

			regards, tom lane