Matthias Apitz <guru@xxxxxxxxxxx> writes: > El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió: >> Can you get a stack trace from the connected backend? > (gdb) bt > #0 0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6 > #1 0x000000000084476c in WaitEventSetWaitBlock () > #2 0x0000000000844647 in WaitEventSetWait () > #3 0x00000000006f89d2 in secure_read () > #4 0x0000000000707425 in pq_recvbuf () > #5 0x0000000000707709 in pq_discardbytes () > #6 0x0000000000707aba in pq_getmessage () > #7 0x000000000086b478 in SocketBackend () > #8 0x000000000086b4c4 in ReadCommand () > #9 0x000000000086fda9 in PostgresMain () Oh, that is *very* interesting, because there is only one caller of pq_discardbytes: /* * Allocate space for message. If we run out of room (ridiculously * large message), we will elog(ERROR), but we want to discard the * message body so as not to lose communication sync. */ PG_TRY(); { enlargeStringInfo(s, len); } PG_CATCH(); { if (pq_discardbytes(len) == EOF) ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), errmsg("incomplete message from client"))); ... What this code intends to handle is the case where the client has sent a message that is so long that the backend hasn't enough memory to buffer it. What's actually happened, more likely, is that the received message length is corrupt and just appears to be large, since the client-side trace shows that libpq has sent what it has to send and is now waiting for a reply. If the received length were correct then the pq_discardbytes call would have completed after eating the message. So what it looks like is that something is corrupting data on its way from the client to the server. Flaky firewall maybe? If you're using SSL, maybe an SSL library bug? I'm reduced to speculation at this point. It's hard even to say what to do to gather more info. If you could reproduce it then I'd suggest watching the connection with wireshark or the like to see what data is actually going across the wire ... but IIUC it's pretty random, so that approach seems unlikely to work. If you're in a position to run a modified server, you could try inserting a debug log message: } PG_CATCH(); { + elog(COMMERROR, "bogus received message length: %d", len); if (pq_discardbytes(len) == EOF) ereport(COMMERROR, (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.) While this seems unlikely to teach us a huge amount, perhaps the value of the incorrect length would be informative. Are you always seeing this error at the exact same place so far as the client side is concerned? It's hard to see why a transport-level problem would preferentially affect one particular query ... regards, tom lane