On Monday, April 27, 2020 at 09:40:39 a.m. -0400, Tom Lane wrote:
> Matthias Apitz <guru@xxxxxxxxxxx> writes:
> > On Monday, April 27, 2020 at 08:40:04 -0400, Tom Lane wrote:
> >> Can you get a stack trace from the connected backend?
>
> > (gdb) bt
> > #0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
> > #1  0x000000000084476c in WaitEventSetWaitBlock ()
> > #2  0x0000000000844647 in WaitEventSetWait ()
> > #3  0x00000000006f89d2 in secure_read ()
> > #4  0x0000000000707425 in pq_recvbuf ()
> > #5  0x0000000000707709 in pq_discardbytes ()
> > #6  0x0000000000707aba in pq_getmessage ()
> > #7  0x000000000086b478 in SocketBackend ()
> > #8  0x000000000086b4c4 in ReadCommand ()
> > #9  0x000000000086fda9 in PostgresMain ()
>
> Oh, that is *very* interesting, because there is only one caller of
> pq_discardbytes:
>
>     /*
>      * Allocate space for message.  If we run out of room (ridiculously
>      * large message), we will elog(ERROR), but we want to discard the
>      * message body so as not to lose communication sync.
>      */
>     PG_TRY();
>     {
>         enlargeStringInfo(s, len);
>     }
>     PG_CATCH();
>     {
>         if (pq_discardbytes(len) == EOF)
>             ereport(COMMERROR,
>                     (errcode(ERRCODE_PROTOCOL_VIOLATION),
>                      errmsg("incomplete message from client")));
>         ...
>
> What this code intends to handle is the case where the client has sent
> a message so long that the backend hasn't enough memory to buffer it.
> What's actually happened, more likely, is that the received message
> length is corrupt and just appears to be large, since the client-side
> trace shows that libpq has sent what it has to send and is now waiting
> for a reply.  If the received length were correct, then the
> pq_discardbytes call would have completed after eating the message.
>
> So what it looks like is that something is corrupting data on its way
> from the client to the server.  Flaky firewall maybe?  If you're using
> SSL, maybe an SSL library bug?  I'm reduced to speculation at this
> point.  It's hard even to say what to do to gather more info.  If you
> could reproduce it, then I'd suggest watching the connection with
> wireshark or the like to see what data is actually going across the
> wire ... but IIUC it's pretty random, so that approach seems unlikely
> to work.
>
> If you're in a position to run a modified server, you could try
> inserting a debug log message:
>
>      }
>      PG_CATCH();
>      {
> +        elog(COMMERROR, "bogus received message length: %d", len);
>          if (pq_discardbytes(len) == EOF)
>              ereport(COMMERROR,
>
> (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.)
> While this seems unlikely to teach us a huge amount, perhaps the value
> of the incorrect length would be informative.
>
> Are you always seeing this error at the exact same place so far as the
> client side is concerned?  It's hard to see why a transport-level
> problem would preferentially affect one particular query ...

I've added the printout of the length in this place and in one more,
and now see this in the server's log file:

2020-05-04 10:05:49.977 CEST [32092] LOG:  invalid length 33554940 of startup packet
2020-05-04 10:05:50.207 CEST [32094] LOG:  invalid length 33554940 of startup packet
2020-05-04 12:32:50.781 CEST [20334] LOG:  bogus received message length: 1650422894
2020-05-04 12:36:41.293 CEST [20380] LOG:  bogus received message length: 1650422894
2020-05-04 12:39:39.461 CEST [20441] LOG:  bogus received message length: 1650422894
2020-05-04 13:01:50.566 CEST [24222] LOG:  bogus received message length: 1650422894

Any comments?
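
For what it's worth, the repeated value 1650422894 is 0x625F746E, i.e.
the printable ASCII bytes "b_tn" when split in network byte order, which
looks more like message payload being read as a length word than like
random bit corruption.  Below is a minimal standalone sketch to decode
such values; it is not server code, and the two constants are simply the
values from the log lines above:

#include <stdio.h>
#include <stdint.h>
#include <ctype.h>

/*
 * Print a 32-bit wire "length" as hex and as the four big-endian bytes
 * the server would have read off the socket (the protocol sends int32
 * message lengths in network byte order).
 */
static void
dump_len(uint32_t len)
{
    unsigned char b[4];
    int           i;

    b[0] = (len >> 24) & 0xFF;
    b[1] = (len >> 16) & 0xFF;
    b[2] = (len >> 8) & 0xFF;
    b[3] = len & 0xFF;

    printf("%10lu = 0x%08lX =", (unsigned long) len, (unsigned long) len);
    for (i = 0; i < 4; i++)
        printf(" %02X", b[i]);
    printf(" = \"");
    for (i = 0; i < 4; i++)
        putchar(isprint(b[i]) ? b[i] : '.');   /* '.' for non-printable */
    printf("\"\n");
}

int
main(void)
{
    dump_len(1650422894);   /* prints: 62 5F 74 6E = "b_tn" */
    dump_len(33554940);     /* prints: 02 00 01 FC = "...." */
    return 0;
}

(The startup-packet value 33554940 decodes to 0x020001FC, which is not
printable text.)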
Thanks

	matthias

--
Matthias Apitz, ✉ guru@xxxxxxxxxxx, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub