On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB?
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
>
> I had this question when I tried to measure the speed of reading data. The
> bottleneck was a read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm that network packets have size 8192 (https://pastebin.com/FD8abbiA)

Well, in most setups you can't have frames that large. The most common
limit is 1500, plus or minus some overhead. Using jumbo frames isn't that
uncommon, but it has enough problems that I don't think it's that widely
used with postgres.

> So, with a well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of the Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes. But the reason you're
seeing 8192-byte packets isn't just that we have an 8kB buffer, I think
it's also that we unconditionally set TCP_NODELAY:

#ifdef	TCP_NODELAY
	on = 1;
	if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
				   (char *) &on, sizeof(on)) < 0)
	{
		elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
		return STATUS_ERROR;
	}
#endif

With an 8KB send size, we'll often unnecessarily send some smaller
packets (for both 1500 and 9000 byte MTUs), because 8kB isn't a multiple
of the per-packet payload. Here are, e.g., the IP packet sizes for a
query returning maybe 18kB:

1500 1500 1500 1500 1500 1004
1500 1500 1500 1500 1500 1004
1500 414

The dips are where our 8KB buffer plus the disabled Nagle algorithm force
a packet boundary: with a 1500-byte MTU and typical TCP options, each
full frame carries roughly 1448 bytes of payload, so an 8192-byte flush
goes out as five full frames plus one short packet.

I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that one call) in cases where we
know there's more data to send. Which we pretty much do know, although
we'd need to pass that knowledge from pqcomm.c to be-secure.c (a rough
sketch is at the end of this mail).

It might be better to just use larger send sizes, however. I think most
kernels are going to be better than us at deciding how to chop up the
send. We're already using much larger limits when sending data from the
client (no limit for !win32, 65k for windows), and I don't recall seeing
any problem reports about that.

OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send() size.

> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund
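
A rough sketch of the MSG_MORE idea, for illustration only. This isn't
the actual postgres code path; the function name and the
more_data_pending argument are made up here, and MSG_MORE is
Linux-specific (hence the #ifdef):

/*
 * Sketch only, not postgres source.  Set MSG_MORE whenever the caller
 * knows another chunk is queued behind this one, so the kernel keeps
 * filling MTU-sized packets instead of flushing at every 8kB buffer
 * boundary, even with TCP_NODELAY enabled on the socket.
 */
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t
send_chunk(int sock, const char *buf, size_t len, bool more_data_pending)
{
	int			flags = 0;

#ifdef MSG_MORE
	if (more_data_pending)
		flags |= MSG_MORE;
#endif

	return send(sock, buf, len, flags);
}

A caller on the pqcomm.c side (e.g. its internal flush routine) would set
more_data_pending whenever it knows another buffer-full is already queued
behind the current one; that's exactly the knowledge that would have to
travel from pqcomm.c into be-secure.c.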