Hi,

On 2023-06-23 10:47:43 +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> > Hi all,
> >
> > When testing PostgreSQL, I found a performance degradation. After some
> > investigation, it ultimately reached the attached simple C program and
> > turned out that the performance degradation happens on only the xfs
> > filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> > program alternately does two things to extend a file (1) call
> > posix_fallocate() to extend by 8192 bytes
>
> This is a well known anti-pattern - it always causes problems. Do
> not do this.

Postgres' actual behaviour is more complicated than what Sawada-san's test
does. We either fallocate() multiple pages or we use pwritev() to extend by
fewer pages. I think Sawada-san wrote it when trying to narrow down a
performance issue to the "problematic" interaction, perhaps simplifying the
real workload too much.

> As it is, using fallocate/pwrite like test does is a well known
> anti-pattern:
>
> 	error = fallocate(fd, off, len);
> 	if (error == ENOSPC) {
> 		/* abort write!!! */
> 	}
> 	error = pwrite(fd, off, len);
> 	ASSERT(error != ENOSPC);
> 	if (error) {
> 		/* handle error */
> 	}
>
> Why does the code need a call to fallocate() here it prevent ENOSPC in the
> pwrite() call?

The reason we do need either fallocate or pwrite is to ensure we can later
write out the page from postgres' buffer pool without hitting ENOSPC (of
course that's still not reliable for all filesystems...).

We don't want to use *write() for larger amounts of data, because that ends up
with the kernel actually needing to write out those pages. There never is any
content in those extended pages. So for small file extensions we use writes,
and when it's more bulk work, we use fallocate.

Having a dirty page in our buffer pool that we can't write out due to ENOSPC
is bad, as that prevents our checkpoints from ever succeeding.
Thus we either need to "crash" and replay the journal, or we can't checkpoint,
with all the issues that entails.

The performance issue at hand came to be because of the workload flipping
between extending by fallocate() and extending by write(), as part of the
heuristic is the contention on the lock protecting file extensions.

Greetings,

Andres Freund
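
PS: To make the "small extensions use writes, bulk extensions use fallocate"
split concrete, here is a minimal sketch. It is not the actual postgres code.
The function name extend_file(), the SMALL_EXTEND_BLOCKS cut-over value and
the choice to treat a zero-byte write as ENOSPC are assumptions made purely
for illustration. Only the 8192 byte block size comes from the test program
discussed above, and the real heuristic also takes lock contention into
account, which this sketch ignores.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192        /* block size used by the test program */
#define SMALL_EXTEND_BLOCKS 8  /* hypothetical cut-over point, an assumption */

/*
 * Reserve space for nblocks blocks starting at byte offset off, so that a
 * later write of real data into that range should not fail with ENOSPC.
 * Returns 0 on success, or an errno value such as ENOSPC on failure.
 */
static int
extend_file(int fd, off_t off, int nblocks)
{
    static const char zeros[BLOCK_SIZE];   /* zero-initialized block */

    assert(nblocks > 0);

    if (nblocks > SMALL_EXTEND_BLOCKS)
    {
        /*
         * Bulk extension: reserve the space without handing the kernel
         * pages of zeros to write out. posix_fallocate() returns the
         * error number directly rather than setting errno.
         */
        return posix_fallocate(fd, off, (off_t) nblocks * BLOCK_SIZE);
    }

    /* Small extension: write zero-filled blocks. */
    for (int i = 0; i < nblocks; i++)
    {
        off_t  pos = off + (off_t) i * BLOCK_SIZE;
        size_t done = 0;

        while (done < BLOCK_SIZE)
        {
            ssize_t n = pwrite(fd, zeros + done, BLOCK_SIZE - done,
                               pos + (off_t) done);

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;           /* interrupted, retry */
                return errno;
            }
            if (n == 0)
                return ENOSPC;          /* assumption: no progress means full */
            done += (size_t) n;
        }
    }
    return 0;
}
```

A caller would pass the current end of the file as off and abort the
extension if the return value is non-zero, before any page for the new
blocks is dirtied in the buffer pool.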