Re: Question on slow fallocate

Andres Freund <andres@xxxxxxxxxxx> · Tue, 11 Jul 2023 15:28:05 -0700

Hi,

On 2023-06-23 10:47:43 +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> > Hi all,
> >
> > When testing PostgreSQL, I found a performance degradation. After some
> > investigation, it ultimately reached the attached simple C program and
> > turned out that the performance degradation happens on only the xfs
> > filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> > program alternately does two things to extend a file (1) call
> > posix_fallocate() to extend by 8192 bytes
>
> This is a well known anti-pattern - it always causes problems. Do
> not do this.

Postgres' actual behaviour is more complicated than what Sawada-san's test
does. We either fallocate() multiple pages or we use pwritev() to extend by
fewer pages.

I think Sawada-san wrote it when trying to narrow down a performance issue to
the "problematic" interaction, perhaps simplifying the real workload too much.

> As it is, using fallocate/pwrite like test does is a well known
> anti-pattern:
>
> 	error = fallocate(fd, off, len);
> 	if (error == ENOSPC) {
> 		/* abort write!!! */
> 	}
> 	error = pwrite(fd, off, len);
> 	ASSERT(error != ENOSPC);
> 	if (error) {
> 		/* handle error */
> 	}
>
> Why does the code need a call to fallocate() here to prevent ENOSPC in the
> pwrite() call?

The reason we do need either fallocate or pwrite is to ensure we can later
write out the page from postgres' buffer pool without hitting ENOSPC (of
course that's still not reliable for all filesystems...).  We don't want to
use *write() for larger amounts of data, because that ends up with the kernel
actually needing to write out those pages. There never is any content in those
extended pages.

So for small file extensions we use writes, and when it's more bulk work, we
use fallocate.
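
To illustrate the split described above, here is a minimal sketch in C. It is
not Postgres' actual code: the names extend_file, BLOCK_SIZE and
BULK_THRESHOLD are hypothetical, and the cutoff of 8 blocks is an arbitrary
assumption chosen only for the example. Short writes are simply reported as
ENOSPC, which real code would handle more carefully.

```c
#define _GNU_SOURCE             /* exposes pwritev() on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>              /* posix_fallocate() */
#include <sys/types.h>
#include <sys/uio.h>            /* pwritev(), struct iovec */
#include <unistd.h>

#define BLOCK_SIZE      8192    /* extension unit used by the test program */
#define BULK_THRESHOLD  8       /* hypothetical cutoff, not Postgres' value */

/*
 * Extend the file by nblocks blocks starting at byte offset off.
 * Small extensions write zeroes with pwritev(); larger ones reserve
 * space with posix_fallocate(). Returns 0 on success, else an errno value.
 */
static int
extend_file(int fd, off_t off, int nblocks)
{
    static const char zeroes[BLOCK_SIZE];   /* static storage is zero-filled */
    struct iovec iov[BULK_THRESHOLD];
    ssize_t written;
    int i;

    assert(nblocks > 0);

    if (nblocks > BULK_THRESHOLD)
    {
        /* posix_fallocate() returns the error number; it does not set errno */
        return posix_fallocate(fd, off, (off_t) nblocks * BLOCK_SIZE);
    }

    /* Every iovec entry points at the same zero block. */
    for (i = 0; i < nblocks; i++)
    {
        iov[i].iov_base = (void *) zeroes;
        iov[i].iov_len = BLOCK_SIZE;
    }

    written = pwritev(fd, iov, nblocks, off);
    if (written < 0)
        return errno;
    if (written != (ssize_t) nblocks * BLOCK_SIZE)
        return ENOSPC;          /* assumption: treat a short write as no space */
    return 0;
}
```

In both branches the caller learns about ENOSPC at extension time, which is
the point: the later write-out of the page from the buffer pool should then
not run out of space.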

Having a dirty page in our buffer pool that we can't write out due to ENOSPC
is bad, as that prevents our checkpoints from ever succeeding. Thus we either
need to "crash" and replay the journal, or we can't checkpoint, with all the
issues that entails.

The performance issue at hand came about because the workload flips between
extending by fallocate() and extending by write(): one input to the heuristic
that picks between them is the contention on the lock protecting file
extensions.

Greetings,

Andres Freund