Re: Question on slow fallocate

Masahiko Sawada <sawada.mshk@xxxxxxxxx> · Mon, 26 Jun 2023 12:17:50 +0900

On Fri, Jun 23, 2023 at 9:47 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> > Hi all,
> >
> > When testing PostgreSQL, I found a performance degradation. After some
> > investigation, it ultimately reached the attached simple C program and
> > turned out that the performance degradation happens on only the xfs
> > filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> > program alternately does two things to extend a file (1) call
> > posix_fallocate() to extend by 8192 bytes
>
> This is a well known anti-pattern - it always causes problems. Do
> not do this.
>
> > and (2) call pwrite() to
> > extend by 8192 bytes. If I do only either (1) or (2), the program is
> > completed in 2 sec, but if I do (1) and (2) alternatively, it is
> > completed in 90 sec.
>
> Well, yes. Using fallocate to extend the file has very different
> constraints to using pwrite to extend the file.
>
> > $ gcc -o test test.c
> > $ time ./test test.1 1
> > total   200000
> > fallocate       200000
> > filewrite       0
>
> No data is written here, so this is just a series of 8kB allocations
> and file size extension operations. There are no constraints here
> because it is a pure metadata operation.
>
> > real    0m1.305s
> > user    0m0.050s
> > sys     0m1.255s
> >
> > $ time ./test test.2 2
> > total   200000
> > fallocate       100000
> > filewrite       100000
> >
> > real    1m29.222s
> > user    0m0.139s
> > sys     0m3.139s
>
> Now we have fallocate extending the file and doing unwritten extent
> allocation, followed by writing into that unwritten extent which
> then does unwritten extent conversion.
>
> This introduces data vs metadata update ordering constraints to the
> workload.
>
> The problem here in that the "truncate up" operation that
> fallocate is doing to move the file size. The "truncate up" is going
> to move the on-disk file size to the end of the fallocated range via
> a journal transaction, and so it will expose the range of the
> previous write as containing valid data.
>
> However, the previous data write is still only in memory and not on
> disk. The result of journalling the file size change is that if we
> crash after the size change is made but the data is not on disk,
> we end up with lost data - the file contains zeros (or NULLs) where
> the in memory data previously existed.
>
> Go google for "NULL file data exposure" and you'll see this is a
> problem we fixed in ~2006, caused by extending the file size on disk
> without first having written all the in-memory data into the file.
> And even though we fixed the problem over 15 years ago, we still
> hear people today saying "XFS overwrites user data with NULLs!" as
> their reason for never using XFS, even though this was never true in
> the first place..
>
> The result of users demanding that we prevent poorly written
> applications from losing their data is that users get poor
> performance when their applications are poorly written. i.e. they do
> something that triggers the data integrity ordering constraints that
> users demand we work within.

Thank you for the detailed explanation.

>
> So, how to avoid the problem?
>
> With 'posix_fallocate(fd, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total   200000
> fallocate       200000
> filewrite       0
>
> real    0m2.557s
> user    0m0.025s
> sys     0m2.531s
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total   200000
> fallocate       100000
> filewrite       100000
>
> real    0m39.564s
> user    0m0.117s
> sys     0m7.535s
>
>
> With 'fallocate(fd, FALLOC_FL_KEEP_SIZE, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total   200000
> fallocate       200000
> filewrite       0
>
> real    0m2.269s
> user    0m0.037s
> sys     0m2.233s
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total   200000
> fallocate       100000
> filewrite       100000
>
> real    0m1.068s
> user    0m0.028s
> sys     0m1.040s
>
> Yup, just stop fallocate() from extending the file size and leave
> that to the pwrite() call that actually writes the data into the
> file.
>
> As it is, using fallocate/pwrite like test does is a well known
> anti-pattern:
>
>         error = fallocate(fd, off, len);
>         if (error == ENOSPC) {
>                 /* abort write!!! */
>         }
>         error = pwrite(fd, off, len);
>         ASSERT(error != ENOSPC);
>         if (error) {
>                 /* handle error */
>         }
>

The test.c and what PostgreSQL does are slightly different from the
above pattern actually: it calls fallocate and pwrites for different
8kB blocks. For example, it calls fallocate to extend the file from 0
byte to 8192 bytes, and then calls pwrite to extend the file from 8192
bytes to 16384 bytes. But it's also not a recommended use, right?

> Why does the code need a call to fallocate() here it prevent ENOSPC in the
> pwrite() call?
>
> The answer here is that it *doesn't need to use fallocate() here*.
> THat is, the fallocate() ENOSPC check before the space is allocated
> is exactly the same as the ENOSPC check done in the pwrite() call to
> see if there is space for the write to proceed.
>
> IOWs, the fallocate() call is *completely redundant*, yet it is
> actively harmful to performance in the short term (as per the
> issue in this thread) as well as being harmful for file fragmentation
> levels and filesystem longevity because it prevents the filesystem
> from optimising away unnecessary allocations. i.e. it defeats
> delayed allocation which allows filesystem to combine lots of
> small sequential write() calls in a single big contiguous extent
> allocation when the data is getting written to disk.
>
> IOWs, using fallocate() in the way described in this test is a sign
> of applicaiton developers not understanding what preallocation
> actually does and the situations where it actually provides some
> kinds of benefit.
>
> i.e. fallocate() is intended to allow applications to preallocate
> space in large chunks long before it is needed, and still have it
> available when the application actually needs to write to it. e.g.
> preallocate 10MB at a time, not have to run fallocate again until the
> existing preallocated chunk is entirely used up by the next thousand
> 8KB writes that extend the file.
>
> Using fallocate() as a replacement for "truncate up before write" is
> *not a recommended use*.

FYI, to share the background of what PostgreSQL does, when
bulk-insertions into one table are running concurrently, one process
extends the underlying files depending on how many concurrent
processes are waiting to extend. The more processes wait, the more 8kB
blocks are appended. As the current implementation, if the process
needs to extend the table by more than 8 blocks (i.e. 64kB) it uses
posix_fallocate(), otherwise it uses pwrites() (see the code[1] for
details). We don't use fallocate() for small extensions as it's slow
on some filesystems. Therefore, if a bulk-insertion process tries to
extend the table by say 5~10 blocks many times, it could use
poxis_fallocate() and pwrite() alternatively, which led to the slow
performance as I reported.

>
> > Why does it take so long in the latter case? and are there any
> > workaround or configuration changes to deal with it?
>
> Let pwrite() do the file extension because it natively handles data
> vs metadata ordering without having to flush data to disk and wait
> for it. i.e. do not use fallocate() as if it is ftruncate(). Also,
> do not use posix_fallocate() - it gives you no control over how
> preallocation is done, use fallocate() directly. And if you must use
> fallocate() before a write, use fallocate(fd, FALLOC_FL_KEEP_SIZE,
> off, len) so that the file extension is done by the pwrite() to
> avoid any metadata/data ordering constraints that might exist with
> non-data write related file size changes.
>

Thanks. Wang Yugui reported that this slow performance seems not to
happen on newer kernel versions, but is that right?

Fortunately, this behavior is still beta (PG16 beta). I will discuss
alternative solutions in the PostgreSQL community.

Regards,

[1] https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/md.c#L577

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com