Hi,

On 2023-07-19 17:25:37 +1000, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote:
> > The goal is to avoid ENOSPC at a later time. We do this before filling our own
> > in-memory buffer pool with pages containing new contents. If we have dirty
> > pages in our buffer that we can't write out due to ENOSPC, we're in trouble,
> > because we can't checkpoint. Which typically will make the ENOSPC situation
> > worse, because we also can't remove WAL / journal files without the checkpoint
> > having succeeded. Of course a successful fallocate() / pwrite() doesn't
> > guarantee that much on a COW filesystem, but there's not much we can do about
> > that, to my knowledge.
>
> Yup, which means you're screwed on XFS, ZFS and btrfs right now, and
> also bcachefs when people start using it.

I'd be happy to hear of a better alternative... fallocate() should avoid
ENOSPC on XFS unless snapshots trigger COW on a write, correct?


> > Using fallocate() for small extensions is problematic because it a) causes
> > [...]
> > We're also working on using DIO FWIW, where using fallocate() is just about
> > mandatory...
>
> No, no it isn't. fallocate() is even more important to avoid with
> DIO than buffered IO because fallocate() completely serialises *all*
> IO to the file. That's the last thing you want with DIO given the
> only reason for using DIO is to maximising IO concurrency and
> minimise IO latency to individual files.

Not using any form of preallocation (potentially via extent size hints, as
you mention below), when multiple files are being appended to
simultaneously with DIO, does lead to terrifying levels of fragmentation
on xfs.

On a newly initialized xfs (mkfs.xfs version 6.3.0, kernel 6.5.0-rc2):

rm -f fragtest-* && fio --minimal --name fragtest-1 --buffered=0 --filesize=128MB --fallocate=none --rw write --bs=$((4096*4)) --nrfiles=10

filefrag fragtest-1.0.*
fragtest-1.0.1: 8192 extents found
fragtest-1.0.2: 8192 extents found
fragtest-1.0.3: 8192 extents found
fragtest-1.0.4: 8192 extents found
fragtest-1.0.5: 8192 extents found
fragtest-1.0.6: 8192 extents found
fragtest-1.0.7: 8192 extents found
fragtest-1.0.8: 8192 extents found
fragtest-1.0.9: 8192 extents found

On a more "aged" filesystem it's not quite as regular, but still above 7k
extents for all files. Similarly, if I use io_uring for more concurrent
IOs, there's a bit less fragmentation, presumably because sometimes two
IOs for the same file end up being issued back to back.

Of course just writing four blocks at a time is a bit extreme - I wanted
to showcase the issue here - but even with somewhat larger writes the
problem is still severe. Writing multiple files at the same time is
extremely common for us (think of a table and its indexes, or multiple
partitions of a table being filled concurrently).

It looks to me like, with a single file being written, each write only
allocates a small extent, but that extent can be extended by subsequent
writes. When 2+ files are being written, that rarely is possible, because
the space was already used for the other file(s).


> If you want to minimise fragmentation with DIO workloads, then you
> should be using extent size hints of an appropriate size. That will
> align and size extents to the hint regardless of fallocate/write
> ranges, hence this controls worst case fragmentation effectively.

That might be an option, but I'm not sure how realistic it is. Looks like
one can't adjust the extsize for a file with existing contents, if I see
this correctly.
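To be concrete, this is roughly how I understand setting such a hint via
the fsxattr ioctls would look - only a sketch, the helper name and the
example 16MB value are made up for illustration, and it relies on the
hint being set while the file is still empty:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Set an extent size hint on a newly created, still empty file. The hint
 * only influences allocations performed after it is set, hence doing it
 * at file creation time, before any data is written.
 */
int create_with_extsize_hint(const char *path, unsigned int extsize_bytes)
{
    struct fsxattr fsx;
    int fd;

    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        goto fail;

    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = extsize_bytes;    /* e.g. 16 * 1024 * 1024 */

    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
        goto fail;

    return fd;

fail:
    perror(path);
    close(fd);
    return -1;
}
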
We don't know ahead of time how large any given file will end up being,
so we can't just configure a large extsize and be done with it. Given the
above fragmentation behaviour, and the fact that extsizes can't be
adjusted later, I don't really see how we can get away from using
fallocate() to avoid fragmentation. Then there's also the issue of
extsize being xfs specific, without corresponding features in other
filesystems...


> If you want enospc guarantees for future writes, then large,
> infrequent fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do
> not use this mechanism as an anti-fragmentation mechanism, that's
> what extent size hints are for.

Is there documentation about extent size hints anywhere beyond the
paragraphs in ioctl_xfs_fsgetxattr(2)? I didn't find much...


> Use fallocate() as *little as possible*.
>
> In my experience, fine grained management of file space by userspace
> applications via fallocate() is nothing but a recipe for awful
> performance, highly variable IO latency, bad file fragmentation, and
> poor filesystem aging characteristics. Just don't do it.

I'd like to avoid it, but so far experience has shown that not using it
causes plenty of issues as well.


Somewhat tangential: I still would like a fallocate() option that
actually zeroes out new extents (via "write zeroes", if supported),
rather than just setting them up as unwritten extents. Not for "data"
files, but for WAL / journal files. Unwritten extent "conversion", or
actually extending the file, makes durable journal writes via O_DSYNC or
fdatasync() unusably slow, so one has to overwrite the file with zeroes
"manually" [0] - even though "write zeroes" would often be more
efficient.

rm -f durable-*; fio --buffered=0 --filesize=32MB --fallocate=1 --rw write --bs=$((8192)) --nrfiles=1 --ioengine io_uring --iodepth 16 --sync dsync --name durable-overwrite --overwrite 1 --name durable-nooverwrite --overwrite 0 --stonewall --name durable-nofallocate --overwrite 0 --fallocate 0 --stonewall

slow-ish nvme:

Run status group 0 (all jobs):
  WRITE: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=32.0MiB (33.6MB), run=710-710msec

Run status group 1 (all jobs):
  WRITE: bw=3224KiB/s (3302kB/s), 3224KiB/s-3224KiB/s (3302kB/s-3302kB/s), io=32.0MiB (33.6MB), run=10163-10163msec

Run status group 2 (all jobs):
  WRITE: bw=2660KiB/s (2724kB/s), 2660KiB/s-2660KiB/s (2724kB/s-2724kB/s), io=32.0MiB (33.6MB), run=12320-12320msec

fast nvme:

Run status group 0 (all jobs):
  WRITE: bw=1600MiB/s (1678MB/s), 1600MiB/s-1600MiB/s (1678MB/s-1678MB/s), io=32.0MiB (33.6MB), run=20-20msec

Run status group 1 (all jobs):
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=32.0MiB (33.6MB), run=90-90msec

Run status group 2 (all jobs):
  WRITE: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s (273MB/s-273MB/s), io=32.0MiB (33.6MB), run=123-123msec

Greetings,

Andres Freund
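
[0] In case it's useful, a minimal sketch of the kind of "manual" zero
fill I mean, assuming the fd was just opened and fallocate()d to its
final size; the helper name and the 1MB chunk size are only illustrative
and error handling is mostly elided:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite [0, size) with zeroes once, so that subsequent O_DSYNC /
 * fdatasync() journal writes hit already-written extents instead of
 * triggering unwritten extent conversion or file size updates.
 */
int zero_fill(int fd, off_t size)
{
    const size_t chunk = 1024 * 1024;
    char *buf;
    off_t off;

    /*
     * Aligned allocation so the same buffer can be used with O_DIRECT
     * (the write sizes would then additionally need to be block aligned).
     */
    if (posix_memalign((void **) &buf, 4096, chunk) != 0)
        return -1;
    memset(buf, 0, chunk);

    for (off = 0; off < size; off += chunk) {
        size_t len = size - off < (off_t) chunk ? (size_t) (size - off) : chunk;

        if (pwrite(fd, buf, len, off) != (ssize_t) len) {
            free(buf);
            return -1;
        }
    }

    free(buf);
    return fdatasync(fd);
}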