Hi,

On 2023-07-19 17:25:37 +1000, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote:
> > The goal is to avoid ENOSPC at a later time. We do this before filling our own
> > in-memory buffer pool with pages containing new contents. If we have dirty
> > pages in our buffer that we can't write out due to ENOSPC, we're in trouble,
> > because we can't checkpoint. Which typically will make the ENOSPC situation
> > worse, because we also can't remove WAL / journal files without the checkpoint
> > having succeeded. Of course a successful fallocate() / pwrite() doesn't
> > guarantee that much on a COW filesystem, but there's not much we can do about
> > that, to my knowledge.
>
> Yup, which means you're screwed on XFS, ZFS and btrfs right now, and
> also bcachefs when people start using it.

I'd be happy to hear of a better alternative... fallocate() should avoid
ENOSPC on XFS unless snapshots trigger COW on a write, correct?


> > Using fallocate() for small extensions is problematic because it a) causes
> > [...]
> > We're also working on using DIO FWIW, where using fallocate() is just about
> > mandatory...
>
> No, no it isn't. fallocate() is even more important to avoid with
> DIO than buffered IO because fallocate() completely serialises *all*
> IO to the file. That's the last thing you want with DIO given the
> only reason for using DIO is to maximising IO concurrency and
> minimise IO latency to individual files.

Not using any form of preallocation (potentially via extent size hints, as
you mention below), when multiple files are being appended to
simultaneously with DIO, does lead to terrifying levels of fragmentation
on xfs.

On a newly initialized xfs (mkfs.xfs version 6.3.0, kernel 6.5.0-rc2):

rm -f fragtest-* && fio --minimal --name fragtest-1 --buffered=0 --filesize=128MB --fallocate=none --rw write --bs=$((4096*4)) --nrfiles=10

filefrag fragtest-1.0.*
fragtest-1.0.1: 8192 extents found
fragtest-1.0.2: 8192 extents found
fragtest-1.0.3: 8192 extents found
fragtest-1.0.4: 8192 extents found
fragtest-1.0.5: 8192 extents found
fragtest-1.0.6: 8192 extents found
fragtest-1.0.7: 8192 extents found
fragtest-1.0.8: 8192 extents found
fragtest-1.0.9: 8192 extents found

On a more "aged" filesystem it's not quite as regular, but still above 7k
extents for all files. Similarly, if I use io_uring for more concurrent
IOs, there's a bit less fragmentation, presumably because sometimes two
IOs for the same file end up being issued back to back.

Of course just writing four blocks at a time is a bit extreme - I wanted
to showcase the issue here - but even with somewhat larger writes the
problem is still severe. Writing multiple files at the same time is
extremely common for us (think of a table and its indexes, or multiple
partitions of a table being filled concurrently).

It looks to me like, with a single file being written, each write only
allocates a small extent, but that extent can be extended by subsequent
writes. When 2+ files are being written, that rarely is possible, because
the space was already used for the other file(s).


> If you want to minimise fragmentation with DIO workloads, then you
> should be using extent size hints of an appropriate size. That will
> align and size extents to the hint regardless of fallocate/write
> ranges, hence this controls worst case fragmentation effectively.

That might be an option, but I'm not sure how realistic it is. Looks like
one can't adjust the extsize for a file with existing contents, if I see
this correctly.
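To be concrete, this is roughly how I understand setting such a hint via
the fsxattr ioctls would look - only a sketch, the helper name and the
example 16MB value are made up for illustration, and it relies on the
hint being set while the file is still empty:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Set an extent size hint on a newly created, still empty file. The hint
 * only influences allocations performed after it is set, hence doing it
 * at file creation time, before any data is written.
 */
int create_with_extsize_hint(const char *path, unsigned int extsize_bytes)
{
    struct fsxattr fsx;
    int fd;

    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        goto fail;

    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = extsize_bytes;    /* e.g. 16 * 1024 * 1024 */

    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
        goto fail;

    return fd;

fail:
    perror(path);
    close(fd);
    return -1;
}
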
We don't know ahead of time how large any given file will end up being,
so we can't just configure a large extsize and be done with it. Given the
above fragmentation behaviour, and the fact that extsizes can't be
adjusted later, I don't really see how we can get away from using
fallocate() to avoid fragmentation. Then there's also the issue of
extsize being xfs specific, without corresponding features in other
filesystems...


> If you want enospc guarantees for future writes, then large,
> infrequent fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do
> not use this mechanism as an anti-fragmentation mechanism, that's
> what extent size hints are for.

Is there documentation about extent size hints anywhere beyond the
paragraphs in ioctl_xfs_fsgetxattr(2)? I didn't find much...


> Use fallocate() as *little as possible*.
>
> In my experience, fine grained management of file space by userspace
> applications via fallocate() is nothing but a recipe for awful
> performance, highly variable IO latency, bad file fragmentation, and
> poor filesystem aging characteristics. Just don't do it.

I'd like to avoid it, but so far experience has shown that not using it
causes plenty of issues as well.


Somewhat tangential: I still would like a fallocate() option that
actually zeroes out new extents (via "write zeroes", if supported),
rather than just setting them up as unwritten extents. Not for "data"
files, but for WAL / journal files. Unwritten extent "conversion", or
actually extending the file, makes durable journal writes via O_DSYNC or
fdatasync() unusably slow, so one has to overwrite the file with zeroes
"manually" [0] - even though "write zeroes" would often be more
efficient.

rm -f durable-*; fio --buffered=0 --filesize=32MB --fallocate=1 --rw write --bs=$((8192)) --nrfiles=1 --ioengine io_uring --iodepth 16 --sync dsync --name durable-overwrite --overwrite 1 --name durable-nooverwrite --overwrite 0 --stonewall --name durable-nofallocate --overwrite 0 --fallocate 0 --stonewall

slow-ish nvme:

Run status group 0 (all jobs):
  WRITE: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=32.0MiB (33.6MB), run=710-710msec

Run status group 1 (all jobs):
  WRITE: bw=3224KiB/s (3302kB/s), 3224KiB/s-3224KiB/s (3302kB/s-3302kB/s), io=32.0MiB (33.6MB), run=10163-10163msec

Run status group 2 (all jobs):
  WRITE: bw=2660KiB/s (2724kB/s), 2660KiB/s-2660KiB/s (2724kB/s-2724kB/s), io=32.0MiB (33.6MB), run=12320-12320msec

fast nvme:

Run status group 0 (all jobs):
  WRITE: bw=1600MiB/s (1678MB/s), 1600MiB/s-1600MiB/s (1678MB/s-1678MB/s), io=32.0MiB (33.6MB), run=20-20msec

Run status group 1 (all jobs):
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=32.0MiB (33.6MB), run=90-90msec

Run status group 2 (all jobs):
  WRITE: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s (273MB/s-273MB/s), io=32.0MiB (33.6MB), run=123-123msec

Greetings,

Andres Freund
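
[0] In case it's useful, a minimal sketch of the kind of "manual" zero
fill I mean, assuming the fd was just opened and fallocate()d to its
final size; the helper name and the 1MB chunk size are only illustrative
and error handling is mostly elided:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite [0, size) with zeroes once, so that subsequent O_DSYNC /
 * fdatasync() journal writes hit already-written extents instead of
 * triggering unwritten extent conversion or file size updates.
 */
int zero_fill(int fd, off_t size)
{
    const size_t chunk = 1024 * 1024;
    char *buf;
    off_t off;

    /*
     * Aligned allocation so the same buffer can be used with O_DIRECT
     * (the write sizes would then additionally need to be block aligned).
     */
    if (posix_memalign((void **) &buf, 4096, chunk) != 0)
        return -1;
    memset(buf, 0, chunk);

    for (off = 0; off < size; off += chunk) {
        size_t len = size - off < (off_t) chunk ? (size_t) (size - off) : chunk;

        if (pwrite(fd, buf, len, off) != (ssize_t) len) {
            free(buf);
            return -1;
        }
    }

    free(buf);
    return fdatasync(fd);
}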