On Mon, Feb 18, 2019 at 07:35:01PM -0800, Hugh Dickins wrote:
> On Mon, 18 Feb 2019, Adam Borowski wrote:
> > I searched a bit for references that would suggest failed fallocates need to
> > be undone, and I can't seem to find any. Neither POSIX nor our man pages
> > say a word about semantics of interrupted fallocate, and both glibc's and
> > FreeBSD's fallback emulation don't rollback.
>
> To me it was self-evident: with a few awkward exceptions (awkward because
> they would have a difficult job to undo, and awkward because they argue
> against me!), a system call either succeeds or fails, or reports partial
> success. If fallocate() says it failed (and is not allowed to report
> partial success), then it should not have allocated. Especially in the
> case of RAM, when filling it up makes it rather hard to unfill (another
> persistent problem with tmpfs is the way it can occupy all of memory,
> and the OOM killer go about killing a thousand processes, but none of
> them help because the memory is occupied by a tmpfs, not by a process).
>
> Now that you question it (did I not do so at the time? I thought I did),
> I try fallocate() on btrfs and ext4 and xfs. btrfs and xfs behave as I
> expect above, failing outright with ENOSPC if it will not fit;

If only it were that simple. :/

XFS can do partial allocation and fail - it all depends on how many
extent allocations are required before ENOSPC is actually hit.
e.g. if you ask for 10GB and there is only 5GB free, it should fail
straight away.

However, if there's 20GB free in 1GB chunks, it will loop allocating
1GB extents. If something else is allocating at the same time, the
fallocate could get to, say, 8GB allocated and then hit ENOSPC. In
which case, we'll return the ENOSPC error, but we'll also leave the
8GB of space already allocated to the file there. i.e. it doesn't
clean up after itself.

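To make the two cases above concrete, here is a toy model of that
allocation loop. It is a sketch only, not XFS code: FreeSpace,
chunked_fallocate and competitor_gb_per_step are names made up for
illustration, and the 1.5GB-per-step competing writer is an assumption
chosen purely so the run stops at the 8GB figure used above.

```python
import errno


class FreeSpace:
    """Toy model of a filesystem's free space, counted in GB (hypothetical)."""

    def __init__(self, free_gb):
        self.free_gb = free_gb

    def allocate(self, gb):
        # Fail with ENOSPC if the extent does not fit, else consume the space.
        if gb > self.free_gb:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.free_gb -= gb


def chunked_fallocate(fs, want_gb, chunk_gb=1, competitor_gb_per_step=0):
    """Allocate want_gb in chunk_gb extents.

    Returns (error, gb_allocated), where error is 0 on success.
    """
    # Up-front check: a request that can never fit fails straight away.
    if want_gb > fs.free_gb:
        return errno.ENOSPC, 0

    allocated = 0
    while allocated < want_gb:
        # Assumed competing writer takes space between our extent allocations.
        fs.free_gb = max(0, fs.free_gb - competitor_gb_per_step)
        step = min(chunk_gb, want_gb - allocated)
        try:
            fs.allocate(step)
        except OSError as e:
            # No rollback: extents allocated so far stay with the file.
            return e.errno, allocated
        allocated += step
    return 0, allocated


if __name__ == "__main__":
    # 10GB wanted, 5GB free: fails immediately, nothing allocated.
    print(chunked_fallocate(FreeSpace(5), 10))
    # 10GB wanted, 20GB free, competing writer: ENOSPC with 8GB left behind.
    print(chunked_fallocate(FreeSpace(20), 10, competitor_gb_per_step=1.5))
```

In the second call the caller sees ENOSPC even though 8GB is now
attached to the file, which is the partial-failure case described
above.
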
The reason for this is that we don't know after we've performed
allocations what regions of the preallocated range were actually
allocated by the preallocation. i.e. fallocate can be run over a range
that already contains some extents - it simply skips over regions that
are already allocated. Hence we don't know what we are supposed to
clean up, and so we leave the corpse lying around for someone else
to deal with (e.g. by sparsifying the file again).

> whereas
> ext4 proceeds to fill up the filesystem, leaving it full when it says
> that it failed.

This is much the same behaviour as XFS - you see it more easily with
ext4 because it has a much smaller maximum extent size (128MB) than
XFS (8GB) and so needs to iterate multiple allocations sooner than
XFS or btrfs need to. I'm not sure what btrfs does.

> Looks like I had a choice of models to follow: the
> ext4 model would have been easier to follow, but risked OOM.

fallocate() gives you the rope to choose what is best for the
filesystem - it doesn't specify behaviour on failure precisely
because it can be very difficult (not to mention complex!) for
filesystems to unwind partial failures....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx