Re: tmpfs fails fallocate(more than DRAM)

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 19 Feb 2019 15:16:50 +1100

On Mon, Feb 18, 2019 at 07:35:01PM -0800, Hugh Dickins wrote:
> On Mon, 18 Feb 2019, Adam Borowski wrote:
> > I searched a bit for references that would suggest failed fallocates need to
> > be undone, and I can't seem to find any.  Neither POSIX nor our man pages
> > say a word about semantics of interrupted fallocate, and both glibc's and
> > FreeBSD's fallback emulation don't rollback.
> 
> To me it was self-evident: with a few awkward exceptions (awkward because
> they would have a difficult job to undo, and awkward because they argue
> against me!), a system call either succeeds or fails, or reports partial
> success.  If fallocate() says it failed (and is not allowed to report
> partial success), then it should not have allocated.  Especially in the
> case of RAM, when filling it up makes it rather hard to unfill (another
> persistent problem with tmpfs is the way it can occupy all of memory,
> and the OOM killer go about killing a thousand processes, but none of
> them help because the memory is occupied by a tmpfs, not by a process).
> 
> Now that you question it (did I not do so at the time? I thought I did),
> I try fallocate() on btrfs and ext4 and xfs.  btrfs and xfs behave as I
> expect above, failing outright with ENOSPC if it will not fit;

If only it were that simple. :/

XFS can do partial allocation and fail - it all depends on how many
extent allocations are required before ENOSPC is actually hit. e.g.
if you ask for 10GB and there is only 5GB free, it should fail
straight away. However, if there's 20GB free in 1GB chunks, it will
loop allocating 1GB extents. If something else is allocating at the
same time, the fallocate could get to, say, 8GB allocated and then
hit ENOSPC.
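
A toy model of that loop, to make the sequence concrete. This is not
XFS code: chunked_fallocate(), the fixed 1GB chunk and the rival
allocator's per-step rate are all assumptions made up for illustration.

```python
def chunked_fallocate(request_gb, free_gb, chunk_gb=1, rival_gb_per_step=0):
    """Toy model of a preallocation done as a series of extent allocations.

    After each of our extents, a hypothetical rival allocator consumes
    rival_gb_per_step of the remaining free space.
    Returns (allocated_gb, error), where error is None or "ENOSPC".
    """
    if request_gb > free_gb:
        # Clearly will not fit: fail before allocating anything.
        return 0, "ENOSPC"

    allocated = 0
    while allocated < request_gb:
        if free_gb < chunk_gb:
            # Ran out part-way through. The error is reported, but the
            # extents allocated so far are left in place.
            return allocated, "ENOSPC"
        free_gb -= chunk_gb
        allocated += chunk_gb
        free_gb = max(0, free_gb - rival_gb_per_step)
    return allocated, None
```

In this model, chunked_fallocate(10, 5) fails with nothing allocated,
while chunked_fallocate(10, 20, rival_gb_per_step=1.5) gets 8GB in
before it hits ENOSPC.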

In which case, we'll return the ENOSPC error, but we'll also leave
the 8GB of space already allocated to the file there. i.e. it
doesn't clean up after itself.

The reason for this is that we don't know after we've performed
allocations what regions of the preallocated range were actually
allocated by the preallocation. i.e. fallocate can be run over a
range that already contains some extents - it simply skips over
regions that are already allocated. Hence we don't know what we are
supposed to clean up, and so we leave the corpse lying around for
someone else to deal with (e.g. by sparsifying the file again).
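
From userspace, a cheap way to notice the leftovers is to compare
st_blocks around the call. A minimal sketch, assuming Linux and
Python's os.posix_fallocate(), which goes through the C library's
posix_fallocate() and so may be emulated where the filesystem lacks
native support. preallocate() is a hypothetical helper name, and the
result is only meaningful if nothing else is writing to the file.

```python
import os


def preallocate(fd, offset, length):
    """Try to preallocate [offset, offset + length) on fd.

    Returns (err, bytes_gained): err is 0 on success or the errno of the
    failure; bytes_gained is how much allocated space the file picked up
    across the call, success or not (st_blocks is in 512-byte units).
    """
    before = os.fstat(fd).st_blocks
    err = 0
    try:
        os.posix_fallocate(fd, offset, length)
    except OSError as e:
        err = e.errno
    after = os.fstat(fd).st_blocks
    return err, (after - before) * 512
```

If err is ENOSPC and bytes_gained is non-zero, the failed call left
space behind. Giving it back means punching a hole over the range
(fallocate() with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE), and
that is only safe if the caller knows the range held nothing
beforehand.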

> whereas
> ext4 proceeds to fill up the filesystem, leaving it full when it says
> that it failed.

This is much the same behaviour as XFS - you see it more easily with
ext4 because it has a much smaller maximum extent size (128MB) than
XFS (8GB) and so needs to iterate multiple allocations sooner than
XFS or btrfs need to.
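
The arithmetic behind that, as a best-case sketch. It assumes free
space is contiguous enough for every extent to be maximally sized, and
it reads 128MB and 8GB as binary units. extents_needed() is a made-up
name.

```python
MiB = 1 << 20
GiB = 1 << 30


def extents_needed(request_bytes, max_extent_bytes):
    """Minimum number of extent allocations for a request (ceiling division)."""
    return -(-request_bytes // max_extent_bytes)


ext4_extents = extents_needed(10 * GiB, 128 * MiB)  # 80
xfs_extents = extents_needed(10 * GiB, 8 * GiB)     # 2
```

So a 10GB request is at least 80 separate allocations with a 128MB
limit, against 2 with an 8GB limit, and every extra allocation is
another point at which ENOSPC can land part-way through.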

I'm not sure what btrfs does.

> Looks like I had a choice of models to follow: the
> ext4 model would have been easier to follow, but risked OOM.

fallocate() gives you the rope to choose what is best for the
filesystem - it doesn't specify behaviour on failure precisely
because it can be very difficult (not to mention complex!) for
filesystems to unwind partial failures....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx