Re: tmpfs fails fallocate(more than DRAM)

On Mon, 18 Feb 2019, Adam Borowski wrote:

> Hi Hugh, it turns out this problem is caused by your commit
> 1aac1400319d30786f32b9290e9cc923937b3d57:

Yes, part of the series which first enabled fallocate() on tmpfs.
You probably read most of them already, but if not, please do read
through those v3.5 commit comments on

e2d12e22c59c tmpfs: support fallocate preallocation
1635f6a74152 tmpfs: undo fallocation on failure
1aac1400319d tmpfs: quit when fallocate fills memory

where I said more about the awkward compromises made
than I would be able to bring back to mind today.

> 
> On Mon, Feb 18, 2019 at 02:34:23PM +0100, Adam Borowski wrote:
> > There's something that looks like a bug in tmpfs' implementation of
> > fallocate.  If you try to fallocate more than the available DRAM (yet
> > with plenty of swap space), it will evict everything swappable out
> > then fail, undoing all the work done so far first.
> > 
> > The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (for
> > posix_fallocate(), but our documentation doesn't mention ENOMEM for the
> > Linux-specific fallocate() either).

I can't speak for UNIX and its other relations, but it's well established
on Linux that the absence of a listed errno from the POSIX manpage or our
own manpage is no guarantee that that errno will not be returned by the
system call in question.  Those lists are really helpful for documenting
a variety of special meanings, but don't expect them to cover everything.

(Though I see that I was relieved to find EINTR given in the manpage.)

And as Matthew already said, ENOMEM is one that can very easily come back
from many system calls.  Though I disagree that it's wrong here: ENOSPC
is the errno you get when your fallocate() reaches the block limit (if
any) of the filesystem; ENOMEM is one you may hit earlier, if the kernel
is unable to complete the fallocate() with the memory currently
available.

Fallocate is not the only place where tmpfs has to make that distinction:
ENOSPC for the filesystem constraint, ENOMEM for running out of memory
(itself ambiguous: physical memory available? swap included? memcg limit?
memory overcommit limitation?).
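
For concreteness, a minimal repro sketch (mine, not from Adam's report;
the path and size are assumptions): point it at a file on a tmpfs whose
size= limit exceeds RAM, and ask for more bytes than you have RAM.

    /* falloc-enomem.c: one big fallocate() on a tmpfs file.
     * Expect ENOMEM once RAM plus reclaim are exhausted, but
     * ENOSPC if the tmpfs size= limit is hit first. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <file-on-tmpfs> <bytes>\n",
                            argv[0]);
                    return 2;
            }
            int fd = open(argv[1], O_RDWR | O_CREAT, 0600);
            if (fd < 0 || fallocate(fd, 0, 0, atoll(argv[2])) != 0) {
                    perror("fallocate");    /* ENOMEM here, not ENOSPC */
                    return 1;
            }
            puts("success");
            return 0;
    }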

> > 
> > Doing the same allocation in multiple calls -- be it via non-overlapping
> > calls or even with same offset but increasing len -- works as expected.

Its indeterminacy is the worst thing about it, I think.  I suppose that
procedure will often work, because each attempt pushes more out to
swap.  But I certainly agree that it's all an unsatisfactory compromise.
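
So the workaround amounts to something like the sketch below (the 1G
chunk size is arbitrary, and it assumes the same headers as any other
fallocate() caller):

    /* Chunked preallocation: each fallocate() call gives reclaim a
     * chance to push the previous chunks out to swap, so no single
     * call needs more than one chunk of free RAM.  Note that earlier
     * chunks stay allocated even if a later call fails: there is no
     * rollback across separate calls. */
    #define CHUNK   (1024L * 1024 * 1024)   /* 1G: arbitrary */

    static int fallocate_chunked(int fd, off_t len)
    {
            off_t off, n;

            for (off = 0; off < len; off += CHUNK) {
                    n = (len - off < CHUNK) ? len - off : CHUNK;
                    if (fallocate(fd, 0, off, n) != 0)
                            return -1;  /* errno set; earlier chunks kept */
            }
            return 0;
    }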

As I remark in one of those commit messages, I very much wish that
fallocate(2) had been defined to return a positive count on success,
to allow for partial success like write(2); but too late to change by
the time I came along.

> 
> I don't quite understand your logic there -- it seems to be done on purpose?
> 
> #   tmpfs: quit when fallocate fills memory
> #   
> #   As it stands, a large fallocate() on tmpfs is liable to fill memory with
> #   pages, freed on failure except when they run into swap, at which point
> #   they become fixed into the file despite the failure.  That feels quite
> #   wrong, to be consuming resources precisely when they're in short supply.
> 
> The page cache is just a cache, and thus running out of DRAM is in no way a
> failure (as long as there's enough underlying storage).  Like any other
> filesystem, once DRAM is full, tmpfs is supposed to start writeout.  A smart
> filesystem can mark zero pages as SWAP_MAP_FALLOC to avoid physically
> writing them out but doing so the naive hard way is at least correct.

I suggest below that we have different perceptions of tmpfs:
I see it as a RAM-based filesystem, with swap overflow; you see it
as a swap-based filesystem, caching in RAM.  I think that if it were
the latter, we'd have spent a lot more time designing its swap layout.

>     
> #   Go the other way instead: shmem_fallocate() indicate the range it has
> #   fallocated to shmem_writepage(), keeping count of pages it's allocating;
> #   shmem_writepage() reactivate instead of swapping out pages fallocated by
> #   this syscall (but happily swap out those from earlier occasions), keeping
> #   count; shmem_fallocate() compare counts and give up once the reactivated
> #   pages have started to come back to writepage (approximately: some zones
> #   would in fact recycle faster than others).
> 
> It's a weird inconsistency: why should space allocated in a previous call
> act any differently from what we allocate right now?

"weird" I'll agree with (and you're not the first person to use the word
"weird" of tmpfs in the last week!) but "inconsistency", in that context,
no.  Space allocated in a previous call has been guaranteed to the caller,
and that guarantee is likely to be what they wanted fallocate() for in
the first place.  Space allocated right now, before we return success or
failure from the system call, is still revocable.
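
For anyone reading along without the source to hand, the shape of that
mechanism is roughly this (simplified excerpts from my memory of the
v3.5 mm/shmem.c, not the literal code):

    /* shmem_fallocate() publishes the range it is working on, and
     * its counts, through inode->i_private: */
    struct shmem_falloc {
            pgoff_t start;          /* start of range being fallocated */
            pgoff_t next;           /* next page offset to be fallocated */
            pgoff_t nr_falloced;    /* pages newly allocated so far */
            pgoff_t nr_unswapped;   /* times writepage refused to swap */
    };

    /* in shmem_writepage(): reactivate pages from the range still in
     * progress instead of swapping them out, counting each refusal */
    if (falloc && index >= falloc->start && index < falloc->next) {
            falloc->nr_unswapped++;
            goto redirty;
    }

    /* in shmem_fallocate()'s loop: give up once reclaim is mostly
     * churning our own fresh pages back at us */
    if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) {
            error = -ENOMEM;        /* then undo the range and fail */
            break;
    }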

>     
> #   This is a little unusual, but works well: although we could consider the
> #   failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
> #   added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.
> 
> It breaks use of tmpfs as a regular filesystem.  In particular, you don't
> know that a program someone uses won't try to create a big file.  For
> example, Debian buildds (where I first hit this problem) have setups such
> as:
> < jcristau> kilobyte: fwiw x86-csail-01.d.o has 75g /srv/buildd tmpfs, 8g ram, 89g swap
> 
> Using tmpfs this way is reasonable: traditional filesystems spend a lot of
> effort to ensure crash consistency, and even if you disable journaling and
> barriers, they will pointlessly write out the files.  Most builds can
> succeed in far less than 8GB, not touching the disk even once.

Yes, unsatisfactory: I tried for the best compromise I could imagine.
fallocate() on tmpfs remains useful in most circumstances, but with
this peculiar failure mode once it goes beyond RAM and well into swap.

With that 8G/89G split, I think you perceive tmpfs as a swap-based
filesystem, whereas I perceive it as a RAM-based filesystem which uses
swap for overflow; so I made compromises appropriate to that view.

> 
> [...]
> 
> > This raises multiple questions:
> > * why would fallocate bother to prefault the memory instead of just
> >   reserving it?  We want to kill overcommit, but reserving swap is as good
> >   -- if there's memory pressure, our big allocation will be evicted anyway.

The only way I know of to reserve memory, respecting all the different
limiting mechanisms imposed (memcg limits, filesystem limits, zone
watermarks, ...), is to allocate it (not sure what you mean by prefault).
hugetlbfs does have a reservation system, and its very own pool of memory,
but that's not tmpfs.

> 
> I see that this particular feature is not coded yet for swap.

I expect you're right, but I don't see what you're referring to there:
ah, probably the SWAP_MAP_FALLOC mentioned above, from a comment in
shmem_writepage().  Yes, not implemented: it would handle a rare case
more efficiently, but I don't think it would change the fundamentals
at all.  Or maybe it's too long since I thought through this area,
and it really would make a difference - dunno.

> 
> > * why does it insist on doing everything in one piece?  Biggest chunk I
> >   see to be beneficial is 1G (for hugepages).

It insists on attempting to do what you ask: if you ask for one big piece,
that's what it tries for.

> 
> At the moment, a big fallocate evicts all other swappable pages.  Doing it
> piece by piece would at least allow swapping out memory it just allocated
> (if we don't yet have a way to mark it up without physically writing
> zeroes).
> 
> > * when it fails, why does it undo the work done so far?  This can matter
> >   for other reasons, such as EINTR -- and fallocate isn't expected to be
> >   atomic anyway.
> 
> I searched a bit for references that would suggest failed fallocates need to
> be undone, and I can't seem to find any.  Neither POSIX nor our man pages
> say a word about the semantics of an interrupted fallocate, and neither
> glibc's nor FreeBSD's fallback emulation rolls back.

To me it was self-evident: with a few awkward exceptions (awkward because
they would have a difficult job to undo, and awkward because they argue
against me!), a system call either succeeds or fails, or reports partial
success.  If fallocate() says it failed (and is not allowed to report
partial success), then it should not have allocated.  Especially in the
case of RAM, when filling it up makes it rather hard to unfill (another
persistent problem with tmpfs is the way it can occupy all of memory,
and the OOM killer goes about killing a thousand processes, but none of
them helps because the memory is occupied by a tmpfs, not by a process).

Now that you question it (did I not do so at the time? I thought I did),
I try fallocate() on btrfs and ext4 and xfs.  btrfs and xfs behave as I
expect above, failing outright with ENOSPC if it will not fit; whereas
ext4 proceeds to fill up the filesystem, leaving it full when it says
that it failed.  Looks like I had a choice of models to follow: the
ext4 model would have been easier to implement, but risked OOM.
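
The probe is easy to repeat (a sketch; build it 64-bit or with
-D_FILE_OFFSET_BITS=64, and ask for more than the filesystem has free):

    /* After a failed fallocate(), does the file still hold blocks?
     * btrfs and xfs left st_blocks at 0; ext4 left the filesystem
     * full, as described above. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct stat st;
            int fd = open("testfile", O_RDWR | O_CREAT, 0600);

            if (fd < 0)
                    return 1;
            if (fallocate(fd, 0, 0, 1LL << 40) != 0)    /* 1T */
                    perror("fallocate");    /* expect ENOSPC */

            fstat(fd, &st);
            printf("st_blocks after failure: %lld\n",
                   (long long)st.st_blocks);
            return 0;
    }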

> 
> But, as my understanding seems to go nearly the opposite way as your commit
> message, am I getting it wrong?  It's you not me who's a mm regular...
> 
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁
> ⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?

Actually, no.  Would s/he have a useful insight to share on fallocate()?

Hugh
