Re: tmpfs fails fallocate(more than DRAM)

Adam Borowski <kilobyte@xxxxxxxxxx> · Mon, 18 Feb 2019 21:25:34 +0100

Hi Hugh, it turns out this problem is caused by your commit
1aac1400319d30786f32b9290e9cc923937b3d57:

On Mon, Feb 18, 2019 at 02:34:23PM +0100, Adam Borowski wrote:
> There's something that looks like a bug in tmpfs' implementation of
> fallocate.  If you try to fallocate more than the available DRAM (yet
> with plenty of swap space), it will evict everything swappable out
> then fail, undoing all the work done so far first.
> 
> The returned error is ENOMEM rather than the POSIX-mandated ENOSPC (for
> posix_fallocate(), but our documentation doesn't mention ENOMEM for
> Linux-specific fallocate() either).
> 
> Doing the same allocation in multiple calls -- be it via non-overlapping
> calls or even with same offset but increasing len -- works as expected.
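
A minimal sketch of the non-overlapping variant, assuming Python's
os.posix_fallocate() (which, as far as I know, tries the fallocate syscall
first on Linux with glibc).  The helper names, the default sizes and the
use of a temporary file are mine, for illustration only:

```python
import os
import tempfile


def fallocate_in_chunks(fd, total, chunk):
    """Reserve `total` bytes for fd via several non-overlapping calls."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        # Raises OSError (e.g. ENOMEM or ENOSPC) if this piece fails;
        # pieces that already succeeded stay allocated.
        os.posix_fallocate(fd, offset, length)
        offset += length
    return offset


def demo(directory=None, total=1 << 20, chunk=256 << 10):
    """Allocate `total` bytes in a scratch file and return its final size."""
    with tempfile.TemporaryFile(dir=directory) as f:
        fallocate_in_chunks(f.fileno(), total, chunk)
        return os.fstat(f.fileno()).st_size
```

To try it against the case above, point `directory` at a tmpfs mount and
pick a `total` larger than DRAM; the defaults only exercise the loop on a
small file.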

I don't quite understand your logic there -- it seems to be done on purpose?

#   tmpfs: quit when fallocate fills memory
#   
#   As it stands, a large fallocate() on tmpfs is liable to fill memory with
#   pages, freed on failure except when they run into swap, at which point
#   they become fixed into the file despite the failure.  That feels quite
#   wrong, to be consuming resources precisely when they're in short supply.

The page cache is just a cache, and thus running out of DRAM is in no way a
failure (as long as there's enough underlying storage).  Like any other
filesystem, once DRAM is full, tmpfs is supposed to start writeout.  A smart
filesystem could mark zero pages as SWAP_MAP_FALLOC to avoid physically
writing them out, but doing it the naive, hard way is at least correct.

#   Go the other way instead: shmem_fallocate() indicate the range it has
#   fallocated to shmem_writepage(), keeping count of pages it's allocating;
#   shmem_writepage() reactivate instead of swapping out pages fallocated by
#   this syscall (but happily swap out those from earlier occasions), keeping
#   count; shmem_fallocate() compare counts and give up once the reactivated
#   pages have started to coming back to writepage (approximately: some zones
#   would in fact recycle faster than others).

It's a weird inconsistency: why should space allocated in a previous call
behave any differently from space we allocate right now?
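
To make the inconsistency concrete, here is a toy model of the counting
heuristic as I read it from the commit message.  It is not the kernel code:
the class, its fields and the strictly-oldest-first reclaim are assumptions
of mine, and real reclaim is far less orderly.

```python
import errno
from collections import deque


class ToyTmpfs:
    """Toy model: RAM holds ram_pages pages; each page records its owner."""

    def __init__(self, ram_pages):
        self.ram_pages = ram_pages
        self.lru = deque()   # resident pages, oldest first; value = owner id
        self.swapped = 0     # pages pushed out to swap so far
        self.calls = 0       # fallocate calls issued so far

    def touch_other(self, npages):
        """Add swappable pages owned by someone else (assumes they fit)."""
        self.lru.extend(["other"] * npages)

    def fallocate(self, npages):
        self.calls += 1
        me = self.calls
        falloced = unswapped = 0
        for _ in range(npages):
            while len(self.lru) >= self.ram_pages:
                victim = self.lru.popleft()
                if victim != me:
                    self.swapped += 1      # older page: swapped out happily
                    continue
                self.lru.append(victim)    # page from this call: reactivated
                unswapped += 1
                if unswapped > falloced:   # our pages keep coming back
                    # Give up and undo everything this call allocated.
                    self.lru = deque(p for p in self.lru if p != me)
                    return errno.ENOMEM
            self.lru.append(me)
            falloced += 1
        return 0
```

With 8 pages of RAM, 4 of them holding other data, one fallocate(12) in
this model swaps out the 4 other pages, then returns ENOMEM and leaves
nothing allocated.  Three calls of fallocate(4) all return 0, because by
the time reclaim reaches them the earlier pieces count as pages "from
earlier occasions".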

#   This is a little unusual, but works well: although we could consider the
#   failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
#   added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.

It breaks the use of tmpfs as a regular filesystem.  In particular, you can't
know whether some program a user runs will try to create a big file.  For
example, Debian buildds (where I first hit this problem) have setups such
as:
< jcristau> kilobyte: fwiw x86-csail-01.d.o has 75g /srv/buildd tmpfs, 8g ram, 89g swap

Using tmpfs this way is reasonable: traditional filesystems spend a lot of
effort to ensure crash consistency, and even if you disable journaling and
barriers, they will pointlessly write out the files.  Most builds can
succeed in far less than 8GB, not touching the disk even once.
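
The arithmetic behind that setup, as a rough sketch: the function name is
made up, and it deliberately ignores every other user of memory and swap.

```python
def backing_for(file_size, ram, swap):
    """Say what a file of the given size has to be backed by (same units)."""
    if file_size <= ram:
        return "ram"
    if file_size <= ram + swap:
        return "ram+swap"
    return "none"


# The buildd quoted above: 75g tmpfs, 8g ram, 89g swap.
full_tmpfs = backing_for(75, 8, 89)
```

A completely full 75g tmpfs needs swap there, but still fits within the
97g of RAM plus swap, so by this reckoning filling it is not a failure
condition.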

[...]

> This raises multiple questions:
> * why would fallocate bother to prefault the memory instead of just
>   reserving it?  We want to kill overcommit, but reserving swap is as good
>   -- if there's memory pressure, our big allocation will be evicted anyway.

I see that this particular feature is not coded yet for swap.

> * why does it insist on doing everything in one piece?  Biggest chunk I
>   see to be beneficial is 1G (for hugepages).

At the moment, a big fallocate evicts all other swappable pages.  Doing it
piece by piece would at least allow swapping out memory it just allocated
(if we don't yet have a way to mark it up without physically writing
zeroes).

> * when it fails, why does it undo the work done so far?  This can matter
>   for other reasons, such as EINTR -- and fallocate isn't expected to be
>   atomic anyway.

I searched a bit for references suggesting that a failed fallocate needs to
be undone, and couldn't find any.  Neither POSIX nor our man pages say a
word about the semantics of an interrupted fallocate, and neither glibc's
nor FreeBSD's fallback emulation rolls back.
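
For illustration, a sketch in the spirit of such a fallback emulation (not
glibc's or FreeBSD's actual code): it forces each block into existence by
rewriting its last byte, and on error simply stops.  The `fail_after`
parameter is a hypothetical knob of mine for simulating a failure midway.

```python
import errno
import os
import tempfile


def emulated_fallocate(fd, offset, length, block=4096, fail_after=None):
    """Allocate [offset, offset + length) block by block, with no rollback."""
    end = offset + length
    pos = offset
    done = 0
    while pos < end:
        if fail_after is not None and done == fail_after:
            raise OSError(errno.ENOSPC, "simulated failure")
        last = min(pos + block, end) - 1
        # Rewrite the block's last byte (a zero if it lies beyond EOF) so
        # the block gets allocated without changing the file's data.
        existing = os.pread(fd, 1, last)
        os.pwrite(fd, existing or b"\0", last)
        pos = last + 1
        done += 1


def partial_size_after_failure():
    """Fail after two of four blocks and report what is left in the file."""
    with tempfile.TemporaryFile() as f:
        try:
            emulated_fallocate(f.fileno(), 0, 4 * 4096, fail_after=2)
        except OSError:
            pass
        return os.fstat(f.fileno()).st_size
```

The helper returns 8192 here: the two blocks written before the simulated
error stay in the file, which is the "no rollback" behaviour I mean.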

But since my understanding runs nearly opposite to your commit message, am
I getting it wrong?  It's you, not me, who's an mm regular...

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?
⠈⠳⣄⠀⠀⠀⠀