Dynamically reserving swap space for MAP_NORESERVE mappings

David Hildenbrand <david@xxxxxxxxxx> · Fri, 12 Feb 2021 14:00:00 +0100

Hi,

I'm planning on making use of MAP_NORESERVE for sparse memory regions,
but I still want some way to reduce the chance of running into random
OOMs, similar to the protection that memory accounting provides for
private mappings with !MAP_NORESERVE. In short, I want dynamic
reservations of swap space.

The rough idea is having a large mmap(MAP_NORESERVE) area in which I 
dynamically populate/discard memory to control the memory consumption, 
similar to a memory allocator - but rather in the context of dynamically 
resizing VMs. In case the user requests a dangerous configuration ("add
50GB" instead of "add 5GB"), I would rather fail early in a nice way
and disallow growing the VM instead of crashing the VM later on.
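
To make the populate/discard idea concrete, here is a minimal user-space sketch, assuming Linux and page-aligned offsets. The helper names (sparse_region_*) are made up for illustration and not taken from any existing code. Note that, as far as I know, MAP_NORESERVE is ignored with vm.overcommit_memory=2.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Reserve address space only; no swap space is reserved up front. */
static void *sparse_region_create(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Populate a sub-range by writing to it; pages are allocated on first touch. */
static void sparse_region_populate(void *base, size_t offset, size_t len)
{
	memset((char *)base + offset, 0, len);
}

/* Discard a sub-range; the pages are freed but the mapping stays in place. */
static int sparse_region_discard(void *base, size_t offset, size_t len)
{
	return madvise((char *)base + offset, len, MADV_DONTNEED);
}
```

Nothing in this sketch can fail early: without a reservation, the first sign of trouble is the OOM killer during sparse_region_populate().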

For anything file-backed (MAP_SHARED) this is fairly easy: fallocate() 
can preallocate memory. If it fails, there is not sufficient backing
storage. (It might also be nice to only reserve, and not preallocate,
for hugetlbfs, but that's another story.)
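
A rough sketch of the file-backed case, assuming a memfd as the shared backing file. The helper names (backing_*) and the memfd name are hypothetical, chosen only for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a shmem-backed file of the given size; nothing is allocated yet. */
static int backing_create(off_t size)
{
	int fd = memfd_create("sparse-ram", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Preallocate [offset, offset + len); fails (e.g. ENOSPC) if storage is short. */
static int backing_reserve(int fd, off_t offset, off_t len)
{
	return fallocate(fd, 0, offset, len);
}

/* Hand the range back again without changing the file size. */
static int backing_release(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

A caller would check backing_reserve() before exposing the range to the VM and refuse the resize request if it returns -1.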

For anonymous memory / MAP_PRIVATE it's complicated. I want to avoid any 
kinds of remapping (mmap(MAP_FIXED | !MAP_NORESERVE)) within the sparse 
region, as it is expensive, I can easily run into mapping limits,
and it creates quite some problems with other parallel features that are 
enabled (e.g., userfaultfd).

So I actually want to decide myself how much memory is reserved, have a 
way to increase it (and fail if impossible) or decrease it. Doing this 
per VMA is not possible, as it's unclear what to do on VMA 
splits/unmappings.

One idea is concurrently resizing a parallel, pre-reserved 
mmap(MAP_PRIVATE|MAP_ANON) area, which would fail when trying to grow it 
via mmap(MAP_FIXED) and there is not sufficient swap. This feels kind of
wrong to achieve the goal and it might fail due to per-process limits.
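
For illustration, the parallel-area workaround could look roughly like the sketch below. It assumes Linux, page-aligned sizes, and that a writable private anonymous mapping without MAP_NORESERVE gets accounted while a PROT_NONE placeholder does not. The struct and function names (shadow_*) are invented for this example. The memory is never touched, so it only affects accounting.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

struct shadow_reservation {
	char *base;		/* start of the placeholder range */
	size_t max;		/* size of the placeholder range */
	size_t reserved;	/* bytes currently accounted */
};

/* Grab address space only: PROT_NONE + MAP_NORESERVE is not accounted. */
static int shadow_init(struct shadow_reservation *r, size_t max)
{
	void *p = mmap(NULL, max, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	r->base = p;
	r->max = max;
	r->reserved = 0;
	return 0;
}

/* Grow or shrink the accounted part; sizes must be page-aligned. */
static int shadow_resize(struct shadow_reservation *r, size_t new_size)
{
	void *p;

	if (new_size > r->max) {
		errno = EINVAL;
		return -1;
	}
	if (new_size == r->reserved)
		return 0;
	if (new_size > r->reserved)
		/* Grow: map the delta writable and accounted; may fail. */
		p = mmap(r->base + r->reserved, new_size - r->reserved,
			 PROT_READ | PROT_WRITE,
			 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	else
		/* Shrink: turn the tail back into an unaccounted placeholder. */
		p = mmap(r->base + new_size, r->reserved - new_size, PROT_NONE,
			 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
			 -1, 0);
	if (p == MAP_FAILED)
		return -1;
	r->reserved = new_size;
	return 0;
}
```

With the default heuristic overcommit mode, the grow step only rejects obviously excessive requests, so how early this fails depends on the overcommit configuration.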

My naive approach would be having a syscall that allows for 
increasing/decreasing an additional per-process reservation like:

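/* delta is assumed to be a signed number of pages. */
if (!delta)
	return 0;
if (mmap_write_lock_killable(mm))
	return -EINTR;
if (delta > 0) {
	/* Grow: charge the additional pages, fail if accounting refuses. */
	if (security_vm_enough_memory_mm(mm, delta)) {
		mmap_write_unlock(mm);
		return -ENOMEM;
	}
} else {
	/* Shrink: cannot release more than was reserved this way. */
	if (-delta > mm->extra_nr_accounted) {
		mmap_write_unlock(mm);
		return -EINVAL;
	}
	vm_unacct_memory(-delta);
}
mm->extra_nr_accounted += delta;
mmap_write_unlock(mm);
return 0;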

Or setting an explicit reservation instead / being able to observe the 
current reservation.

We could limit it to the actual size of all VMAs that are not accounted 
due to MAP_NORESERVE, so we would implicitly check for may_expand_vm(), 
as that has been checked when the mmap(MAP_NORESERVE) was created. Of 
course, we would have to update when unmapping applicable MAP_NORESERVE 
areas (will have to think about temporary remappings in user space). Not 
sure if that is required, but it feels like there should be an upper 
limit besides the one in security_vm_enough_memory_mm().
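
For reference, may_expand_vm() checks against RLIMIT_AS and, for data mappings, RLIMIT_DATA. A small sketch for inspecting both from user space; the function name is hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft limits that may_expand_vm() compares against. */
static int print_vm_limits(void)
{
	struct rlimit as, data;

	if (getrlimit(RLIMIT_AS, &as) != 0 ||
	    getrlimit(RLIMIT_DATA, &data) != 0)
		return -1;

	if (as.rlim_cur == RLIM_INFINITY)
		printf("RLIMIT_AS:   unlimited\n");
	else
		printf("RLIMIT_AS:   %llu bytes\n",
		       (unsigned long long)as.rlim_cur);

	if (data.rlim_cur == RLIM_INFINITY)
		printf("RLIMIT_DATA: unlimited\n");
	else
		printf("RLIMIT_DATA: %llu bytes\n",
		       (unsigned long long)data.rlim_cur);
	return 0;
}
```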

Which other limits do we have that we would have to consider?

Alternatives? Thoughts? Am I missing something important?

Thanks!

--
Thanks,

David / dhildenb