Dynamically reserving swap space for MAP_NORESERVE mappings

Hi,

I'm planning on making use of MAP_NORESERVE for sparse memory regions, but I still want some way to reduce the chance of running into random OOMs, comparable to the protection we get with !MAP_NORESERVE on private mappings. In other words, I want dynamic reservations of swap space.

The rough idea is having a large mmap(MAP_NORESERVE) area in which I dynamically populate/discard memory to control the memory consumption, similar to a memory allocator - but in the context of dynamically resizing VMs. In case the user requests a dangerous configuration ("add 50GB" instead of "add 5GB"), I would rather fail early in a nice way and disallow growing the VM than crash the VM later on.
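
For illustration, the reserve/discard part of that idea could look roughly like the sketch below. The helper names are made up for this example, and it assumes discarding is done with madvise(MADV_DONTNEED); populating is simply writing to the range.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve address space only. With MAP_NORESERVE the
 * mapping is not charged against the commit limit (except in strict
 * overcommit mode, where the flag is ignored).
 */
static void *sparse_reserve(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: drop the pages backing [off, off + len). off and len
 * must be page-aligned. The range stays mapped and reads back as zeroes.
 */
static int sparse_discard(void *base, size_t off, size_t len)
{
	assert(base != NULL);
	return madvise((char *)base + off, len, MADV_DONTNEED);
}
```

Nothing in this sketch can fail early when memory is short: the first write to a discarded range just faults in a fresh page, which is exactly the gap a dynamic reservation is meant to close.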

For anything file-backed (MAP_SHARED) this is fairly easy: fallocate() can preallocate memory. If it fails, there is not sufficient backing storage. (it might be nice to also only reserve and not preallocate for hugetlbfs, but that's another story)
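
A minimal sketch of the file-backed case, assuming a memfd as backing storage (the helper names and the "vm-backing" name are hypothetical; a hugetlbfs or regular file would be handled the same way as far as fallocate() is concerned):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create a sparse shmem file of the maximum size. */
static int backing_create(off_t size)
{
	int fd, err;

	assert(size > 0);
	fd = memfd_create("vm-backing", 0);
	if (fd < 0)
		return -errno;
	if (ftruncate(fd, size)) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Preallocate [off, off + len); fails if the backing storage is exhausted. */
static int backing_populate(int fd, off_t off, off_t len)
{
	return fallocate(fd, 0, off, len) ? -errno : 0;
}

/* Give [off, off + len) back without changing the file size. */
static int backing_discard(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len) ? -errno : 0;
}
```

A failing backing_populate() is the "fail in a nice way early" signal: the resize request can be rejected before the range is ever handed to the VM.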

For anonymous memory / MAP_PRIVATE it's complicated. I want to avoid any kind of remapping (mmap(MAP_FIXED | !MAP_NORESERVE)) within the sparse region, as it is expensive, I can easily run into mapping limits, and it creates quite some problems with other features that are enabled in parallel (e.g., userfaultfd).


So I actually want to decide myself how much memory is reserved, have a way to increase it (and fail if impossible) or decrease it. Doing this per VMA is not possible, as it's unclear what to do on VMA splits/unmappings.

One idea is concurrently resizing a parallel, pre-reserved mmap(MAP_PRIVATE|MAP_ANON) area, which would fail when trying to grow it via mmap(MAP_FIXED) and there is not sufficient swap. This feels kind of wrong as a way to achieve the goal, and it might fail due to per-process limits.
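
To make the parallel-area workaround concrete, here is a rough sketch under a few assumptions: the struct and function names are invented for this example, sizes are multiples of the page size, and the shadow area is never touched, so it only consumes accounting and no actual memory. How strictly the kernel refuses to grow it depends on vm.overcommit_memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for the shadow area. */
struct shadow {
	char *base;	/* start of the pre-reserved address range */
	size_t max;	/* size of the address range */
	size_t cur;	/* currently accounted bytes, starting at base */
};

/* Reserve address space only: PROT_NONE + MAP_NORESERVE is not accounted. */
static int shadow_init(struct shadow *s, size_t max)
{
	void *p = mmap(NULL, max, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED)
		return -errno;
	s->base = p;
	s->max = max;
	s->cur = 0;
	return 0;
}

/* Grow or shrink the accounted prefix of the shadow area to new_size. */
static int shadow_resize(struct shadow *s, size_t new_size)
{
	void *p;

	assert(s->cur <= s->max);
	if (new_size > s->max)
		return -EINVAL;
	if (new_size == s->cur)
		return 0;
	if (new_size > s->cur)
		/* Writable private mapping without MAP_NORESERVE: accounted. */
		p = mmap(s->base + s->cur, new_size - s->cur,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	else
		/* Replace the tail with an unaccounted placeholder again. */
		p = mmap(s->base + new_size, s->cur - new_size, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
			 -1, 0);
	if (p == MAP_FAILED)
		return -errno;
	s->cur = new_size;
	return 0;
}
```

The sparse region itself stays untouched by this; only the shadow area is remapped, which keeps userfaultfd and friends out of the picture but still burns address space and depends on per-process limits.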

My naive approach would be having a syscall that allows for increasing/decreasing an additional per-process reservation like:

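/* delta: signed number of pages; mm->extra_nr_accounted would be a new field */
if (!delta)
	return 0;
if (mmap_write_lock_killable(mm))
	return -EINTR;
if (delta > 0) {
	/* Grow: charge the pages just like an accounted mapping would */
	if (security_vm_enough_memory_mm(mm, delta)) {
		mmap_write_unlock(mm);
		return -ENOMEM;
	}
} else {
	/* Shrink: refuse to release more than is currently reserved */
	if (-delta > mm->extra_nr_accounted) {
		mmap_write_unlock(mm);
		return -EINVAL;
	}
	vm_unacct_memory(-delta);
}
mm->extra_nr_accounted += delta;
mmap_write_unlock(mm);
return 0;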

Or setting an explicit reservation instead / being able to observe the current reservation.


We could limit it to the actual size of all VMAs that are not accounted due to MAP_NORESERVE, so we would implicitly check for may_expand_vm(), as that has been checked when the mmap(MAP_NORESERVE) was created. Of course, we would have to update the limit when unmapping applicable MAP_NORESERVE areas (I will have to think about temporary remappings in user space). Not sure if that is required, but it feels like there should be an upper limit besides the one in security_vm_enough_memory_mm().

Which other limits do we have that we would have to consider?

Alternatives? Thoughts? Am I missing something important?

Thanks!

--
Thanks,

David / dhildenb





