Hi,
I'm planning on making use of MAP_NORESERVE for sparse memory regions,
but I still want some way to reduce the chance of running into random
OOMs, similar to the protection we get with !MAP_NORESERVE on private
mappings. In other words, I want dynamic reservations of swap space.
The rough idea is having a large mmap(MAP_NORESERVE) area in which I
dynamically populate/discard memory to control the memory consumption,
similar to a memory allocator - but rather in the context of dynamically
resizing VMs. In case the user requests a dangerous configuration ("add
50GB" instead of "add 5GB"), I would rather fail in a nice way early
and disallow growing a VM instead of crashing the VM later on.
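
To make the populate/discard part concrete, here is a minimal user-space
sketch. The helper names (sparse_region_*) are made up for illustration,
and offsets/lengths are assumed to be page-aligned. Note that populating
cannot fail gracefully here: without a reservation, running out of memory
shows up as an OOM kill, which is exactly the problem described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve address space only; MAP_NORESERVE skips commit accounting. */
static char *sparse_region_create(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Populate [off, off + len) by writing one byte per page. This
 * overwrites the first byte of each page, so it is only meant for
 * ranges that hold no data yet.
 */
static void sparse_region_populate(char *base, size_t off, size_t len)
{
	const size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t i;

	for (i = 0; i < len; i += page)
		((volatile char *)base)[off + i] = 0;
}

/* Discard [off, off + len); later reads see zero-filled pages again. */
static int sparse_region_discard(char *base, size_t off, size_t len)
{
	if (madvise(base + off, len, MADV_DONTNEED))
		return -errno;
	return 0;
}
```

The VMA layout never changes with this scheme, which is why I would like
to keep it and only add the reservation on top.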
For anything file-backed (MAP_SHARED) this is fairly easy: fallocate()
can preallocate memory. If it fails, there is not sufficient backing
storage. (it might be nice to also only reserve and not preallocate for
hugetlbfs, but that's another story)
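
For illustration, the file-backed case could look roughly like the sketch
below. The helper names (backing_*) are hypothetical; the fd is assumed
to refer to a tmpfs/shmem file or a memfd that is mapped MAP_SHARED
elsewhere.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate backing storage for [off, off + len).
 * Returns 0 on success or a negative errno (for example -ENOSPC when
 * there is not sufficient backing storage).
 */
static int backing_preallocate(int fd, off_t off, off_t len)
{
	if (fallocate(fd, 0, off, len))
		return -errno;
	return 0;
}

/* Give the backing storage for [off, off + len) back to the system. */
static int backing_discard(int fd, off_t off, off_t len)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, len))
		return -errno;
	return 0;
}
```

The caller can refuse to grow the VM as soon as backing_preallocate()
fails, before any guest memory is touched.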
For anonymous memory / MAP_PRIVATE it's complicated. I want to avoid any
kinds of remapping (mmap(MAP_FIXED | !MAP_NORESERVE)) within the sparse
region, as it is expensive, I can easily run into mapping limits,
and it creates quite some problems with other parallel features that are
enabled (e.g., userfaultfd).
So I actually want to decide myself how much memory is reserved, have a
way to increase it (and fail if impossible) or decrease it. Doing this
per VMA is not possible, as it's unclear what to do on VMA
splits/unmappings.
One idea is concurrently resizing a parallel, pre-reserved
mmap(MAP_PRIVATE|MAP_ANON) area, which would fail when trying to grow it
via mmap(MAP_FIXED) and there is not sufficient swap. This feels kind of
wrong to achieve the goal and it might fail due to per-process limits.
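
A rough user-space sketch of that workaround, to show what I mean. The
struct and helper names (shadow_*) are made up, and sizes are assumed to
be page-aligned. The shadow area is never touched; it only exists so the
kernel accounts its writable part. How strict the check is depends on
vm.overcommit_memory: as far as I can tell, the default heuristic mode
only rejects obvious overcommit, so this is best effort at most.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

struct shadow_resv {
	char *base;	/* start of the shadow area */
	size_t max;	/* size of the shadow area in bytes */
	size_t cur;	/* bytes currently accounted */
};

/* Reserve address space for the shadow area without accounting it. */
static int shadow_init(struct shadow_resv *r, size_t max)
{
	void *p = mmap(NULL, max, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED)
		return -errno;
	r->base = p;
	r->max = max;
	r->cur = 0;
	return 0;
}

/* Set the accounted size; fails if the kernel rejects the growth. */
static int shadow_resize(struct shadow_resv *r, size_t new_size)
{
	void *p;

	if (new_size > r->max)
		return -EINVAL;
	if (new_size == r->cur)
		return 0;

	if (new_size > r->cur)
		/* Grow: a private writable mapping gets accounted. */
		p = mmap(r->base + r->cur, new_size - r->cur,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	else
		/* Shrink: replace the tail with an unaccounted mapping. */
		p = mmap(r->base + new_size, r->cur - new_size, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE |
			 MAP_FIXED, -1, 0);
	if (p == MAP_FAILED)
		return -errno;
	r->cur = new_size;
	return 0;
}
```

Before populating N more bytes in the sparse region, the caller would
first do shadow_resize(&r, r.cur + N) and bail out if that fails.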
My naive approach would be having a syscall that allows for
increasing/decreasing an additional per-process reservation like:

	if (!delta)
		return 0;

	if (mmap_write_lock_killable(mm))
		return -EINTR;

	if (delta > 0) {
		/* Growing: fail early if the memory cannot be accounted. */
		if (security_vm_enough_memory_mm(mm, delta)) {
			mmap_write_unlock(mm);
			return -ENOMEM;
		}
	} else {
		/* Shrinking: allow dropping at most the current reservation. */
		if (-delta > mm->extra_nr_accounted) {
			mmap_write_unlock(mm);
			return -EINVAL;
		}
		vm_unacct_memory(-delta);
	}
	mm->extra_nr_accounted += delta;
	mmap_write_unlock(mm);
	return 0;

Or setting an explicit reservation instead / being able to observe the
current reservation.
We could limit it to the actual size of all VMAs that are not accounted
due to MAP_NORESERVE, so we would implicitly check for may_expand_vm(),
as that has been checked when the mmap(MAP_NORESERVE) was created. Of
course, we would have to update when unmapping applicable MAP_NORESERVE
areas (will have to think about temporary remappings in user space). Not
sure if that is required, but it feels like there should be an upper
limit besides the one in security_vm_enough_memory_mm().
Which other limits do we have that we would have to consider?
Alternatives? Thoughts? Am I missing something important?
Thanks!
--
Thanks,
David / dhildenb