On 2024/7/12 9:40, Jason A. Donenfeld wrote:
> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
>
> - It shouldn't be written to core dumps.
>   * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
>   * Easy: VM_WIPEONFORK.
>
> - It shouldn't be written to swap.
>   * Uh-oh: mlock is rlimited.
>   * Uh-oh: mlock isn't inherited by forks.
>
> - It shouldn't reserve actual memory, but it also shouldn't crash when
>   page faulting in memory if none is available
>   * Uh-oh: VM_NORESERVE means segfaults.
>
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
>
> 1) Due to being wiped during fork(), the vDSO code is already robust to
>    having the contents of the pages it reads zeroed out midway through
>    the function's execution.
>
> 2) In the absolute worst case of whatever contingency we're coding for,
>    we have the option to fallback to the getrandom() syscall, and
>    everything is fine.
>
> 3) The buffers the function uses are only ever useful for a maximum of
>    60 seconds -- a sort of cache, rather than a long term allocation.
>
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
>
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
>    zero when read back again).
> c) It is inherited by fork.
> d) It doesn't count against the mlock budget, since nothing is locked.
> e) If there's not enough memory to service a page fault, it's not fatal,
>    and no signal is sent.
>
> This way, allocations used by vDSO getrandom() can use:
>
>     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
>
> And there will be no problem with OOMing, crashing on overcommitment,
> using memory when not in use, not wiping on fork(), coredumps, or
> writing out to swap.
>
> In order to let vDSO getrandom() use this, expose these via mmap(2) as
> MAP_DROPPABLE.
>
> Note that this involves removing the MADV_FREE special case from
> sort_folio(), which according to Yu Zhao is unnecessary and will simply
> result in an extra call to shrink_folio_list() in the worst case. The
> chunk removed reenables the swapbacked flag, which we don't want for
> VM_DROPPABLE, and we can't conditionalize it here because there isn't a
> vma reference available.
>
> Finally, the provided self test ensures that this is working as desired.
>
> Cc: linux-mm@xxxxxxxxx
> Acked-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> ---
...
> diff --git a/mm/memory.c b/mm/memory.c
> index d10e616d7389..18fe893ce96d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5690,6 +5690,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>
>  	lru_gen_exit_fault();
>
> +	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
> +	if (vma->vm_flags & VM_DROPPABLE)
> +		ret &= ~VM_FAULT_OOM;
> +

I'm sorry for jumping in here. I am confused about the code in
handle_mm_fault(). Since VM_FAULT_OOM is simply dropped, will the page
fault be re-triggered right away? If so, when the OOM killer is disabled
or fails to make progress, will the page fault re-trigger again and again
because no memory is available?

I might be missing something.

Thanks.
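
To make the concern concrete, here is a minimal user-space model of my
reading of the hunk. It is only a sketch, not kernel code: the *_MODEL
constants are illustrative stand-ins whose values are assumptions, and
mask_fault_result() and faults_until_success() are hypothetical helpers
invented for this example. The model assumes that a fault result of 0
with no page mapped leads to the same access faulting again, which is
exactly the behaviour I am asking about.

```c
#include <assert.h>

/* Illustrative stand-ins: the names echo the kernel's, but the values
 * are assumptions chosen for this user-space model only. */
#define VM_DROPPABLE_MODEL    0x1u
#define VM_FAULT_OOM_MODEL    0x1u
#define VM_FAULT_SIGBUS_MODEL 0x2u

/* Mirrors the quoted hunk: clear the OOM bit for droppable mappings and
 * leave every other bit of the fault result untouched. */
static unsigned int mask_fault_result(unsigned int vm_flags, unsigned int ret)
{
	if (vm_flags & VM_DROPPABLE_MODEL)
		ret &= ~VM_FAULT_OOM_MODEL;
	return ret;
}

/*
 * Hypothetical model of one access to the mapping. Page allocation fails
 * for the first failing_attempts faults and succeeds afterwards.
 *
 * Returns:
 *   n > 0  the access succeeded on the n-th fault
 *   -1     an OOM error was reported instead of retrying
 *   0      still faulting after max_attempts (the loop being asked about)
 */
static int faults_until_success(unsigned int vm_flags, int failing_attempts,
				int max_attempts)
{
	assert(max_attempts > 0);

	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		int alloc_failed = attempt <= failing_attempts;
		unsigned int ret = alloc_failed ? VM_FAULT_OOM_MODEL : 0;

		ret = mask_fault_result(vm_flags, ret);
		if (ret & VM_FAULT_OOM_MODEL)
			return -1;	/* error surfaced, no silent retry */
		if (!alloc_failed)
			return attempt;	/* a page was really mapped */
		/* ret == 0 but nothing was mapped: the access faults again */
	}
	return 0;
}
```

In this model, a droppable mapping whose allocation fails three times
succeeds on the fourth fault, a non-droppable one reports the OOM on the
first fault, and a droppable one whose allocation never succeeds keeps
faulting until the attempt limit is hit. The last case is the one I am
worried about.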