Re: [PATCH v21 1/4] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On 07.07.24 02:26, Jason A. Donenfeld wrote:

Hi,

Having more generic support for VM_DROPPABLE sounds great; I was myself at some point looking for something like that.

The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
   * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
   * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
   * Uh-oh: mlock is rlimited.
   * Uh-oh: mlock isn't inherited by forks.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
    having the contents of the pages it reads zeroed out midway through
    the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
    we have the option to fallback to the getrandom() syscall, and
    everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
    60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
    zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with using memory when not in use, not
wiping on fork(), coredumps, or writing out to swap.

In order to let vDSO getrandom() use this, expose these via mmap(2) as
well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE.


The patch subject should talk about MAP_DROPPABLE now.

But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to be mmap() flags. Using mmap(MAP_NORESERVE|MAP_DROPPABLE) with madvise() to configure these (for users that require that) should be good enough, just like they are for existing users.

Thinking out loud, MAP_DROPPABLE also only sets a VMA flag (and does not affect memory committing like MAP_NORESERVE), right? So MAP_DROPPABLE could easily become a madvise() option as well?

(as you know, we only have limited mmap bits but plenty of madvise numbers available)
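
To make the mmap()-plus-madvise() suggestion concrete, here is a minimal userspace sketch. It is only an illustration under assumptions: alloc_state() and the HINT_* bits are hypothetical names, not anything from this series, and MAP_DROPPABLE is deliberately left out of the mmap() call because this sketch does not assume it is available in the installed headers. MADV_WIPEONFORK and MADV_DONTDUMP are the existing madvise() options that correspond to VM_WIPEONFORK and VM_DONTDUMP.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative bits reported through *applied; not a kernel or libc API. */
#define HINT_WIPEONFORK 0x1u
#define HINT_DONTDUMP   0x2u

/*
 * Hypothetical helper: map len bytes of private anonymous memory without
 * commit accounting, then configure it with madvise() rather than with
 * dedicated mmap() flags. A droppable flag from this series would be added
 * to the mmap() call (or become one more madvise() call); it is omitted
 * here because its availability is not assumed.
 *
 * Returns NULL if the mapping itself fails. Hints that the running kernel
 * rejects are simply left out of *applied.
 */
static void *alloc_state(size_t len, unsigned int *applied)
{
	void *p;

	*applied = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Zero the range in the child after fork(). */
	if (madvise(p, len, MADV_WIPEONFORK) == 0)
		*applied |= HINT_WIPEONFORK;

	/* Keep the range out of core dumps. */
	if (madvise(p, len, MADV_DONTDUMP) == 0)
		*applied |= HINT_DONTDUMP;

	return p;
}
```

A caller that strictly requires both properties would check *applied and unmap the range on a missing bit instead of continuing.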


Interestingly, when looking into something comparable in the past I stumbled over "vrange" [1], which would have had slightly different semantics (signal on reaccess). That did turn out to be more suitable for madvise() flags [2], whereby vrange evolved into MADV_VOLATILE/MADV_NONVOLATILE.

A sticky MADV_VOLATILE vs. MADV_NONVOLATILE would actually sound pretty handy. (again, with your semantics, not the signal-on-reaccess kind of thing)

([2] is in general a good read; hey, it's been 10 years since that was brought up the last time!)


There needs to be better reasoning why we have to consume three mmap bits for something that can likely be achieved without any.

Maybe that was discussed with Linus and there is a pretty good reason for that.

I'll also mention that I am unsure how MAP_DROPPABLE is supposed to interact with mlock. Maybe just like MADV_FREE currently does (no idea if that will work as intended ;) ).


[1] https://lwn.net/Articles/590991/
[2] https://lwn.net/Articles/602650/


Finally, the provided self test ensures that this is working as desired.

Cc: linux-mm@xxxxxxxxx
Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
---


[...]

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd8825273..57b8dad9adcc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
  				may_expand_vm(mm, oldflags, nrpages))
  			return -ENOMEM;
  		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
  			charged = nrpages;
  			if (security_vm_enough_memory_mm(mm, charged))
  				return -ENOMEM;

I don't quite understand this change here. If MAP_DROPPABLE does not affect memory accounting during mmap(), it should not affect the same during mprotect(). VM_NORESERVE / MAP_NORESERVE is responsible for that.

Did I miss something where MAP_DROPPABLE changes the memory accounting during mmap()?

diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..56d7535d5cf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
  	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  	VM_BUG_ON_VMA(address < vma->vm_start ||
  			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
-	__folio_set_swapbacked(folio);
+	/* VM_DROPPABLE mappings don't swap; instead they're just dropped when
+	 * under memory pressure. */
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__folio_set_swapbacked(folio);
  	__folio_set_anon(folio, vma, address, true);
  	if (likely(!folio_test_large(folio))) {
@@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  				 * plus the rmap(s) (dropped by discard:).
  				 */
  				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) ||
+				     /* Unlike MADV_FREE mappings, VM_DROPPABLE
+				      * ones can be dropped even if they've
+				      * been dirtied. */

We use

/*
 * Comment start
 * Comment end
 */

styled comments in MM.

+				     (vma->vm_flags & VM_DROPPABLE))) {
  					dec_mm_counter(mm, MM_ANONPAGES);
  					goto discard;
  				}
@@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  				 * discarded. Remap the page to page table.
  				 */
  				set_pte_at(mm, address, pvmw.pte, pteval);
-				folio_set_swapbacked(folio);
+				/* Unlike MADV_FREE mappings, VM_DROPPABLE ones
+				 * never get swap backed on failure to drop. */
+				if (!(vma->vm_flags & VM_DROPPABLE))
+					folio_set_swapbacked(folio);
  				ret = false;
  				page_vma_mapped_walk_done(&pvmw);
  				break;

Note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize MADV_FREE on PMDs. I suspect we'd want to extend that one as well for dropping support, but it would likely only be a performance improvement and would not affect functionality if left unhandled.

--
Cheers,

David / dhildenb




