Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

David Hildenbrand <david@xxxxxxxxxx> · Fri, 14 Jun 2024 09:53:51 +0200

My thinking is, we start with a madvise(MADV_PT_RECLAIM) that will
synchronously try to reclaim page tables without any asynchronous work.

Similar to MADV_COLLAPSE that only does synchronous work. Of course,

This is feasible, but I worry that some user-mode programs may not be
able to determine when to call it.

Some yes, but others clearly can :) Meaning, it's one step in the right
direction without having to worry about asynchronous work in the kernel
for now. That doesn't mean that the asynchronous option is off the table.

My previous idea was to do something similar to madvise(MADV_HUGEPAGE),
just mark the vma as being able to reclaim the pgtable, and then hand
it over to the background thread for asynchronous reclaim.

That's one option, although there might be workloads where you really 
don't have to scan asynchronously and possibly repeatedly.

For example, after virtio-mem discarded some memory it hotunplugged from 
a VM using MADV_DONTNEED (in a sequence of multiple steps), it could 
just setup a timer to free up page tables after a while exactly once. No 
need to scan repeatedly / multiple times if virtio-mem didn't remove any 
memory from a VM.

For memory allocators it could be similar: trigger it once (from another 
thread?) on a range after sufficient freeing happened. If the workload 
is mostly idle, there might not be a need to free up memory.
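
To make the "trigger it once after sufficient freeing" idea concrete, here is a rough user-space sketch. Everything in it is an assumption for illustration: MADV_PT_RECLAIM is only proposed in this thread, so the advice value is a placeholder that current kernels reject with EINVAL, and struct arena with its threshold is a hypothetical piece of allocator bookkeeping.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* HYPOTHETICAL: MADV_PT_RECLAIM does not exist yet; this placeholder value
 * is rejected with EINVAL by kernels that do not know it. */
#define MADV_PT_RECLAIM_HYPOTHETICAL 0x7ff0

/* Hypothetical allocator-side bookkeeping for one mapped arena. */
struct arena {
    void  *base;
    size_t len;
    size_t freed_since_reclaim;  /* bytes discarded since the last attempt */
    size_t reclaim_threshold;    /* trigger point chosen by the allocator */
    int    reclaim_attempts;     /* how often reclaim was requested */
};

/* Discard a page-aligned sub-range. Once enough memory has been discarded,
 * ask the kernel exactly once to reclaim empty page tables for the arena. */
static int arena_discard(struct arena *a, size_t off, size_t len)
{
    assert(off + len <= a->len);

    if (madvise((char *)a->base + off, len, MADV_DONTNEED) != 0)
        return -errno;

    a->freed_since_reclaim += len;
    if (a->freed_since_reclaim < a->reclaim_threshold)
        return 0;  /* not enough freed yet, nothing to trigger */

    a->freed_since_reclaim = 0;
    a->reclaim_attempts++;

    /* Synchronous, one-shot request; an unknown advice (EINVAL) is tolerated. */
    if (madvise(a->base, a->len, MADV_PT_RECLAIM_HYPOTHETICAL) != 0 &&
        errno != EINVAL)
        return -errno;
    return 0;
}
```

The point of the sketch is only that user space decides when a reclaim attempt is worthwhile; an idle workload that never crosses the threshold never triggers any scanning.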

(mostly focused on anonymous memory + shmem for now. With file-backed 
memory it might be different, but that has so far not been the biggest 
consumer we saw regarding page tables.)

Of course, for real asynchronous/automatic scanning in the kernel, one 
could try finding clues when scanning is reasonable: for example, mark 
page tables that have been scanned and there was nothing to reclaim, and 
mark page tables when modifying them. But such optimizations are rather 
future work I guess, because the devil is in the details.
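
A toy model of that marking idea, purely as a sketch: the struct, the flag and the helper names are hypothetical and do not correspond to anything in the kernel. A table that was scanned without success is marked, any modification clears the mark, and marked tables are skipped by later scans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_PTE 512

/* Hypothetical user-space model of one PTE table plus a scan hint. */
struct pte_table_model {
    uint64_t pte[MODEL_PTRS_PER_PTE];
    bool scanned_clean;  /* set when a scan found nothing to reclaim */
};

/* Every modification invalidates the "nothing to reclaim" mark. */
static void model_set_pte(struct pte_table_model *t, size_t idx, uint64_t val)
{
    assert(idx < MODEL_PTRS_PER_PTE);
    t->pte[idx] = val;
    t->scanned_clean = false;
}

/* Returns true if the table is empty and could be reclaimed.
 * *walked counts how many times a table was actually walked. */
static bool model_scan(struct pte_table_model *t, unsigned *walked)
{
    if (t->scanned_clean)
        return false;  /* unchanged since the last fruitless scan: skip */

    (*walked)++;
    for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++) {
        if (t->pte[i] != 0) {
            t->scanned_clean = true;  /* remember there was nothing to do */
            return false;
        }
    }
    return true;
}
```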

if we don't need any heavy locking for reclaim, we might also just
try reclaiming during MADV_DONTNEED when spanning a complete page

I think the lock held by the current solution is not too heavy and
should be acceptable.

But for MADV_FREE case, it still needs to be handled by
madvise(MADV_PT_RECLAIM) or asynchronous work.

Yes. Interestingly, reclaim code might be able to do that scanning + 
reclaim if locking is cheap.

table. That won't sort out all cases where reclaim is possible, but
with both approaches we could cover quite a lot that were discovered
to really result in a lot of empty page tables.

Yes, agree.

On top, we might implement some asynchronous scanning later. This is,
of course, TBD. Maybe we could wire up other page table scanners
(khugepaged ?) to simply reclaim empty page tables it finds as well?

This is also an idea. Another option may be some pgtable scanning paths,
such as MGLRU.

Exactly.

When scanning, we can filter out some unsuitable vmas:

       - VM_HUGETLB vma
       - VM_UFFD_WP vma

Why is UFFD_WP unsuitable? It should be suitable as long as you make
sure to really only remove page tables that are all pte_none().

Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().
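
The distinction can be sketched in a small user-space model. The bit layout and the model_* helpers below are made up for illustration and are not the kernel's pte encoding; the only point is that a marker entry counts as "none, mostly" but must still keep the table from being reclaimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL encoding, for illustration only. */
#define MODEL_PTE_PRESENT (1ULL << 0)
#define MODEL_PTE_MARKER  (1ULL << 1)  /* non-present entry carrying a marker */

/* Strict check: the entry holds nothing at all. */
static bool model_pte_none(uint64_t pte)
{
    return pte == 0;
}

/* Relaxed check: also true for a non-present marker entry. */
static bool model_pte_none_mostly(uint64_t pte)
{
    return pte == 0 ||
           (!(pte & MODEL_PTE_PRESENT) && (pte & MODEL_PTE_MARKER));
}

/* A table may only go away if every entry passes the strict check. */
static bool model_table_reclaimable(const uint64_t *ptes, size_t n)
{
    assert(ptes != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!model_pte_none(ptes[i]))
            return false;
    }
    return true;
}
```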

I *think* there is one nasty detail, and we might need an arch callback
to test if a pte *really* can be reclaimed: for example, s390x might
require us to keep some !pte_none() page tables.

While a PTE might be none, the s390x PGSTE (think of it as another
8 bytes per PTE entry stored right next to the actual page table
entries) might hold data we might have to preserve for our KVM guest.
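
Such an arch callback could look roughly like the following user-space model. All names are hypothetical, the entry count is a model constant, and treating "PGSTE is non-zero" as "holds data we must preserve" is an assumption made only to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TABLE_ENTRIES 256  /* model constant, not an arch definition */

/* Hypothetical model: PTEs plus a parallel array of per-entry side data. */
struct model_pgtable {
    uint64_t pte[MODEL_TABLE_ENTRIES];
    uint64_t pgste[MODEL_TABLE_ENTRIES];
};

/* Hypothetical arch hook: may this entry be dropped with its table? */
typedef bool (*entry_reclaimable_fn)(const struct model_pgtable *t, size_t idx);

/* Generic rule: an empty PTE is enough. */
static bool generic_entry_reclaimable(const struct model_pgtable *t, size_t idx)
{
    return t->pte[idx] == 0;
}

/* PGSTE-aware rule (assumption): side data must be empty as well. */
static bool pgste_entry_reclaimable(const struct model_pgtable *t, size_t idx)
{
    return t->pte[idx] == 0 && t->pgste[idx] == 0;
}

/* The scanner asks the hook for every entry before reclaiming the table. */
static bool model_table_reclaimable(const struct model_pgtable *t,
                                    entry_reclaimable_fn hook)
{
    assert(hook != NULL);
    for (size_t i = 0; i < MODEL_TABLE_ENTRIES; i++) {
        if (!hook(t, i))
            return false;
    }
    return true;
}
```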

Oh, thanks for adding this background information!

But that should be easy to wire up.

That's good!

       - etc

And for PTE pages that span multiple vmas, we can skip them as well.

For locking:

       - use the mmap read lock to traverse the vma tree and pgtable
       - use pmd lock for clearing pmd entry
       - use pte lock for checking empty PTE page, and release it after
         clearing pmd entry, then we can capture the changed pmd in
         pte_offset_map_lock() etc after holding this pte lock. Thanks to
         this, we don't need to hold the rmap-related locks.
       - users of pte_offset_map_lock() etc all expect the PTE page to be
         stable by using rcu lock, so use pte_free_defer() to free PTE
         pages.
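
The locking steps above can be modelled in user space roughly as follows. This is a single-threaded sketch, not kernel code: the pthread mutexes only mark where the pmd lock and pte lock would be held, the nesting order (pmd lock, then pte lock) is an assumption, and the deferred-free list is a stand-in for what pte_free_defer() achieves with RCU.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PTRS_PER_PTE 512

struct pte_page {
    pthread_mutex_t ptl;          /* models the pte lock */
    uint64_t pte[MODEL_PTRS_PER_PTE];
    struct pte_page *defer_next;  /* link on the deferred-free list */
};

struct pmd_slot {
    pthread_mutex_t pml;          /* models the pmd lock */
    struct pte_page *table;       /* NULL models a cleared pmd entry */
};

static struct pte_page *deferred_free_list;  /* stand-in for pte_free_defer() */

static struct pte_page *pte_page_alloc(void)
{
    struct pte_page *pt = calloc(1, sizeof(*pt));
    if (pt)
        pthread_mutex_init(&pt->ptl, NULL);
    return pt;
}

/* Check emptiness under the pte lock, clear the pmd entry under the pmd lock,
 * and only release the pte lock after the pmd entry has been cleared. */
static bool try_reclaim_pte_page(struct pmd_slot *pmd)
{
    bool reclaimed = false;

    pthread_mutex_lock(&pmd->pml);
    struct pte_page *pt = pmd->table;
    if (pt) {
        pthread_mutex_lock(&pt->ptl);
        bool empty = true;
        for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++) {
            if (pt->pte[i] != 0) {
                empty = false;
                break;
            }
        }
        if (empty) {
            pmd->table = NULL;                   /* clear the pmd entry */
            pt->defer_next = deferred_free_list; /* free later, not now */
            deferred_free_list = pt;
            reclaimed = true;
        }
        pthread_mutex_unlock(&pt->ptl);
    }
    pthread_mutex_unlock(&pmd->pml);
    return reclaimed;
}

/* Models a pte_offset_map_lock() user: it must cope with the table having
 * been removed and recheck the pmd after taking the pte lock. */
static bool model_map_and_set(struct pmd_slot *pmd, size_t idx, uint64_t val)
{
    assert(idx < MODEL_PTRS_PER_PTE);
    struct pte_page *pt = pmd->table;  /* unlocked read of the pmd entry */
    if (!pt)
        return false;
    pthread_mutex_lock(&pt->ptl);
    if (pmd->table != pt) {            /* pmd changed under us: back off */
        pthread_mutex_unlock(&pt->ptl);
        return false;
    }
    pt->pte[idx] = val;
    pthread_mutex_unlock(&pt->ptl);
    return true;
}

/* Models the end of a grace period: now the pages can really be freed. */
static void flush_deferred(void)
{
    while (deferred_free_list) {
        struct pte_page *pt = deferred_free_list;
        deferred_free_list = pt->defer_next;
        pthread_mutex_destroy(&pt->ptl);
        free(pt);
    }
}
```

Because the page stays allocated until flush_deferred() runs, a walker that raced with the reclaim can still safely take the pte lock and notice that the pmd entry changed.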

I once had a prototype that would scan similar to GUP-fast, using the
mmap lock in read mode and disabling local IRQs and then walking the
page table locklessly (no PTLs). Only when identifying an empty page and
ripping out the page table, it would have to do more heavy locking (back
when we required the mmap lock in write mode and other things).

Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.

Yes, I'm hoping we can do that, that will solve a lot of possible issues.

Yes, I think the protection provided by the locks above is enough. Of
course, it would be better if more people could double-check it.

I can try digging up that patch if you're interested.

Yes, that would be better, maybe it can provide more inspiration!

I pushed it to
      https://github.com/davidhildenbrand/linux/tree/page_table_reclaim

I suspect it's a non-working version (and I assume the locking is
broken, there are no VMA checks, etc), it's an old prototype. Just to
give you an idea about the lockless scanning and how I started by
triggering reclaim only when kicked-off by user space.

Many thanks! But I'm worried that on some platforms disabling the IRQ
might be more expensive than holding the lock, such as arm64? Not sure.

Scanning completely lockless (no mmap lock, no PT locks), means that --
as long as there is not much to reclaim (for most workloads the common 
case!) -- you would not affect the workload at all.

Take a look at the khugepaged logic that does mmap_read_trylock(mm) and 
makes sure to drop the mmap lock frequently due to 
khugepaged_pages_to_scan, to not affect the workload too much while 
scanning.
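
The pattern (trylock, scan a bounded batch, drop the lock, continue later) can be sketched in user space like this. It is only a model: the atomic_flag stands in for the mmap lock and is exclusive rather than a read lock, and struct mm_model, the cursor and the batch parameter are hypothetical names, with the batch playing the role that khugepaged_pages_to_scan plays for khugepaged.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an address space being scanned in batches. */
struct mm_model {
    atomic_flag lock;            /* stand-in for the mmap lock */
    size_t nr_tables;            /* how many page tables there are to look at */
    size_t next;                 /* scan cursor, survives across lock drops */
    unsigned lock_acquisitions;  /* how often the lock was actually taken */
};

static bool mm_model_trylock(struct mm_model *mm)
{
    return !atomic_flag_test_and_set(&mm->lock);
}

static void mm_model_unlock(struct mm_model *mm)
{
    atomic_flag_clear(&mm->lock);
}

/* Scan at most `batch` tables per lock hold. Returns the number of tables
 * visited; 0 means the lock was contended or the scan is complete. */
static size_t scan_one_batch(struct mm_model *mm, size_t batch)
{
    assert(batch > 0);
    if (!mm_model_trylock(mm))
        return 0;  /* never wait for the lock: the workload has priority */

    mm->lock_acquisitions++;
    size_t done = 0;
    while (done < batch && mm->next < mm->nr_tables) {
        mm->next++;  /* a real scanner would inspect the table here */
        done++;
    }
    mm_model_unlock(mm);  /* drop the lock after every batch */
    return done;
}
```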

We'll have to double check whether all anon memory cases can *properly*
handle pte_offset_map_lock() failing (not just handling it, but doing
the right thing; most of that anon-only code didn't ever run into that
issue so far, so these code paths were likely never triggered).

Yeah, I'll keep checking this out too.

For the path that will also free PTE pages in THP, we need to recheck
whether the content of pmd entry is valid after holding pmd lock or pte
lock.

4. TODO
=======

Some applications may be concerned about the overhead of scanning and
rebuilding page tables, so the following features are considered for
implementation in the future:

       - add per-process switch (via prctl)
       - add a madvise option (like THP)
       - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
         procfs file)

Perhaps we can add the refcount to PTE pages in the future as well,
which would help improve the scanning speed.

I didn't like the added complexity last time, and the problem of
handling situations where we squeeze multiple page tables into a single
"struct page".

OK, except for refcount, do you think the other three todos above are
still worth doing?

I think the question is from where we start: for example, only synchronous
reclaim vs. asynchronous reclaim. Synchronous reclaim won't really affect
workloads that do not actively trigger it, so it raises a lot less
eyebrows. ...
and some user space might have a good idea where it makes sense to try to
reclaim, and when.

So the other things you note here rather affect asynchronous reclaim, and
might be reasonable in that context. But not sure if we should start
with doing things asynchronously.

I think synchronous and asynchronous have their own advantages and
disadvantages, and are complementary. Perhaps they can be implemented at
the same time?

No strong opinion, something synchronous sounds to me like the
low-hanging fruit, that could add the infrastructure to be used by
something more advanced/asynchronous :)

--
Cheers,

David / dhildenb