Re: [QUESTION] mmap lock in khugepaged

Dev Jain <dev.jain@xxxxxxx> · Mon, 2 Dec 2024 15:24:18 +0530

On 29/11/24 5:00 pm, David Hildenbrand wrote:

Thinking about it, I am also not sure if most other code can deal with
temporary pmd_none(). These don't necessarily take the PMD lock,
because "why would they" right now.

See walk_pmd_range() as one example, if it spots pmd_none() it assumes
"there really is nothing" without checking the PMD lock.

As a more concrete example, assume someone calls MADV_DONTNEED and we
end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
we'll skip that entry without getting the PMD lock involved. That
would mean that you would be breaking MADV_DONTNEED if you managed to
collapse or not -- memory would not get discarded.

This is a real problem with anonymous memory.

Yes, I thought of different locking fashions but the problem seems to be
that any pagetable walker will choose an action path according to the 
value
it sees.

Unless I am missing something it's all very tricky and there might be
a lot of such code that assumes "if I hold a mmap lock / VMA lock in
read mode, pmd_none() means there is nothing even without holding the
PMD lock when checking".

Yes, I was betting on the fact that, if the code assumes that pmd_none()
means there is nothing, eventually when it will take the PMD lock to
write to it, it will check whether
the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED
thingy, thanks.

Note that this is just the tip of the iceberg. Most page table walkers 
that deal with anonymous memory have the same requirement.

I recall that for shmem that's "easier", because we don't have to
reinstall a PMD immediately, we cna be sure that the page table is
kept empty/unmodified, ...

All in all, the general solution seems to be that, if I can take all
pagetable walkers into an invalid state and make them backoff, then I am
safe.
> For example, we do not zero out the PMD, we take the pte PTL, we do> 
stuff, then instead of setting the PTEs to zero, we set it to a universal
invalid state upon which no pagetable walker can take an action; an
instance of that can be to set the PTE to a swap entry such that the 
fault
handler ends up in do_swap_page() ->print_bad_pte(). So now we can take
the PMD lock (in fact we don't need it since any pagetable walking
is rendered useless) and populate the PMD to resume the new pagetable
walking. Another *ridiculous* idea may be to remember the PGD we
came from and nuke it (but I guess there is an alternate path for that
in walk_pgd_range() and so on?)

What might work is introducing a PMD marker (note: we don't have PMD 
markers yet) for that purpose. Likely, the PMD marker may only be set 
while the PMD lock is being held, and we must not drop the PMD lock 
temporarily (otherwise people will spin on the value by repeadetly 
(un)locking the PMD lock, which is stupid).

Then you have to make sure that each and every page table walker 
handles that properly.

Then, there is the problem with holding the PMD lock for too long as I 
mentioned.

Last but not least, there are more issues that I haven't even 
described before (putting aside the other issue):

Our page table walkers can handle the transitions:

PMD leaf -> PMD none
* concurrent zap / MADV_DONTNEED / !anon PMD split

PMD leaf -> PTE table
* concurrent anon PMD split
* concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
  allocation

PTE table (empty) -> PMD none
* concurrent page table reclaim, part of collapsing file THPs
 * collapse_pte_mapped_thp()
 * retract_page_tables()

I think they *cannot* tolerate the transition *properly*:

PTE table (non-empty) -> PMD leaf
* what you want do do ;)

-> Observe how we skip to the next PMD in all page table walkers /  
give up if pte_offset_map_lock() and friends fail! I even think there 
are more issues hiding there with pte_offset_map_ro_nolock().

Of course, on top of all of this, you should net significantly degrade 
the ordinary page table walker performance ...

I don't want to discourage you, but it's all extremely complicated.

Thanks for the detailed reply!