Re: [QUESTION] mmap lock in khugepaged

Dev Jain <dev.jain@xxxxxxx> · Fri, 29 Nov 2024 12:40:08 +0530

+the list and Zi Yan.

On 28/11/24 5:04 pm, David Hildenbrand wrote:

Maybe, we'd have to do the isolation+copy under the PMD lock. And
currently, we have to drop the PMD lock in order to have the
pte_offset_map_lock() work IIRC.

Is there a problem in holding two page table locks simultaneously?

Depending on CONFIG_SPLIT_PTE_PTLOCKS, it might not even be two locks 
(I assume we could have such configs with khugepaged).

Not sure if there could be an issue with lock inversion.

So I suspect this no not be 100% trivial :)

Most importantly, the copy that currently runs under no spinlocks
would now run under spinlock. Up to 512 MiB on arm64 64K, not sure if
that can be a problem ... we currently seem to take care of that

But we already are taking mmap_write_lock(), so that should not matter?

We are dealing with a spinlock vs. a rwsem.

We usually want to avoid holding spinlocks for an excessive amount of 
time, because all other CPUs waiting for that lock will ... spin with 
preemption disabled instead of rescheduling and doing something useful.

Ah okay.

Further, without CONFIG_SPLIT_PMD_PTLOCKS, in fact everybody who wnats 
to take a PMD lock in that process would be spinning on the same PMD 
lock :)

I mean, if we can get rid of the mmap exclusive lock, then the copying
would still be a bottleneck, and all fault handlers will back off, but

I'm trying to digest it once again, but I'm afraid I don't see how 
fault handlers will back off.

Won't they either see pmd_none(), to then either call pte_alloc_one() 
where they would spin on the PMD lock, or try allocating a PMD THP to 
insert it, and then spin on the PMD lock, to figure out later that it 
was all in vain?

Ya that's what I meant when I said "back off", sorry I wasn't clear, I 
didn't mean VM_FAULT_FALLBACK/RETRY or something...yes, the work will be 
in vain and the MMU will refault...

Thinking about it, I am also not sure if most other code can deal with 
temporary pmd_none(). These don't necessarily take the PMD lock, 
because "why would they" right now.

See walk_pmd_range() as one example, if it spots pmd_none() it assumes 
"there really is nothing" without checking the PMD lock.

As a more concrete example, assume someone calls MADV_DONTNEED and we 
end up in zap_pmd_range(). If we assume "pmd_none() == really nothing" 
we'll skip that entry without getting the PMD lock involved. That 
would mean that you would be breaking MADV_DONTNEED if you managed to 
collapse or not -- memory would not get discarded.

This is a real problem with anonymous memory.

Yes, I thought of different locking fashions but the problem seems to be 
that any pagetable walker will choose an action path according to the value
it sees.

Unless I am missing something it's all very tricky and there might be 
a lot of such code that assumes "if I hold a mmap lock / VMA lock in 
read mode, pmd_none() means there is nothing even without holding the 
PMD lock when checking".

Yes, I was betting on the fact that, if the code assumes that pmd_none() 
means there is nothing, eventually when it will take the PMD lock to 
write to it, it will check whether
the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED 
thingy, thanks.

at least processes will be able to mmap() and do stuff with their VMAs,
and I would guess that this is worth optimizing...

It would certainly be interesting to get rid of the mmap lock in write 
mode here, but it's all rater tricky (and the code has rather nasty 
hidden implications).

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
     ...
     spun_unlock(pte);
} ...

result = __collapse_huge_page_copy(...);
pte_unmap(pte);

Deep in __collapse_huge_page_copy() we seem to re-rake the PTL lock.
No-split-spinlock confiogs might be problematic ...

Could you elaborate a little? I haven't read about the older config...

See above regarding CONFIG_SPLIT_PTE_PTLOCKS and friends.

I recall that for shmem that's "easier", because we don't have to
reinstall a PMD immediately, we cna be sure that the page table is
kept empty/unmodified, ...

All in all, the general solution seems to be that, if I can take all 
pagetable walkers into an invalid state and make them backoff, then I am 
safe.
For example, we do not zero out the PMD, we take the pte PTL, we do 
stuff, then instead of setting the PTEs to zero, we set it to a universal
invalid state upon which no pagetable walker can take an action; an 
instance of that can be to set the PTE to a swap entry such that the fault
handler ends up in do_swap_page() ->print_bad_pte(). So now we can take 
the PMD lock (in fact we don't need it since any pagetable walking
is rendered useless) and populate the PMD to resume the new pagetable 
walking. Another *ridiculous* idea may be to remember the PGD we
came from and nuke it (but I guess there is an alternate path for that 
in walk_pgd_range() and so on?)