Re: [RFC 00/11] khugepaged: mTHP support

Dev Jain <dev.jain@xxxxxxx> · Fri, 24 Jan 2025 13:08:11 +0530

On 24/01/25 12:43 pm, Dev Jain wrote:

On 24/01/25 1:54 am, Nico Pache wrote:
On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@xxxxxxx> wrote:

--- snip ---

Althogh to be honest, it's not super clear to me what the benefit 
of the bitmap
is vs just iterating through the PTEs like Dev does; is there a 
significant cost
saving in practice? On the face of it, it seems like it might be 
uneeded complexity.
The bitmap was to encode the state of PMD without needing rescanning
(or refactor a lot of code). We keep the scan runtime constant at 512
(for x86). Dev did some good analysis for this here
https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b- 
aba4b1a441b4@xxxxxxx/

I think I swayed away and over-analyzed, and probably did not make my
main objection clear enough, so let us cut to the chase.
*Why* is it correct to remember the state of the PMD?

In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
tunables again, since we dropped the lock. The bitmap thingy which you
are doing, and in general, any algorithm which tries to remember the
state of the PMD, violates the entire point of max_ptes_*. Take for
example: Suppose the PTE table had a lot of shared ptes. After you drop
the PTL, you do this: scan_bitmap() -> read_unlock() ->
alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
per your recommendation I dropped the read_lock() -> read_unlock() and
made it a conditional unlock

That's not the one I was talking about here...

of stuff. Now, you do write_lock(), which means that you need to wait
for all faulting/forking/mremap/mmap etc to stop. Suppose this process
forks and then a lot of PTEs become shared. The point of max_ptes_shared
is to stop the collapse here, since we do not want memory bloat
(collapse will grab more memory from the buddy and the old memory won't
be freed because it has a reference from the parent/child).

That's a fair point, but given the other feedback, my current
implementation now requires mTHPs to have no shared/swap, and ive
improved the sysctl interactions for the set_bitmap and the
max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen 
for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping 
mTHP if max_ptes_none != 0 and 511. In any case, we need to scan the 
range to check whether we have at least one filled /all filled ptes, and 
none of them are shared and swap.

As for *why* remembering the state is correct. It just prevents
needing to rescan.

That is what I am saying...if collapse_huge_page() fails, then you have 
dropped the mmap write lock, so now the state of the PTEs may have 
changed, so you must rescan...

Another example would be, a sysadmin does not want too much memory
wastage from khugepaged, so we decide to set max_ptes_none low. When you
scan the PTE table you justify the collapse. After you drop the PTL and
the mmap_lock, a munmap() happens in the region, no longer justifying
the collapse. If you have a lot of VMAs of size <= 2MB, then any
munmap() on a VMA will happen on the single PTE table present.

So, IMHO before even jumping on analyzing the bitmap algorithm, we need
to ask whether any algorithm remembering the state of the PMD is even
conceptually right.

Both the issues you raised dont really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

they are fair points, but they are more of a criticism of my sysctl
handling. Ive cleaned up the max_ptes_none interactions, and now that
we dont plan to initially support swap/shared both these problems are
'gone'.

Then, you have the harder task of proving that your optimization is
actually an optimization, that it is not turned into being futile
because of overhead. From a high-level mathematical PoV, you are saving
iterations. Any mathematical analysis has the underlying assumption that
every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
order)] is virtually and physically contiguous in memory so prefetching
helps us. You are trying to save on pte memory references, but then look
at the number of bitmap memory references you have created, not to
mention that you are doing a (costly?) division operation in there, you
have a while loop, a stack, new structs, and if conditions. I do not see
how this is any faster than a naive linear scan.

Yeah it's hard to say without real performance testing. I hope to
include some performance results with my next post.

This prevents needing to hold the read lock for longer, and prevents
needing to reacquire it too.

My implementation does not hold the read lock for longer. What you mean
to say is, I need to reacquire the lock, and this is by design, to
yes sorry.
ensure correctness, which boils down to what I wrote above.
The write lock is what ensures correctness, not the read lock. The
read lock is to gain insight of potential collapse candidates while
avoiding the cost of the write lock.

Cheers!
-- Nico