Re: [RFC 00/11] khugepaged: mTHP support

--- snip ---

>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
> The bitmap was to encode the state of PMD without needing rescanning
> (or refactor a lot of code). We keep the scan runtime constant at 512
> (for x86). Dev did some good analysis for this here
> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@xxxxxxx/

I think I strayed off topic and over-analyzed, and probably did not make my main objection clear enough, so let us cut to the chase.
*Why* is it correct to remember the state of the PMD?

In __collapse_huge_page_isolate(), we check the PTEs against the sysfs tunables again, since we dropped the lock. The bitmap approach, and in general any algorithm which tries to remember the state of the PMD, violates the entire point of max_ptes_*.

Take for example: suppose the PTE table had a lot of shared PTEs. After you drop the PTL, you do this: scan_bitmap() -> read_unlock() -> alloc_charge_folio() -> read_lock() -> read_unlock() ..., which is a long window. Now you do write_lock(), which means that you need to wait for all faulting/forking/mremap/mmap etc. to stop. Suppose this process forks in the meantime and a lot of PTEs become shared. The point of max_ptes_shared is to stop the collapse here, since we do not want memory bloat (collapse will grab more memory from the buddy and the old memory won't be freed because it has a reference from the parent/child).

Another example: a sysadmin does not want too much memory wastage from khugepaged, so they set max_ptes_none low. When you scan the PTE table, the scan justifies the collapse. After you drop the PTL and the mmap_lock, a munmap() happens in the region, and the collapse is no longer justified. If you have a lot of VMAs of size <= 2MB, then any munmap() on such a VMA will land on the single PTE table present.
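
To make the objection concrete, here is a toy userspace model of the two races described above. It is only a sketch, not kernel code: the toy_* names, the three-state PTE encoding, and the limit values used with it are all made up for illustration. The only thing taken from the discussion is the idea that the table is checked against max_ptes_none and max_ptes_shared, and that it can change between the scan and the collapse.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PTES 512  /* entries per PTE table on x86-64 with 4K pages */

/* Hypothetical per-entry state; the kernel derives this from the real PTE. */
enum toy_pte { TOY_NONE, TOY_PRIVATE, TOY_SHARED };

/* Hypothetical stand-ins for the khugepaged sysfs tunables. */
struct toy_limits {
    size_t max_ptes_none;
    size_t max_ptes_shared;
};

/* Linear scan: does the table, as it looks right now, justify a collapse? */
static bool toy_collapse_allowed(const enum toy_pte *table,
                                 const struct toy_limits *lim)
{
    size_t none = 0, shared = 0;

    for (size_t i = 0; i < TOY_NR_PTES; i++) {
        if (table[i] == TOY_NONE)
            none++;
        else if (table[i] == TOY_SHARED)
            shared++;
    }
    return none <= lim->max_ptes_none && shared <= lim->max_ptes_shared;
}

/* Models munmap() of [start, start + len) while no lock is held. */
static void toy_munmap(enum toy_pte *table, size_t start, size_t len)
{
    for (size_t i = start; i < start + len && i < TOY_NR_PTES; i++)
        table[i] = TOY_NONE;
}

/* Models fork(): every private entry becomes shared with the child. */
static void toy_fork(enum toy_pte *table)
{
    for (size_t i = 0; i < TOY_NR_PTES; i++)
        if (table[i] == TOY_PRIVATE)
            table[i] = TOY_SHARED;
}
```

If you call toy_collapse_allowed() on a fully private table, remember the answer, and then run toy_munmap() or toy_fork() before acting on it, the remembered answer says "collapse" while a fresh call says "do not collapse". That gap is exactly what the re-check after retaking the lock is there to close.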

So, IMHO before even jumping on analyzing the bitmap algorithm, we need to ask whether any algorithm remembering the state of the PMD is even conceptually right.

Then, you have the harder task of proving that your optimization is actually an optimization, and that its overhead does not make it futile. From a high-level mathematical PoV, you are saving iterations. But any such analysis assumes that every iteration costs the same. The list [pte, pte + 1, ..., pte + (1 << order)] is virtually and physically contiguous in memory, so prefetching helps us. You are trying to save on PTE memory references, but look at the number of bitmap memory references you have created. On top of that, you are doing a (costly?) division operation in there, and you have a while loop, a stack, new structs, and if conditions. I do not see how this is any faster than a naive linear scan.
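
For reference, this is roughly all a naive linear scan has to do, written as a userspace sketch. The lin_* names and the bool-per-entry encoding are hypothetical, and how the per-range limit max_none is derived for a given order is left to the caller; this is not the code from either series. The caller is assumed to pass an order no larger than 9, so that one range fits in the 512-entry table.

```c
#include <stdbool.h>
#include <stddef.h>

#define LIN_NR_PTES 512  /* entries in one PTE table (x86-64, 4K pages) */

/*
 * present[i] is true for a populated entry and false for a hole
 * (hypothetical encoding). Walks one contiguous range of 1 << order
 * entries and bails out as soon as the hole budget is exceeded.
 */
static bool lin_range_ok(const bool *present, size_t start,
                         unsigned int order, size_t max_none)
{
    size_t none = 0;
    size_t end = start + ((size_t)1 << order);

    for (size_t i = start; i < end; i++) {
        if (!present[i] && ++none > max_none)
            return false;
    }
    return true;
}

/*
 * Returns the start index of the first order-aligned range that passes
 * the check, or LIN_NR_PTES if no range qualifies.
 */
static size_t lin_find_range(const bool *present, unsigned int order,
                             size_t max_none)
{
    size_t step = (size_t)1 << order;

    for (size_t start = 0; start + step <= LIN_NR_PTES; start += step) {
        if (lin_range_ok(present, start, order, max_none))
            return start;
    }
    return LIN_NR_PTES;
}
```

The inner loop is a sequential walk over adjacent memory with one compare and one counter, which is the access pattern prefetching handles well. Any bookkeeping structure layered on top has to beat that, not just reduce the iteration count on paper.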

> This prevents needing to hold the read lock for longer, and prevents
> needing to reacquire it too.

My implementation does not hold the read lock for longer. What you mean to say is, I need to reacquire the lock, and this is by design, to ensure correctness, which boils down to what I wrote above.




