Re: [PATCH] mm: Compute mTHP order efficiently

Dev Jain <dev.jain@xxxxxxx> · Tue, 17 Sep 2024 09:25:27 +0530

On 9/16/24 18:54, Matthew Wilcox wrote:
> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>> We use pte_range_none() to determine whether contiguous PTEs are empty
>> for an mTHP allocation. Instead of iterating the while loop for every
>> order, use some information, which is the first set PTE found, from the
>> previous iteration, to eliminate some cases. The key to understanding
>> the correctness of the patch is that the ranges we want to examine
>> form a strictly decreasing sequence of nested intervals.
> This is a lot more complicated.  Do you have any numbers that indicate
> that it's faster?  Yes, it's fewer memory references, but you've gone
> from a simple linear scan that's easy to prefetch to an exponential scan
> that might confuse the prefetchers.

I do have some numbers. I tested with a simple program and also timed
the code with the ktime API, enclosing the region from
"order = highest_order(orders)" to "pte_unmap(pte)" (that is, the entire
while loop). As a rough average, that region takes about 1700 ns without
the patch and 80 - 100 ns less with it. I cannot think of a good
testing program...

As for prefetching, I am still doing a linear scan, and in each
iteration the range I scan with the patch lies strictly inside the
range I would have scanned without it. Won't the compiler and the CPU
still prefetch, just over a smaller range? Where does the prefetcher
get confused? I confess I do not understand this very well.
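
To make the nested-range argument concrete, here is a minimal user-space
sketch. It is not the patch itself: the toy table, NR_PTES, and the names
first_set(), pick_order_simple(), pick_order_hinted() and check_same() are
all hypothetical, and every order from max_order down to 0 is treated as a
candidate, whereas the kernel walks an "orders" bitmask. It only illustrates
how the first set entry found in an enclosing range can rule a smaller
range in or out without rescanning it.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PTES 512	/* toy table size; a non-zero entry means "not none" */

/*
 * Index of the first non-zero entry in [start, start + nr), or -1 if the
 * whole range is empty. *reads counts how many entries were examined.
 */
static int first_set(const unsigned long *ptes, int start, int nr, long *reads)
{
	for (int i = start; i < start + nr; i++) {
		(*reads)++;
		if (ptes[i])
			return i;
	}
	return -1;
}

/* Baseline: rescan the full aligned range for every candidate order. */
static int pick_order_simple(const unsigned long *ptes, int fault_idx,
			     int max_order, long *reads)
{
	for (int order = max_order; order >= 0; order--) {
		int nr = 1 << order;
		int start = fault_idx & ~(nr - 1);

		if (first_set(ptes, start, nr, reads) < 0)
			return order;
	}
	return -1;
}

/* Hinted: reuse the first set entry found in an enclosing range. */
static int pick_order_hinted(const unsigned long *ptes, int fault_idx,
			     int max_order, long *reads)
{
	int hint = -1;	/* first set entry of the last range scanned */

	for (int order = max_order; order >= 0; order--) {
		int nr = 1 << order;
		int start = fault_idx & ~(nr - 1);
		int end = start + nr;

		if (hint >= start && hint < end)
			continue;	/* a known set entry lies in this range */
		if (hint >= end)
			return order;	/* range sits in the empty prefix before hint */

		/* No hint yet, or hint < start: nothing is known, so scan. */
		hint = first_set(ptes, start, nr, reads);
		if (hint < 0)
			return order;
	}
	return -1;
}

/* Run both variants on the same table and check that they agree. */
static int check_same(const unsigned long *ptes, int fault_idx, int max_order)
{
	long r_simple = 0, r_hinted = 0;
	int a = pick_order_simple(ptes, fault_idx, max_order, &r_simple);
	int b = pick_order_hinted(ptes, fault_idx, max_order, &r_hinted);

	assert(a == b);
	assert(r_hinted <= r_simple);
	return b;
}
```

In this sketch the hinted variant only ever scans ranges that the baseline
also scans, in the same linear direction, so it never reads more entries.
Whether that shows up as a measurable win on real hardware is exactly the
open question above. Callers are assumed to pass fault_idx < NR_PTES and
(1 << max_order) <= NR_PTES.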