On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether contiguous PTEs are
> >>> empty for an mTHP allocation. Instead of iterating the while loop
> >>> for every order, use some information from the previous iteration,
> >>> namely the first set PTE found, to eliminate some cases. The key to
> >>> understanding the correctness of the patch is that the ranges we
> >>> want to examine form a strictly decreasing sequence of nested
> >>> intervals.
> >>
> >> This is a lot more complicated. Do you have any numbers that
> >> indicate that it's faster? Yes, it's fewer memory references, but
> >> you've gone from a simple linear scan that's easy to prefetch to an
> >> exponential scan that might confuse the prefetchers.
> >
> > I do have some numbers. I tested with a simple program and also used
> > the ktime API; with the latter, enclosing everything from
> > "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the entire
> > while loop), a rough estimate is that without the patch it takes
> > 1700 ns to execute, and with the patch, on average, 80-100 ns less.
> > I cannot think of a good testing program...
> >
> > For the prefetching thingy, I am still doing a linear scan, and in
> > each iteration, with the patch, the range I am scanning is going to
> > lie strictly inside the range I would have scanned without the
> > patch. Won't the compiler and the CPU still do prefetching, just on
> > a smaller range? Where does the prefetcher get confused? I confess,
> > I do not understand this very well.
>
> A little history on this: my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhou suggested dropping it to keep things
> simple, which I did. Then at v8, DavidH suggested we could benefit from
> this sort of optimization, but we agreed to do it later as a separate
> change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine
> >> the "largest none range" around that address, having the largest
> >> suitable order in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhou requested simplifying
> > to this, IIRC. Perhaps this is an optimization opportunity for
> > later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code,
> so I pointed him to the history and suggested it would likely be
> worthwhile to send a patch.
>
> My view is that I don't see how this can harm performance. In the
> common case, when a single order is enabled, this is essentially the
> same as before; when multiple orders are enabled, we are now doing a
> single linear scan of the PTEs rather than multiple scans. There will
> likely be some stack accesses interleaved, but I'd be gobsmacked if the
> prefetchers can't tell the difference between the stack and other areas
> of memory.
>
> Perhaps some perf numbers would help. I think the simplest way to
> gather them would be a microbenchmark that allocates a large VMA,
> faults in single pages at a given stride (say, 1 every 128K), enables
> 1M, 512K, 256K, 128K and 64K mTHP, and then memsets the entire VMA.
> It's a bit contrived, but this patch will show an improvement if the
> scan is currently a significant portion of the page fault.
>
> If the proposed benchmark shows an improvement, and we don't see any
> regression when only enabling 64K, then my vote would be to accept the
> patch.

Agreed. The challenge now is how to benchmark this: on a system without
fragmentation we consistently succeed in allocating the largest size
(1MB), so pte_range_none() never has to fall back to the smaller orders.
We need an environment where allocations of various sizes fail in some
proportion, allowing pte_range_none() to fail at the larger sizes but
succeed at the smaller ones. It seems we can't micro-benchmark that with
a small program.
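
That said, the stride-faulting part of Ryan's proposal can be driven
from userspace, and the pre-faulted pages alone should make
pte_range_none() fail at the larger orders even without physical
fragmentation. A rough sketch of the harness I understand him to be
describing (assumptions on my part: the mTHP sizes are enabled
beforehand via the sysfs knobs, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled and friends,
and the sizes/stride are the ones from his mail -- treat this as a
starting point, not a finished tool):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE	(1024UL * 1024 * 1024)	/* 1G anonymous VMA */
#define STRIDE		(128UL * 1024)		/* pre-fault 1 page per 128K */

int main(void)
{
	struct timespec t0, t1;
	unsigned long off;
	char *buf;

	buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Fault in one page every STRIDE bytes so that, during the memset
	 * below, the larger orders fail the pte_range_none() check and
	 * each fault has to walk down through the enabled orders.
	 */
	for (off = 0; off < VMA_SIZE; off += STRIDE)
		buf[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 2, VMA_SIZE);	/* fault in everything else */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("memset: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	munmap(buf, VMA_SIZE);
	return 0;
}

Comparing the memset time before and after the patch, once with all
five sizes enabled and once with only 64K, should tell us whether the
scan shows up at all and whether the single-order case regresses.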
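
For anyone skimming the thread, the logic being optimized is, as far as
I can tell, captured by the toy model below: plain C over an int array
standing in for the PTEs, with best_order_naive() doing the current
one-scan-per-order walk and best_order_reuse() carrying the first set
PTE across iterations, as the patch does. The helper names and the
self-check in main() are mine, not the kernel's; the real code lives
around pte_range_none() in mm/memory.c.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NPTES	512			/* one toy "page table" */
static int pte[NPTES];			/* 0 == pte_none() */

/* Index of the first set PTE in [start, start + nr), or -1 if all none. */
static int first_set(int start, int nr)
{
	for (int i = start; i < start + nr; i++)
		if (pte[i])
			return i;
	return -1;
}

/* Current behaviour: scan the full aligned range for every order. */
static int best_order_naive(int addr, const int *orders, int n)
{
	for (int i = 0; i < n; i++) {
		int nr = 1 << orders[i];
		int start = addr & ~(nr - 1);	/* aligned range around addr */

		if (first_set(start, nr) < 0)
			return orders[i];
	}
	return 0;
}

/* Patched behaviour: remember the first set PTE across iterations. */
static int best_order_reuse(int addr, const int *orders, int n)
{
	int f = -1;			/* first set PTE found so far */

	for (int i = 0; i < n; i++) {
		int nr = 1 << orders[i];
		int start = addr & ~(nr - 1);

		if (f >= start + nr)	/* range inside the known-none span */
			return orders[i];
		if (f >= start)		/* known set PTE inside this range */
			continue;
		f = first_set(start, nr);
		if (f < 0)
			return orders[i];
	}
	return 0;
}

int main(void)
{
	const int orders[] = { 6, 5, 4, 3, 2 };	/* largest first */
	int n = sizeof(orders) / sizeof(orders[0]);

	srand(1);
	for (int trial = 0; trial < 1000; trial++) {
		for (int i = 0; i < NPTES; i++)
			pte[i] = (rand() % 8 == 0);	/* ~12% populated */
		for (int addr = 0; addr < NPTES; addr++)
			assert(best_order_naive(addr, orders, n) ==
			       best_order_reuse(addr, orders, n));
	}
	printf("naive and reuse scans agree\n");
	return 0;
}

Because each smaller order's range is nested inside the previous one, a
set PTE at index f either sits inside the next range (fail with no
rescan), or beyond its end (the range is already known to be empty), or
below its start, in which case only the narrower range is rescanned.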
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@xxxxxxxxxx/
>
> Thanks,
> Ryan