On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether contiguous PTEs are
> >>> empty for an mTHP allocation. Instead of iterating the while loop
> >>> for every order, use some information from the previous iteration,
> >>> namely the first set PTE found, to eliminate some cases. The key to
> >>> understanding the correctness of the patch is that the ranges we
> >>> want to examine form a strictly decreasing sequence of nested
> >>> intervals.
> >>
> >> This is a lot more complicated. Do you have any numbers that
> >> indicate that it's faster? Yes, it's fewer memory references, but
> >> you've gone from a simple linear scan that's easy to prefetch to an
> >> exponential scan that might confuse the prefetchers.
> >
> > I do have some numbers. I tested with a simple program and also used
> > the ktime API; with the latter, enclosing everything from
> > "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the entire
> > while loop), a rough estimate is that without the patch it takes
> > 1700 ns to execute, and with the patch, on average, 80-100 ns less.
> > I cannot think of a good testing program...
> >
> > For the prefetching thingy, I am still doing a linear scan, and in
> > each iteration, with the patch, the range I am scanning is going to
> > lie strictly inside the range I would have scanned without the
> > patch. Won't the compiler and the CPU still do prefetching, just on
> > a smaller range? Where does the prefetcher get confused? I confess,
> > I do not understand this very well.
>
> A little history on this: my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhou suggested dropping it to keep things
> simple, which I did. Then at v8, DavidH suggested we could benefit from
> this sort of optimization, but we agreed to do it later as a separate
> change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine
> >> the "largest none range" around that address, having the largest
> >> suitable order in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhou requested simplifying
> > to this, IIRC. Perhaps this is an optimization opportunity for
> > later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code,
> so I pointed him to the history and suggested it would likely be
> worthwhile to send a patch.
>
> My view is that I don't see how this can harm performance. In the
> common case, when a single order is enabled, this is essentially the
> same as before; when multiple orders are enabled, we are now doing a
> single linear scan of the PTEs rather than multiple scans. There will
> likely be some stack accesses interleaved, but I'd be gobsmacked if the
> prefetchers can't tell the difference between the stack and other areas
> of memory.
>
> Perhaps some perf numbers would help. I think the simplest way to
> gather them would be a microbenchmark that allocates a large VMA,
> faults in single pages at a given stride (say, 1 every 128K), enables
> 1M, 512K, 256K, 128K and 64K mTHP, and then memsets the entire VMA.
> It's a bit contrived, but this patch will show an improvement if the
> scan is currently a significant portion of the page fault.
>
> If the proposed benchmark shows an improvement, and we don't see any
> regression when only enabling 64K, then my vote would be to accept the
> patch.

Agreed. The challenge now is how to benchmark this: on a system without
fragmentation we consistently succeed in allocating the largest size
(1MB), so pte_range_none() never has to fall back to the smaller orders.
We need an environment where allocations of various sizes fail in some
proportion, allowing pte_range_none() to fail at the larger sizes but
succeed at the smaller ones. It seems we can't micro-benchmark that with
a small program.
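
That said, the stride-faulting part of Ryan's proposal can be driven
from userspace, and the pre-faulted pages alone should make
pte_range_none() fail at the larger orders even without physical
fragmentation. A rough sketch of the harness I understand him to be
describing (assumptions on my part: the mTHP sizes are enabled
beforehand via the sysfs knobs, e.g.
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled and friends,
and the sizes/stride are the ones from his mail -- treat this as a
starting point, not a finished tool):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE	(1024UL * 1024 * 1024)	/* 1G anonymous VMA */
#define STRIDE		(128UL * 1024)		/* pre-fault 1 page per 128K */

int main(void)
{
	struct timespec t0, t1;
	unsigned long off;
	char *buf;

	buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Fault in one page every STRIDE bytes so that, during the memset
	 * below, the larger orders fail the pte_range_none() check and
	 * each fault has to walk down through the enabled orders.
	 */
	for (off = 0; off < VMA_SIZE; off += STRIDE)
		buf[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 2, VMA_SIZE);	/* fault in everything else */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("memset: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	munmap(buf, VMA_SIZE);
	return 0;
}

Comparing the memset time before and after the patch, once with all
five sizes enabled and once with only 64K, should tell us whether the
scan shows up at all and whether the single-order case regresses.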
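
For anyone skimming the thread, the logic being optimized is, as far as
I can tell, captured by the toy model below: plain C over an int array
standing in for the PTEs, with best_order_naive() doing the current
one-scan-per-order walk and best_order_reuse() carrying the first set
PTE across iterations, as the patch does. The helper names and the
self-check in main() are mine, not the kernel's; the real code lives
around pte_range_none() in mm/memory.c.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NPTES	512			/* one toy "page table" */
static int pte[NPTES];			/* 0 == pte_none() */

/* Index of the first set PTE in [start, start + nr), or -1 if all none. */
static int first_set(int start, int nr)
{
	for (int i = start; i < start + nr; i++)
		if (pte[i])
			return i;
	return -1;
}

/* Current behaviour: scan the full aligned range for every order. */
static int best_order_naive(int addr, const int *orders, int n)
{
	for (int i = 0; i < n; i++) {
		int nr = 1 << orders[i];
		int start = addr & ~(nr - 1);	/* aligned range around addr */

		if (first_set(start, nr) < 0)
			return orders[i];
	}
	return 0;
}

/* Patched behaviour: remember the first set PTE across iterations. */
static int best_order_reuse(int addr, const int *orders, int n)
{
	int f = -1;			/* first set PTE found so far */

	for (int i = 0; i < n; i++) {
		int nr = 1 << orders[i];
		int start = addr & ~(nr - 1);

		if (f >= start + nr)	/* range inside the known-none span */
			return orders[i];
		if (f >= start)		/* known set PTE inside this range */
			continue;
		f = first_set(start, nr);
		if (f < 0)
			return orders[i];
	}
	return 0;
}

int main(void)
{
	const int orders[] = { 6, 5, 4, 3, 2 };	/* largest first */
	int n = sizeof(orders) / sizeof(orders[0]);

	srand(1);
	for (int trial = 0; trial < 1000; trial++) {
		for (int i = 0; i < NPTES; i++)
			pte[i] = (rand() % 8 == 0);	/* ~12% populated */
		for (int addr = 0; addr < NPTES; addr++)
			assert(best_order_naive(addr, orders, n) ==
			       best_order_reuse(addr, orders, n));
	}
	printf("naive and reuse scans agree\n");
	return 0;
}

Because each smaller order's range is nested inside the previous one, a
set PTE at index f either sits inside the next range (fail with no
rescan), or beyond its end (the range is already known to be empty), or
below its start, in which case only the narrower range is rescanned.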
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@xxxxxxxxxx/
>
> Thanks,
> Ryan