On 17/09/2024 10:09, Barry Song wrote:
> On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 17/09/2024 09:44, Barry Song wrote:
>>> On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> On 17/09/2024 04:55, Dev Jain wrote:
>>>>>
>>>>> On 9/16/24 18:54, Matthew Wilcox wrote:
>>>>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>>>>>>> We use pte_range_none() to determine whether contiguous PTEs are empty
>>>>>>> for an mTHP allocation. Instead of iterating the while loop for every
>>>>>>> order, use some information from the previous iteration, namely the
>>>>>>> first set PTE found, to eliminate some cases. The key to understanding
>>>>>>> the correctness of the patch is that the ranges we want to examine
>>>>>>> form a strictly decreasing sequence of nested intervals.
>>>>>> This is a lot more complicated. Do you have any numbers that indicate
>>>>>> that it's faster? Yes, it's fewer memory references, but you've gone
>>>>>> from a simple linear scan that's easy to prefetch to an exponential scan
>>>>>> that might confuse the prefetchers.
>>>>>
>>>>> I do have some numbers. I tested with a simple program, and also used
>>>>> the ktime API; with the latter, enclosing everything from
>>>>> "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the entire
>>>>> while loop), a rough average estimate is that without the patch it takes
>>>>> 1700 ns to execute, and with the patch it takes around 80 - 100 ns less.
>>>>> I cannot think of a good testing program...
>>>>>
>>>>> On the prefetching point, I am still doing a linear scan, and in each
>>>>> iteration, with the patch, the range I am scanning is going to lie
>>>>> strictly inside the range I would have scanned without the patch. Won't
>>>>> the compiler and the CPU still do prefetching, just on a smaller range?
>>>>> Where does the prefetcher get confused? I confess, I do not understand
>>>>> this very well.
>>>>>
>>>>
>>>> A little history on this: my original "RFC v2" for mTHP included this
>>>> optimization [1], but Yu Zhou suggested dropping it to keep things
>>>> simple, which I did. Then at v8, DavidH suggested we could benefit from
>>>> this sort of optimization, but we agreed to do it later as a separate
>>>> change [2]:
>>>>
>>>> """
>>>>>> Comment: Likely it would make sense to scan only once and determine the
>>>>>> "largest none range" around that address, having the largest suitable
>>>>>> order in mind.
>>>>>
>>>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to
>>>>> this, IIRC. Perhaps this is an optimization opportunity for later?
>>>>
>>>> Yes, definitely.
>>>> """
>>>>
>>>> Dev independently discovered this opportunity while reading the code, so
>>>> I pointed him to the history and suggested it would likely be worthwhile
>>>> to send a patch.
>>>>
>>>> My view is that I don't see how this can harm performance: in the common
>>>> case, when a single order is enabled, this is essentially the same as
>>>> before. But when multiple orders are enabled, we are now doing a single
>>>> linear scan of the PTEs rather than multiple scans. There will likely be
>>>> some stack accesses interleaved, but I'd be gobsmacked if the prefetchers
>>>> can't tell the difference between the stack and other areas of memory.
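To make the idea concrete, here is a rough, self-contained userspace sketch
of the difference being discussed: the current loop in alloc_anon_folio()
calls pte_range_none() afresh for each enabled order, while the proposed
approach remembers the first non-none PTE it has seen and, because the
per-order ranges are nested, uses that single index to rule later orders in
or out without rescanning. Everything below (pte_present[], check_pte(),
scan_per_order(), scan_single_pass(), the array-of-orders interface) is
illustrative rather than the kernel's actual code, and the counter is only
there to mirror the entries-scanned arithmetic later in the thread.

#include <stdbool.h>
#include <stdio.h>

#define PTES_PER_PMD 512

/* Simulated PTE state: true = present, false = pte_none(). */
static bool pte_present[PTES_PER_PMD];
static int entries_checked;

static bool check_pte(int idx)
{
        entries_checked++;
        return pte_present[idx];
}

/* Current behaviour: rescan the aligned range from scratch for every order. */
static int scan_per_order(const int *orders, int norders, int fault_idx)
{
        for (int i = 0; i < norders; i++) {
                int nr = 1 << orders[i];
                int start = fault_idx & ~(nr - 1);      /* align down to this order */
                int j;

                for (j = start; j < start + nr; j++) {
                        if (check_pte(j))
                                break;
                }
                if (j == start + nr)
                        return orders[i];               /* whole range is pte_none() */
        }
        return -1;
}

/*
 * Sketch of the optimisation under discussion: remember the first present
 * PTE found so far. Because each smaller order's range is nested inside the
 * previous one, that single index can rule a range in or out without
 * rescanning it.
 */
static int scan_single_pass(const int *orders, int norders, int fault_idx)
{
        int first_set = -1;     /* index of the first present PTE seen so far */

        for (int i = 0; i < norders; i++) {
                int nr = 1 << orders[i];
                int start = fault_idx & ~(nr - 1);
                int j;

                if (first_set >= start) {
                        if (first_set < start + nr)
                                continue;               /* known to contain a present PTE */
                        return orders[i];               /* already proven all-none */
                }

                for (j = start; j < start + nr; j++) {
                        if (check_pte(j)) {
                                first_set = j;
                                break;
                        }
                }
                if (j == start + nr)
                        return orders[i];
        }
        return -1;
}

int main(void)
{
        const int orders[] = { 8, 7, 6, 5, 4 };         /* 1M, 512K, 256K, 128K, 64K */
        int fault_idx = 1;      /* fault near the start of the 1M-aligned range */
        int order;

        /* First present PTE 32 entries in, then one every 32 PTEs (128K stride). */
        for (int i = 32; i < PTES_PER_PMD; i += 32)
                pte_present[i] = true;

        entries_checked = 0;
        order = scan_per_order(orders, 5, fault_idx);
        printf("per-order rescan: order %d, %d PTEs checked\n", order, entries_checked);

        entries_checked = 0;
        order = scan_single_pass(orders, 5, fault_idx);
        printf("single-pass scan: order %d, %d PTEs checked\n", order, entries_checked);
        return 0;
}

With this pre-fault pattern (first present PTE 32 entries into the 1M-aligned
range), the per-order rescan checks 131 PTEs and the single pass checks 33,
both settling on the 128K order.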
>>>>
>>>> Perhaps some perf numbers would help; I think the simplest way to gather
>>>> some would be to create a microbenchmark that allocates a large VMA, then
>>>> faults in single pages at a given stride (say, 1 every 128K), then enables
>>>> 1M, 512K, 256K, 128K and 64K mTHP, then memsets the entire VMA. It's a bit
>>>> contrived, but this patch will show an improvement if the scan is
>>>> currently a significant portion of the page fault.
>>>>
>>>> If the proposed benchmark shows an improvement, and we don't see any
>>>> regression when only enabling 64K, then my vote would be to accept the
>>>> patch.
>>>
>>> Agreed. The challenge now is how to benchmark this. In a system without
>>> fragmentation, we consistently succeed in allocating the largest size
>>> (1MB). Therefore, we need an environment where allocations of various
>>> sizes can fail proportionally, allowing pte_range_none() to fail on
>>> larger sizes but succeed on smaller ones.
>>
>> I don't think this is about allocation failure? It's about finding a folio
>> order that fits into the VMA without overlapping any already non-none PTEs.
>>
>>>
>>> It seems we can't micro-benchmark this with a small program.
>>
>> My proposal was to deliberately fault in a single (4K) page every 128K.
>> That will cause the scanning logic to reduce the order to the next lowest
>> enabled order and try again. So with the current code, for each of the
>> orders {1M, 512K, 256K, 128K} you would scan the first 128K of PTEs (32
>> entries), then for 64K you would scan 16 entries, i.e. 4 * 32 + 16 = 144
>> entries in total. With the new change, you would just scan 32 entries.
>
> I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, and even if you
> fault in a single 4KB page every 128KB with 1MB enabled, you'd still succeed
> in allocating 1MB. For the range 1GB+128KB to 1GB+1MB, wouldn't there be
> no page faults? So, you'd still end up scanning 256 entries (1MB/4KB)?

Sorry, I'm not following this. The sequence I have in mind is:

- start with all mTHP orders disabled.
- mmap a 1G region, which is 1G aligned.
- write a single byte every 128K throughout the VMA.
  - causes one 4K page to be mapped every 32 pages;
  - 1x4K-present, 31x4K-none, 1x4K-present, 31x4K-none, ...
- enable mTHP orders {1M, 512K, 256K, 128K, 64K, 32K, 16K}
- madvise(MADV_HUGEPAGE)
- write a single byte every 4K throughout the VMA.
  - causes the biggest possible mTHP orders to be allocated in the 31x4K holes
  - 4x4K, 1x16K, 1x32K, 1x64K, 4x4K, 1x16K, 1x32K, 1x64K

Perhaps I didn't make it clear that mTHP would be disabled during the 4K
"pre-faulting" phase, then enabled for the "post-faulting" phase? (A rough
sketch of this sequence as a small program is appended at the end of this
mail.)

>
> For the entire 4MB VMA, we only have 4 page faults? For each page, we scan
> 256 entries, and there's no way to scan the next (smaller) order, like
> 512KB, if 1MB has succeeded?
>
> My point is that we need to make the 1MB allocation fail in order to disrupt
> the continuity of pte_none(); otherwise, pte_range_none() will return true
> for the largest order.

But we can simulate that by putting single 4K entries strategically in the
pgtable.

>
>>
>> Although now that I've actually written that down, it doesn't feel like a
>> very big win.
>> Perhaps Dev can come up with an even more contrived single-page
>> pre-allocation pattern that will maximise the number of PTEs we hit with
>> the current code, and minimise it with the new code :)
>>
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@xxxxxxx/
>>>> [2]
>>>> https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@xxxxxxxxxx/
>>>>
>>>> Thanks,
>>>> Ryan
>>
>
> Thanks
> Barry
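As an appendix, here is a rough sketch of the pre-fault/post-fault sequence
described above, in case it is a useful starting point. To be clear about the
assumptions: the mTHP orders are expected to be toggled via the
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled knobs while the
program waits at the prompt, the 1G region and 128K stride are simply the
values from this thread, nothing below enforces the 1G alignment mentioned
earlier, and timing only the second phase is a guess at the interesting
number.

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE        (1UL << 30)     /* 1G region, as in the thread */
#define PREFAULT_STRIDE (128UL << 10)   /* one 4K page faulted in every 128K */
#define PAGE_SIZE_4K    4096UL

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        /*
         * Note: plain mmap() does not guarantee the 1G alignment mentioned
         * above; a real benchmark would over-allocate and align manually.
         */
        char *buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /*
         * Phase 1 (all mTHP orders disabled via sysfs beforehand): fault in
         * one 4K page every 128K, leaving 31-page holes of pte_none().
         */
        for (size_t off = 0; off < VMA_SIZE; off += PREFAULT_STRIDE)
                buf[off] = 1;

        /* Phase 2: enable the mTHP orders {1M..16K} in sysfs, then continue. */
        printf("flip the mTHP sysfs knobs now, then press Enter...\n");
        getchar();

        madvise(buf, VMA_SIZE, MADV_HUGEPAGE);

        /* Touch every 4K page; each fault scans PTEs to pick an mTHP order. */
        double t0 = now_sec();
        for (size_t off = 0; off < VMA_SIZE; off += PAGE_SIZE_4K)
                buf[off] = 2;
        double t1 = now_sec();

        printf("post-fault phase took %.3f s\n", t1 - t0);
        munmap(buf, VMA_SIZE);
        return 0;
}

Running it once with only 64K enabled and once with the full set of orders
would also cover the "no regression when only enabling 64K" case mentioned
above.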