On 17/09/2024 10:09, Barry Song wrote:
> On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 17/09/2024 09:44, Barry Song wrote:
>>> On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> On 17/09/2024 04:55, Dev Jain wrote:
>>>>>
>>>>> On 9/16/24 18:54, Matthew Wilcox wrote:
>>>>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>>>>>>> We use pte_range_none() to determine whether contiguous PTEs are empty
>>>>>>> for an mTHP allocation. Instead of iterating the while loop for every
>>>>>>> order, use some information from the previous iteration, namely the
>>>>>>> first set PTE found, to eliminate some cases. The key to understanding
>>>>>>> the correctness of the patch is that the ranges we want to examine
>>>>>>> form a strictly decreasing sequence of nested intervals.
>>>>>> This is a lot more complicated. Do you have any numbers that indicate
>>>>>> that it's faster? Yes, it's fewer memory references, but you've gone
>>>>>> from a simple linear scan that's easy to prefetch to an exponential scan
>>>>>> that might confuse the prefetchers.
>>>>>
>>>>> I do have some numbers. I tested with a simple program, and also used
>>>>> the ktime API; with the latter, enclosing everything from
>>>>> "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the entire
>>>>> while loop), a rough average estimate is that without the patch it takes
>>>>> 1700 ns to execute, and with the patch it takes around 80 - 100 ns less.
>>>>> I cannot think of a good testing program...
>>>>>
>>>>> On the prefetching point, I am still doing a linear scan, and in each
>>>>> iteration, with the patch, the range I am scanning is going to lie
>>>>> strictly inside the range I would have scanned without the patch. Won't
>>>>> the compiler and the CPU still do prefetching, just on a smaller range?
>>>>> Where does the prefetcher get confused? I confess, I do not understand
>>>>> this very well.
>>>>>
>>>>
>>>> A little history on this: my original "RFC v2" for mTHP included this
>>>> optimization [1], but Yu Zhou suggested dropping it to keep things
>>>> simple, which I did. Then at v8, DavidH suggested we could benefit from
>>>> this sort of optimization, but we agreed to do it later as a separate
>>>> change [2]:
>>>>
>>>> """
>>>>>> Comment: Likely it would make sense to scan only once and determine the
>>>>>> "largest none range" around that address, having the largest suitable
>>>>>> order in mind.
>>>>>
>>>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to
>>>>> this, IIRC. Perhaps this is an optimization opportunity for later?
>>>>
>>>> Yes, definitely.
>>>> """
>>>>
>>>> Dev independently discovered this opportunity while reading the code, so
>>>> I pointed him to the history and suggested it would likely be worthwhile
>>>> to send a patch.
>>>>
>>>> My view is that I don't see how this can harm performance: in the common
>>>> case, when a single order is enabled, this is essentially the same as
>>>> before. But when multiple orders are enabled, we are now doing a single
>>>> linear scan of the PTEs rather than multiple scans. There will likely be
>>>> some stack accesses interleaved, but I'd be gobsmacked if the prefetchers
>>>> can't tell the difference between the stack and other areas of memory.
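To make the idea concrete, here is a rough, self-contained userspace sketch
of the difference being discussed: the current loop in alloc_anon_folio()
calls pte_range_none() afresh for each enabled order, while the proposed
approach remembers the first non-none PTE it has seen and, because the
per-order ranges are nested, uses that single index to rule later orders in
or out without rescanning. Everything below (pte_present[], check_pte(),
scan_per_order(), scan_single_pass(), the array-of-orders interface) is
illustrative rather than the kernel's actual code, and the counter is only
there to mirror the entries-scanned arithmetic later in the thread.

#include <stdbool.h>
#include <stdio.h>

#define PTES_PER_PMD 512

/* Simulated PTE state: true = present, false = pte_none(). */
static bool pte_present[PTES_PER_PMD];
static int entries_checked;

static bool check_pte(int idx)
{
        entries_checked++;
        return pte_present[idx];
}

/* Current behaviour: rescan the aligned range from scratch for every order. */
static int scan_per_order(const int *orders, int norders, int fault_idx)
{
        for (int i = 0; i < norders; i++) {
                int nr = 1 << orders[i];
                int start = fault_idx & ~(nr - 1);      /* align down to this order */
                int j;

                for (j = start; j < start + nr; j++) {
                        if (check_pte(j))
                                break;
                }
                if (j == start + nr)
                        return orders[i];               /* whole range is pte_none() */
        }
        return -1;
}

/*
 * Sketch of the optimisation under discussion: remember the first present
 * PTE found so far. Because each smaller order's range is nested inside the
 * previous one, that single index can rule a range in or out without
 * rescanning it.
 */
static int scan_single_pass(const int *orders, int norders, int fault_idx)
{
        int first_set = -1;     /* index of the first present PTE seen so far */

        for (int i = 0; i < norders; i++) {
                int nr = 1 << orders[i];
                int start = fault_idx & ~(nr - 1);
                int j;

                if (first_set >= start) {
                        if (first_set < start + nr)
                                continue;               /* known to contain a present PTE */
                        return orders[i];               /* already proven all-none */
                }

                for (j = start; j < start + nr; j++) {
                        if (check_pte(j)) {
                                first_set = j;
                                break;
                        }
                }
                if (j == start + nr)
                        return orders[i];
        }
        return -1;
}

int main(void)
{
        const int orders[] = { 8, 7, 6, 5, 4 };         /* 1M, 512K, 256K, 128K, 64K */
        int fault_idx = 1;      /* fault near the start of the 1M-aligned range */
        int order;

        /* First present PTE 32 entries in, then one every 32 PTEs (128K stride). */
        for (int i = 32; i < PTES_PER_PMD; i += 32)
                pte_present[i] = true;

        entries_checked = 0;
        order = scan_per_order(orders, 5, fault_idx);
        printf("per-order rescan: order %d, %d PTEs checked\n", order, entries_checked);

        entries_checked = 0;
        order = scan_single_pass(orders, 5, fault_idx);
        printf("single-pass scan: order %d, %d PTEs checked\n", order, entries_checked);
        return 0;
}

With this pre-fault pattern (first present PTE 32 entries into the 1M-aligned
range), the per-order rescan checks 131 PTEs and the single pass checks 33,
both settling on the 128K order.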
>>>>
>>>> Perhaps some perf numbers would help; I think the simplest way to gather
>>>> some would be to create a microbenchmark that allocates a large VMA, then
>>>> faults in single pages at a given stride (say, 1 every 128K), then enables
>>>> 1M, 512K, 256K, 128K and 64K mTHP, then memsets the entire VMA. It's a bit
>>>> contrived, but this patch will show an improvement if the scan is
>>>> currently a significant portion of the page fault.
>>>>
>>>> If the proposed benchmark shows an improvement, and we don't see any
>>>> regression when only enabling 64K, then my vote would be to accept the
>>>> patch.
>>>
>>> Agreed. The challenge now is how to benchmark this. In a system without
>>> fragmentation, we consistently succeed in allocating the largest size
>>> (1MB). Therefore, we need an environment where allocations of various
>>> sizes can fail proportionally, allowing pte_range_none() to fail on
>>> larger sizes but succeed on smaller ones.
>>
>> I don't think this is about allocation failure? It's about finding a folio
>> order that fits into the VMA without overlapping any already non-none PTEs.
>>
>>>
>>> It seems we can't micro-benchmark this with a small program.
>>
>> My proposal was to deliberately fault in a single (4K) page every 128K.
>> That will cause the scanning logic to reduce the order to the next lowest
>> enabled order and try again. So with the current code, for each of the
>> orders {1M, 512K, 256K, 128K} you would scan the first 128K of PTEs (32
>> entries), then for 64K you would scan 16 entries, i.e. 4 * 32 + 16 = 144
>> entries in total. With the new change, you would just scan 32 entries.
>
> I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, and even if you
> fault in a single 4KB page every 128KB with 1MB enabled, you'd still succeed
> in allocating 1MB. For the range 1GB+128KB to 1GB+1MB, wouldn't there be
> no page faults? So, you'd still end up scanning 256 entries (1MB/4KB)?

Sorry, I'm not following this. The sequence I have in mind is:

- start with all mTHP orders disabled.
- mmap a 1G region, which is 1G aligned.
- write a single byte every 128K throughout the VMA.
  - causes one 4K page to be mapped every 32 pages;
  - 1x4K-present, 31x4K-none, 1x4K-present, 31x4K-none, ...
- enable mTHP orders {1M, 512K, 256K, 128K, 64K, 32K, 16K}
- madvise(MADV_HUGEPAGE)
- write a single byte every 4K throughout the VMA.
  - causes the biggest possible mTHP orders to be allocated in the 31x4K holes
  - 4x4K, 1x16K, 1x32K, 1x64K, 4x4K, 1x16K, 1x32K, 1x64K

Perhaps I didn't make it clear that mTHP would be disabled during the 4K
"pre-faulting" phase, then enabled for the "post-faulting" phase? (A rough
sketch of this sequence as a small program is appended at the end of this
mail.)

>
> For the entire 4MB VMA, we only have 4 page faults? For each page, we scan
> 256 entries, and there's no way to scan the next (smaller) order, like
> 512KB, if 1MB has succeeded?
>
> My point is that we need to make the 1MB allocation fail in order to disrupt
> the continuity of pte_none(); otherwise, pte_range_none() will return true
> for the largest order.

But we can simulate that by putting single 4K entries strategically in the
pgtable.

>
>>
>> Although now that I've actually written that down, it doesn't feel like a
>> very big win.
>> Perhaps Dev can come up with an even more contrived single-page
>> pre-allocation pattern that will maximise the number of PTEs we hit with
>> the current code, and minimise it with the new code :)
>>
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@xxxxxxx/
>>>> [2]
>>>> https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@xxxxxxxxxx/
>>>>
>>>> Thanks,
>>>> Ryan
>>
>
> Thanks
> Barry
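As an appendix, here is a rough sketch of the pre-fault/post-fault sequence
described above, in case it is a useful starting point. To be clear about the
assumptions: the mTHP orders are expected to be toggled via the
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled knobs while the
program waits at the prompt, the 1G region and 128K stride are simply the
values from this thread, nothing below enforces the 1G alignment mentioned
earlier, and timing only the second phase is a guess at the interesting
number.

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE        (1UL << 30)     /* 1G region, as in the thread */
#define PREFAULT_STRIDE (128UL << 10)   /* one 4K page faulted in every 128K */
#define PAGE_SIZE_4K    4096UL

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        /*
         * Note: plain mmap() does not guarantee the 1G alignment mentioned
         * above; a real benchmark would over-allocate and align manually.
         */
        char *buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /*
         * Phase 1 (all mTHP orders disabled via sysfs beforehand): fault in
         * one 4K page every 128K, leaving 31-page holes of pte_none().
         */
        for (size_t off = 0; off < VMA_SIZE; off += PREFAULT_STRIDE)
                buf[off] = 1;

        /* Phase 2: enable the mTHP orders {1M..16K} in sysfs, then continue. */
        printf("flip the mTHP sysfs knobs now, then press Enter...\n");
        getchar();

        madvise(buf, VMA_SIZE, MADV_HUGEPAGE);

        /* Touch every 4K page; each fault scans PTEs to pick an mTHP order. */
        double t0 = now_sec();
        for (size_t off = 0; off < VMA_SIZE; off += PAGE_SIZE_4K)
                buf[off] = 2;
        double t1 = now_sec();

        printf("post-fault phase took %.3f s\n", t1 - t0);
        munmap(buf, VMA_SIZE);
        return 0;
}

Running it once with only 64K enabled and once with the full set of orders
would also cover the "no regression when only enabling 64K" case mentioned
above.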