On 29.10.24 14:04, Kefeng Wang wrote:
That should all be cleaned up ... process_huge_page() likely shouldn't
be even consuming "nr_pages".
Yes, let's fix the bug first.
Not sure about this part; it uses nr_pages as the end and calculates
the 'base'.
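For context, the calculation in question is roughly the following (a
sketch of the current logic, not verbatim; it is only valid because
nr_pages is a power of two):

	/*
	 * Derive the base virtual address of the huge page from the
	 * caller-supplied hint; the mask assumes nr_pages is a power
	 * of two.
	 */
	unsigned long addr = addr_hint &
		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);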
It should be using folio_nr_pages().
But process_huge_page() doesn't take an explicit folio argument. I'd
like to move the aligned address calculation into folio_zero_user() and
copy_user_large_folio() (which I will rename to folio_copy_user()) in
the following cleanup patches, or should I do it in the fix patches?
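Something like the following shape, where the helper derives everything
from the folio itself (a hypothetical sketch of the proposed cleanup,
zeroing loop elided):

	void folio_zero_user(struct folio *folio, unsigned long addr_hint)
	{
		unsigned long nr_pages = folio_nr_pages(folio);
		/* Folio sizes are powers of two, so this mask is valid. */
		unsigned long base = addr_hint & ~(nr_pages * PAGE_SIZE - 1);

		/* ... zero nr_pages pages starting at base ... */
	}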
First, why does folio_zero_user() call process_huge_page() for *a small
folio*? Because we like our code to be extra complicated to understand?
Or am I missing something important?
folio_zero_user() was used for PMD-sized THP and HugeTLB before; after
anon mTHP support, it is used for order-2 to PMD-order THP and HugeTLB,
so it won't process a small folio, if I understand correctly.
And unfortunately neither the documentation nor the function name
expresses that :(
I'm happy to review any patches that improve the situation here :)
Actually, could we drop process_huge_page() entirely? In my test
case [1], process_huge_page() is no better than clearing/copying pages
from first to last, and sequential clearing/copying may be more
beneficial to hardware prefetching. Also, is there a way to have LKP run
tests to check the performance? Since process_huge_page() was submitted
by Ying, what's your opinion?
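That is, something as simple as the following (a hypothetical sketch;
folio_zero_sequential() is a made-up name, not an existing helper):

	/*
	 * Clear sub-pages front to back, which may be friendlier to
	 * hardware prefetchers than the two-ended ordering used by
	 * process_huge_page().
	 */
	static void folio_zero_sequential(struct folio *folio,
					  unsigned long base_addr)
	{
		long i, nr = folio_nr_pages(folio);

		for (i = 0; i < nr; i++) {
			cond_resched();	/* the folio may be gigantic */
			clear_user_highpage(folio_page(folio, i),
					    base_addr + i * PAGE_SIZE);
		}
	}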
I questioned that just recently [1], and Ying assumed that it still
applies [2]. c79b57e462b5 ("mm: hugetlb: clear target sub-page last
when clearing huge page") documents the scenario where this matters --
anon-w-seq, which you also run below.
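For anyone skimming: the idea from that commit is to clear the sub-pages
far from the faulting address first and the target sub-page last, so the
target is still cache-hot when userspace touches it. A simplified sketch
(the real code converges on the target in a left-right pattern;
clear_target_last() is a made-up name):

	static void clear_target_last(struct folio *folio,
				      unsigned long base_addr,
				      unsigned long addr_hint)
	{
		long nr = folio_nr_pages(folio);
		long target = (addr_hint - base_addr) / PAGE_SIZE;
		long i;

		/* Clear all other sub-pages first (ordering simplified). */
		for (i = 0; i < nr; i++) {
			if (i == target)
				continue;
			cond_resched();
			clear_user_highpage(folio_page(folio, i),
					    base_addr + i * PAGE_SIZE);
		}
		/* The target sub-page last, so it stays cache-hot. */
		clear_user_highpage(folio_page(folio, target),
				    base_addr + target * PAGE_SIZE);
	}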
If there is no performance benefit anymore, we should rip that out. But
likely we should check on multiple micro-architectures with multiple
#CPU configs that are relevant. c79b57e462b5 used a Xeon E5 v3 2699 with
72 processes on 2 NUMA nodes; maybe your test environment cannot
replicate that?
[1]
https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@xxxxxxxxxx/
[2]
https://lore.kernel.org/linux-mm/878quv9lhf.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[1] https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@xxxxxxxxxx/
case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP)
case-anon-w-seq-hugetlb (2M PMD HugeTLB)
But these are sequential, not random. I'd have thought access + zeroing
would be sequential either way. Did you run with random access as well?
--
Cheers,
David / dhildenb