Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes: > On 2024/10/30 9:04, Huang, Ying wrote: >> David Hildenbrand <david@xxxxxxxxxx> writes: >> >>> On 29.10.24 14:04, Kefeng Wang wrote: >>>>>>>>>>> >>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely >>>>>>>>>>> shouldn't >>>>>>>>>> >>>>>>>>>> Yes, let's fix the bug firstly, >>>>>>>>>> >>>>>>>>>>> be even consuming "nr_pages". >>>>>>>>>> >>>>>>>>>> No sure about this part, it uses nr_pages as the end and calculate >>>>>>>>>> the >>>>>>>>>> 'base'. >>>>>>>>> >>>>>>>>> It should be using folio_nr_pages(). >>>>>>>> >>>>>>>> But process_huge_page() without an explicit folio argument, I'd like to >>>>>>>> move the aligned address calculate into the folio_zero_user and >>>>>>>> copy_user_large_folio(will rename it to folio_copy_user()) in the >>>>>>>> following cleanup patches, or do it in the fix patches? >>>>>>> >>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small >>>>>>> folio*? Because we like or code to be extra complicated to understand? >>>>>>> Or am I missing something important? >>>>>> >>>>>> The folio_zero_user() used for PMD-sized THP and HugeTLB before, and >>>>>> after anon mTHP supported, it is used for order-2~order-PMD-order THP >>>>>> and HugeTLB, so it won't process a small folio if I understand correctly. >>>>> >>>>> And unfortunately neither the documentation nor the function name >>>>> expresses that :( >>>>> >>>>> I'm happy to review any patches that improve the situation here :) >>>>> >>>> Actually, could we drop the process_huge_page() totally, from my >>>> testcase[1], process_huge_page() is not better than clear/copy page >>>> from start to last, and sequential clearing/copying maybe more >>>> beneficial to the hardware prefetching, and is there a way to let lkp >>>> to test to check the performance, since the process_huge_page() >>>> was submitted by Ying, what's your opinion? >> I don't think that it's a good idea to revert the commit without >> studying and root causing the issues. I can work together with you on >> that. If we have solid and well explained data to prove >> process_huge_page() isn't benefitial, we can revert the commit. > > > Take 'fallocate 20G' as an example, before > > Performance counter stats for 'taskset -c 10 fallocate -l 20G > /mnt/hugetlbfs/test': IIUC, fallocate will zero pages, but will not touch them at all, right? If so, no cache benefit from clearing referenced page last. > 3,118.94 msec task-clock # 0.999 CPUs > utilized > 30 context-switches # 0.010 K/sec > 1 cpu-migrations # 0.000 K/sec > 136 page-faults # 0.044 K/sec > 8,092,075,873 cycles # > 2.594 GHz (92.82%) > 1,624,587,663 instructions # 0.20 insn per > cycle (92.83%) > 395,341,850 branches # 126.755 M/sec > (92.82%) > 3,872,302 branch-misses # 0.98% of all > branches (92.83%) > 1,398,066,701 L1-dcache-loads # 448.251 M/sec > (92.82%) > 58,124,626 L1-dcache-load-misses # 4.16% of all > L1-dcache accesses (92.82%) > 1,032,527 LLC-loads # 0.331 M/sec > (92.82%) > 498,684 LLC-load-misses # 48.30% of all > LL-cache accesses (92.84%) > 473,689,004 L1-icache-loads # 151.875 M/sec > (92.82%) > 356,721 L1-icache-load-misses # 0.08% of all > L1-icache accesses (92.85%) > 1,947,644,987 dTLB-loads # 624.458 M/sec > (92.95%) > 10,185 dTLB-load-misses # 0.00% of all > dTLB cache accesses (92.96%) > 474,622,896 iTLB-loads # 152.174 M/sec > (92.95%) > 94 iTLB-load-misses # 0.00% of all > iTLB cache accesses (85.69%) > > 3.122844830 seconds time elapsed > > 0.000000000 seconds user > 3.107259000 seconds sys > > and after(clear from start to end) > > Performance counter stats for 'taskset -c 10 fallocate -l 20G > /mnt/hugetlbfs/test': > > 1,135.53 msec task-clock # 0.999 CPUs > utilized > 10 context-switches # 0.009 K/sec > 1 cpu-migrations # 0.001 K/sec > 137 page-faults # 0.121 K/sec > 2,946,673,587 cycles # > 2.595 GHz (92.67%) > 1,620,704,205 instructions # 0.55 insn per > cycle (92.61%) > 394,595,772 branches # 347.499 M/sec > (92.60%) > 130,756 branch-misses # 0.03% of all > branches (92.84%) > 1,396,726,689 L1-dcache-loads # 1230.022 M/sec > (92.96%) > 338,344 L1-dcache-load-misses # 0.02% of all > L1-dcache accesses (92.95%) > 111,737 LLC-loads # 0.098 M/sec > (92.96%) > 67,486 LLC-load-misses # 60.40% of all > LL-cache accesses (92.96%) > 418,198,663 L1-icache-loads # 368.285 M/sec > (92.96%) > 173,764 L1-icache-load-misses # 0.04% of all > L1-icache accesses (92.96%) > 2,203,364,632 dTLB-loads # 1940.385 M/sec > (92.96%) > 17,195 dTLB-load-misses # 0.00% of all > dTLB cache accesses (92.95%) > 418,198,365 iTLB-loads # 368.285 M/sec > (92.96%) > 79 iTLB-load-misses # 0.00% of all > iTLB cache accesses (85.34%) > > 1.137015760 seconds time elapsed > > 0.000000000 seconds user > 1.131266000 seconds sys > > The IPC improved a lot,less LLC-loads and more L1-dcache-loads, but > this depends on the implementation of the microarchitecture. Anyway, we need to avoid (or reduce at least) the pure memory clearing performance. Have you double checked whether process_huge_page() is inlined? Perf-profile result can be used to check this too. When you say from start to end, you mean to use clear_gigantic_page() directly, or change process_huge_page() to clear page from start to end? > 1) Will test some rand test to check the different of performance as > David suggested. > > 2) Hope the LKP to run more tests since it is very useful(more test > set and different machines) I'm starting to use LKP to test. -- Best Regards, Huang, Ying > >> >>> I questioned that just recently [1], and Ying assumed that it still >>> applies [2]. >>> >>> c79b57e462b5 ("mm: hugetlb: clear target >>> sub-page last when clearing huge page”) documents the scenario where >>> this matters -- anon-w-seq which you also run below. >>> >>> If there is no performance benefit anymore, we should rip that >>> out. But likely we should check on multiple micro-architectures with >>> multiple #CPU configs that are relevant. c79b57e462b5 used a Xeon E5 >>> v3 2699 with 72 processes on 2 NUMA nodes, maybe your test environment >>> cannot replicate that?>> >>> >>> [1] >>> https://lore.kernel.org/linux-mm/b8272cb4-aee8-45ad-8dff-353444b3fa74@xxxxxxxxxx/ >>> [2] >>> https://lore.kernel.org/linux-mm/878quv9lhf.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ >>> >>>> [1]https://lore.kernel.org/linux-mm/2524689c-08f5-446c-8cb9-924f9db0ee3a@xxxxxxxxxx/ >>>> case-anon-w-seq-mt (tried 2M PMD THP/ 64K mTHP) >>>> case-anon-w-seq-hugetlb (2M PMD HugeTLB) >>> >>> But these are sequential, not random. I'd have thought access + >>> zeroing would be sequentially either way. Did you run with random >>> access as well> > > Will do. >> > -- >> Best Regards, >> Huang, Ying >>