Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

> On 2024/10/30 11:21, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
>>
>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>> David Hildenbrand <david@xxxxxxxxxx> writes:
>>>>
>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, let's fix the bug first,
>>>>>>>>>>>>
>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure about this part; it uses nr_pages as the end and calculates
>>>>>>>>>>>> the 'base'.
>>>>>>>>>>>
>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>
>>>>>>>>>> But process_huge_page() has no explicit folio argument, so I'd like to
>>>>>>>>>> move the aligned address calculation into folio_zero_user() and
>>>>>>>>>> copy_user_large_folio() (will rename it to folio_copy_user()) in the
>>>>>>>>>> following cleanup patches, or should it be done in the fix patches?
>>>>>>>>>
>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>> folio*? Because we like our code to be extra complicated to understand?
>>>>>>>>> Or am I missing something important?
>>>>>>>>
>>>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>>>>> after anon mTHP support was added it is used for order-2 to PMD-order
>>>>>>>> THP and HugeTLB, so it won't process a small folio if I understand
>>>>>>>> correctly.
>>>>>>>
>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>> expresses that :(
>>>>>>>
>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>
>>>>>> Actually, could we drop process_huge_page() entirely? From my
>>>>>> testcase[1], process_huge_page() is no better than clearing/copying the
>>>>>> page from start to end, and sequential clearing/copying may be more
>>>>>> beneficial for hardware prefetching. Also, is there a way to let LKP
>>>>>> run tests to check the performance? Since process_huge_page()
>>>>>> was submitted by Ying, what's your opinion?
>>>>
>>>> I don't think that it's a good idea to revert the commit without
>>>> studying and root causing the issues. I can work together with you on
>>>> that. If we have solid and well explained data to prove
>>>> process_huge_page() isn't beneficial, we can revert the commit.
>>>
>>> Take 'fallocate 20G' as an example, before:
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>>
>> IIUC, fallocate will zero pages, but will not touch them at all,
>> right? If so, no cache benefit from clearing the referenced page last.
>
> Yes, for this case, only clear page.
>>
>>>           3,118.94 msec task-clock                #    0.999 CPUs utilized
>>>                 30      context-switches          #    0.010 K/sec
>>>                  1      cpu-migrations            #    0.000 K/sec
>>>                136      page-faults               #    0.044 K/sec
>>>      8,092,075,873      cycles                    #    2.594 GHz                        (92.82%)
>>>      1,624,587,663      instructions              #    0.20  insn per cycle             (92.83%)
>>>        395,341,850      branches                  #  126.755 M/sec                      (92.82%)
>>>          3,872,302      branch-misses             #    0.98% of all branches            (92.83%)
>>>      1,398,066,701      L1-dcache-loads           #  448.251 M/sec                      (92.82%)
>>>         58,124,626      L1-dcache-load-misses     #    4.16% of all L1-dcache accesses  (92.82%)
>>>          1,032,527      LLC-loads                 #    0.331 M/sec                      (92.82%)
>>>            498,684      LLC-load-misses           #   48.30% of all LL-cache accesses   (92.84%)
>>>        473,689,004      L1-icache-loads           #  151.875 M/sec                      (92.82%)
>>>            356,721      L1-icache-load-misses     #    0.08% of all L1-icache accesses  (92.85%)
>>>      1,947,644,987      dTLB-loads                #  624.458 M/sec                      (92.95%)
>>>             10,185      dTLB-load-misses          #    0.00% of all dTLB cache accesses (92.96%)
>>>        474,622,896      iTLB-loads                #  152.174 M/sec                      (92.95%)
>>>                 94      iTLB-load-misses          #    0.00% of all iTLB cache accesses (85.69%)
>>>
>>>        3.122844830 seconds time elapsed
>>>
>>>        0.000000000 seconds user
>>>        3.107259000 seconds sys
>>>
>>> and after (clear from start to end):
>>>
>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>> /mnt/hugetlbfs/test':
>>>
>>>           1,135.53 msec task-clock                #    0.999 CPUs utilized
>>>                 10      context-switches          #    0.009 K/sec
>>>                  1      cpu-migrations            #    0.001 K/sec
>>>                137      page-faults               #    0.121 K/sec
>>>      2,946,673,587      cycles                    #    2.595 GHz                        (92.67%)
>>>      1,620,704,205      instructions              #    0.55  insn per cycle             (92.61%)
>>>        394,595,772      branches                  #  347.499 M/sec                      (92.60%)
>>>            130,756      branch-misses             #    0.03% of all branches            (92.84%)
>>>      1,396,726,689      L1-dcache-loads           # 1230.022 M/sec                      (92.96%)
>>>            338,344      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (92.95%)
>>>            111,737      LLC-loads                 #    0.098 M/sec                      (92.96%)
>>>             67,486      LLC-load-misses           #   60.40% of all LL-cache accesses   (92.96%)
>>>        418,198,663      L1-icache-loads           #  368.285 M/sec                      (92.96%)
>>>            173,764      L1-icache-load-misses     #    0.04% of all L1-icache accesses  (92.96%)
>>>      2,203,364,632      dTLB-loads                # 1940.385 M/sec                      (92.96%)
>>>             17,195      dTLB-load-misses          #    0.00% of all dTLB cache accesses (92.95%)
>>>        418,198,365      iTLB-loads                #  368.285 M/sec                      (92.96%)
>>>                 79      iTLB-load-misses          #    0.00% of all iTLB cache accesses (85.34%)
>>>
>>>        1.137015760 seconds time elapsed
>>>
>>>        0.000000000 seconds user
>>>        1.131266000 seconds sys
>>>
>>> The IPC improved a lot, with fewer LLC-loads and far fewer
>>> L1-dcache-load-misses, but this depends on the implementation of the
>>> microarchitecture.
>>
>> Anyway, we need to avoid (or at least reduce) the pure memory clearing
>> performance regression. Have you double-checked whether process_huge_page()
>> is inlined? A perf-profile result can be used to check this too.
>>
>
> Yes, I'm sure process_huge_page() is inlined.
>
>> When you say from start to end, do you mean using clear_gigantic_page()
>> directly, or changing process_huge_page() to clear the page from start
>> to end?
>>
>
> Using clear_gigantic_page() and changing process_huge_page() to clear the
> page from start to end are both good for performance with sequential
> clearing, but I have not run any random-access tests so far.
>
>>> 1) Will run some random-access tests to check the performance difference,
>>> as David suggested.
>>>
>>> 2) Hope LKP can run more tests, since it is very useful (more test sets
>>> and different machines).
>>
>> I'm starting to use LKP to test.
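To make the comparison above concrete, here is a minimal userspace sketch
(not the kernel implementation) of the two clearing orders being discussed:
sequential start-to-end clearing (clear_gigantic_page()-style) versus
clearing the faulting subpage last (the idea behind process_huge_page()).
SUBPAGE_SIZE, NR_SUBPAGES and the function names below are illustrative
assumptions, and the real process_huge_page() additionally works inward
from both ends toward the target subpage, which this sketch omits.

/*
 * Userspace illustration only; sizes and names are assumptions.
 */
#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL	/* assumed base page size */
#define NR_SUBPAGES	512UL	/* assumed 2MiB "huge page" */

/* Sequential clearing: walk from the first subpage to the last. */
static void clear_sequential(char *huge)
{
	for (unsigned long i = 0; i < NR_SUBPAGES; i++)
		memset(huge + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

/*
 * "Target last" clearing: clear every other subpage first, then the
 * subpage the caller is about to touch, so its cache lines are the
 * most recently written when the faulting access resumes.
 */
static void clear_target_last(char *huge, unsigned long target)
{
	for (unsigned long i = 0; i < NR_SUBPAGES; i++)
		if (i != target)
			memset(huge + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
	memset(huge + target * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

int main(void)
{
	char *huge = malloc(NR_SUBPAGES * SUBPAGE_SIZE);

	if (!huge)
		return 1;

	/* fallocate-style zeroing: nothing reads the page afterwards. */
	clear_sequential(huge);

	/* fault-style zeroing: subpage 42 is about to be accessed. */
	clear_target_last(huge, 42);

	free(huge);
	return 0;
}

In the fallocate case above nothing touches the folio right after zeroing,
so the target-last order brings no cache benefit, while the non-sequential
walk of the real implementation is harder on the hardware prefetcher; that
is consistent with the perf numbers quoted above.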
https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@xxxxxxxxxxx/

Just remembered that we have discussed a similar issue for arm64 before.
Can you take a look at it? There's more discussion and tests/results in
that thread; I think it may be helpful.

--
Best Regards,
Huang, Ying