Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes: > On 2024/10/25 10:59, Huang, Ying wrote: >> Hi, Kefeng, >> Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes: >> >>> +CC Huang Ying, >>> >>> On 2024/10/23 6:56, Barry Song wrote: >>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote: >>>>> >>>>> >>> ... >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote: >>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote: >>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this? >>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for >>>>>>>>>>>>>>>>>>>> a plain memset(). On the other hand, if the folio is large, maybe a >>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change >>>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to >>>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep >>>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range(). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c: folio_zero_range(folio, 0, >>>>>>>>>>>>>>>>>>> folio_size(folio)); >>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c: folio_zero_range(f, >>>>>>>>>>>>>>>>>>> 0, folio_size(f)); >>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c: folio_zero_range(f, >>>>>>>>>>>>>>>>>>> 0, folio_size(f)); >>>>>>>>>>>>>>>>>>> fs/libfs.c: folio_zero_range(folio, 0, folio_size(folio)); >>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c: folio_zero_range(folio, 0, >>>>>>>>>>>>>>>>>>> folio_size(folio)); >>>>>>>>>>>>>>>>>>> mm/page_io.c: folio_zero_range(folio, 0, folio_size(folio)); >>>>>>>>>>>>>>>>>>> mm/shmem.c: folio_zero_range(folio, 0, folio_size(folio)); >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER)) >>>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios >>>>>>>>>>>>>>>>>>> clear_highpage/folio_zero_range/folio_zero_user >>>>>>>>>>>>>>>>>>> 3) release N folios >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> N=1, >>>>>>>>>>>>>>>>>>> clear_highpage folio_zero_range folio_zero_user >>>>>>>>>>>>>>>>>>> 1 69 74 177 >>>>>>>>>>>>>>>>>>> 2 57 62 168 >>>>>>>>>>>>>>>>>>> 3 54 58 234 >>>>>>>>>>>>>>>>>>> 4 54 58 157 >>>>>>>>>>>>>>>>>>> 5 56 62 148 >>>>>>>>>>>>>>>>>>> avg 58 62.8 176.8 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> N=100 >>>>>>>>>>>>>>>>>>> clear_highpage folio_zero_range folio_zero_user >>>>>>>>>>>>>>>>>>> 1 11015 11309 32833 >>>>>>>>>>>>>>>>>>> 2 10385 11110 49751 >>>>>>>>>>>>>>>>>>> 3 10369 11056 33095 >>>>>>>>>>>>>>>>>>> 4 10332 11017 33106 >>>>>>>>>>>>>>>>>>> 5 10483 11000 49032 >>>>>>>>>>>>>>>>>>> avg 10516.8 11098.4 39563.4 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> N=512 >>>>>>>>>>>>>>>>>>> clear_highpage folio_zero_range folio_zero_user >>>>>>>>>>>>>>>>>>> 1 55560 60055 156876 >>>>>>>>>>>>>>>>>>> 2 55485 60024 157132 >>>>>>>>>>>>>>>>>>> 3 55474 60129 156658 >>>>>>>>>>>>>>>>>>> 4 55555 59867 157259 >>>>>>>>>>>>>>>>>>> 5 55528 59932 157108 >>>>>>>>>>>>>>>>>>> avg 55520.4 60001.4 157006.6 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot, >>>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0, >>>>>>>>>>>>>>>>>>> folio_size(folio)) >>>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of >>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome. >>>>>>>>>>>>>>>>> >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>>>>> hi Kefeng, >>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better >>>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large >>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero() >>>>>>>>>>>>>> since it zero the whole folio. >>>>>>>>>>>>> >>>>>>>>>>>>> we already have a helper like folio_zero_user()? >>>>>>>>>>>>> it is not good enough? >>>>>>>>>>>> >>>>>>>>>>>> Since it is with many cond_resched(), the performance is worst... >>>>>>>>>>> >>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel. >>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio >>>>>>>>>>> from occupying the CPU and starving other processes, right? >>>>>>>>>> >>>>>>>>>> --- a/mm/shmem.c >>>>>>>>>> +++ b/mm/shmem.c >>>>>>>>>> >>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode >>>>>>>>>> *inode, pgoff_t index, >>>>>>>>>> * it now, lest undo on failure cancel our earlier guarantee. >>>>>>>>>> */ >>>>>>>>>> >>>>>>>>>> if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) { >>>>>>>>>> - long i, n = folio_nr_pages(folio); >>>>>>>>>> - >>>>>>>>>> - for (i = 0; i < n; i++) >>>>>>>>>> - clear_highpage(folio_page(folio, i)); >>>>>>>>>> + folio_zero_user(folio, vmf->address); >>>>>>>>>> flush_dcache_folio(folio); >>>>>>>>>> folio_mark_uptodate(folio); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Do we perform better or worse with the following? >>>>>>>>> >>>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio, >>>>>>>>> 0), I think the performance is worse, will retest once I can access >>>>>>>>> hardware. >>>>>>>> >>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using >>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit? >>>>>>>> >>>>>>> >>>>>>> when use folio_zero_user(), the performance is vary bad with above >>>>>>> fallocate test(mount huge=always), >>>>>>> >>>>>>> folio_zero_range clear_highpage folio_zero_user >>>>>>> real 0m1.214s 0m1.111s 0m3.159s >>>>>>> user 0m0.000s 0m0.000s 0m0.000s >>>>>>> sys 0m1.210s 0m1.109s 0m3.152s >>>>>>> >>>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different. >>>>>> >>>>>> Interesting. Does your kernel have preemption disabled or >>>>>> preemption_debug enabled? >>>>> >>>>> ARM64 server, CONFIG_PREEMPT_NONE=y >>>> this explains why the performance is much worse. >>>> >>>>> >>>>>> >>>>>> If not, it makes me wonder whether folio_zero_user() in >>>>>> alloc_anon_folio() is actually improving performance as expected, >>>>>> compared to the simpler folio_zero() you plan to implement. :-) >>>>> >>>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from >>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page >>>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or >>>>> other size of large folio, cond_resched is not necessary since we >>>>> already have some folio_zero_range() to clear large folio, and no issue >>>>> was reported. >>>> probably worth an optimization. calling cond_resched() for each page >>>> seems too aggressive and useless. >>> >>> After some test, I think the cond_resched() is not the root cause, >>> no performance gained with batched cond_resched(), even I kill >>> cond_resched() from process_huge_page, no improvement. >>> >>> But when I unconditionally use clear_gigantic_page() in >>> folio_zero_user(patched), there is big improvement with above >>> fallocate on tmpfs(mount huge=always), also I test some other testcase, >>> >>> >>> 1) case-anon-w-seq-mt: (2M PMD THP) >>> >>> base: >>> real 0m2.490s 0m2.254s 0m2.272s >>> user 1m59.980s 2m23.431s 2m18.739s >>> sys 1m3.675s 1m15.462s 1m15.030s >>> >>> patched: >>> real 0m2.234s 0m2.225s 0m2.159s >>> user 2m56.105s 2m57.117s 3m0.489s >>> sys 0m17.064s 0m17.564s 0m16.150s >>> >>> Patched kernel win on sys and bad in user, but real is almost same, >>> maybe a little better than base. >> We can find user time difference. That means the original cache hot >> behavior still applies on your system. >> However, it appears that the performance to clear page from end to >> begin >> is really bad on your system. >> So, I suggest to revise the current implementation to use sequential >> clearing as much as possible. >> > > I test case-anon-cow-seq-hugetlb for copy_user_large_folio() > > base: > real 0m6.259s 0m6.197s 0m6.316s > user 1m31.176s 1m27.195s 1m29.594s > sys 7m44.199s 7m51.490s 8m21.149s > > patched(use copy_user_gigantic_page for 2M hugetlb too) > real 0m3.182s 0m3.002s 0m2.963s > user 1m19.456s 1m3.107s 1m6.447s > sys 2m59.222s 3m10.899s 3m1.027s > > and sequential copy is better than the current implementation, > so I will use sequential clear and copy. Sorry, it appears that you misunderstanding my suggestion. I suggest to revise process_huge_page() to use more sequential memory clearing and copying to improve its performance on your platform. -- Best Regards, Huang, Ying >>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB) >>> >>> base: >>> real 0m5.175s 0m5.117s 0m4.856s >>> user 5m15.943s 5m7.567s 4m29.273s >>> sys 2m38.503s 2m21.949s 2m21.252s >>> >>> patched: >>> real 0m4.966s 0m4.841s 0m4.561s >>> user 6m30.123s 6m9.516s 5m49.733s >>> sys 0m58.503s 0m47.847s 0m46.785s >>> >>> >>> This case is similar to the case1. >>> >>> 3) fallocate hugetlb 20G (2M PMD HugeTLB) >>> >>> base: >>> real 0m3.016s 0m3.019s 0m3.018s >>> user 0m0.000s 0m0.000s 0m0.000s >>> sys 0m3.009s 0m3.012s 0m3.010s >>> >>> patched: >>> >>> real 0m1.136s 0m1.136s 0m1.136s >>> user 0m0.000s 0m0.000s 0m0.004s >>> sys 0m1.133s 0m1.133s 0m1.129s >>> >>> >>> There is big win on patched kernel, and it is similar to above tmpfs >>> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb: >>> clear target sub-page last when clearing huge page"). -- Best Regards, Huang, Ying