Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

> On 2024/11/1 16:16, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
>>
>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>> Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
>>>>
[snip]
>>>>>
>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>> David suggested.
>>>>>>> 2) Hope the LKP to run more tests since it is very useful(more test
>>>>>>> set and different machines)
>>>>>>
>>>>>> I'm starting to use LKP to test.
>>>>>
>>>>> Greet.
>>>
>>> Sorry for the late,
>>>
>>>> I have run some tests with LKP to test.
>>>>
>>>> Firstly, there's almost no measurable difference between clearing pages
>>>> from start to end or from end to start on Intel server CPU. I guess
>>>> that there's some similar optimization for both direction.
>>>>
>>>> For multiple processes (same as logical CPU number)
>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>> about 22.4%.
>>>
>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>
>> For vm-scalability/anon-w-seq test case, it is. Because the performance
>> of forward and backward clearing is almost same, and the user space
>> accessing has cache-hot benefit.
>>
>>> Could you test the following case on x86?
>>>
>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> mkdir -p /hugetlbfs/
>>> mount none /hugetlbfs/ -t hugetlbfs
>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
>>
>> It's not trivial for me to do this test. Because 0day wraps test cases.
>> Do you know which existing test cases provide this? For example, in
>> vm-scalability?
>
> I don't know the public fallocate test, I will try to find a intel
> machine to test this case.

I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.

>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>> measurable difference for benchmark score.
>>>>
>>>> So, the optimization helps sequential workload mainly.
>>>>
>>>> In summary, on x86, process_huge_page() will not introduce any
>>>> regression. And it helps some workload.
>>>>
>>>> However, on ARM64, it does introduce some regression for clearing pages
>>>> from end to start. That needs to be addressed. I guess that the
>>>> regression can be resolved via using more clearing from start to end
>>>> (but not all). For example, can you take a look at the patch below?
>>>> Which uses the similar framework as before, but clear each small trunk
>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>> the regression can be restored.
>>>>
>>>> WARNING: the patch is only build tested.
>>>
>>> Base: baseline
>>> Change1: using clear_gigantic_page() for 2M PMD
>>> Change2: your patch with MPAGE_NRPAGES=16
>>> Change3: Case3 + fix[1]
>>
>> What is case3?
>
> Oh, it is Change2.

Got it.

>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>
>>> 1. For rand write,
>>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>
>>> 2. For seq write,
>>>
>>> 1) case-anon-w-seq-mt:
>>
>> Can you try case-anon-w-seq? That may be more stable.
>>
>>> base:
>>> real    0m2.490s    0m2.254s    0m2.272s
>>> user    1m59.980s   2m23.431s   2m18.739s
>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>
>>> Change1:
>>> real    0m2.234s    0m2.225s    0m2.159s
>>> user    2m56.105s   2m57.117s   3m0.489s
>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>
>>> Change2:
>>> real    0m2.244s    0m2.384s    0m2.370s
>>> user    2m39.413s   2m41.990s   2m42.229s
>>> sys     0m19.826s   0m18.491s   0m18.053s
>>
>> It appears strange. There's no much cache hot benefit even if we clear
>> pages from end to begin (with larger chunk).
>>
>> However, sys time improves a lot. This shows clearing page with large
>> chunk helps on ARM64.
>>
>>> Change3: // best performance
>>> real    0m2.155s    0m2.204s    0m2.194s
>>> user    3m2.640s    2m55.837s   3m0.902s
>>> sys     0m17.346s   0m17.630s   0m18.197s
>>>
>>> Change4:
>>> real    0m2.287s    0m2.377s    0m2.284s
>>> user    2m37.030s   2m52.868s   3m17.593s
>>> sys     0m15.445s   0m34.430s   0m45.224s
>>
>> Change4 is essentially same as Change1. I don't know why they are
>> different. Is there some large variation among run to run?
>
> As above shown, I test three times, the test results are relatively
> stable, at least for real, I will try case-anon-w-seq.

Can you also show the score of vm-scalability?

TBH, I cannot understand your results. For example, why is there a
measurable difference between Change3 and Change4? In both cases, the
kernel clears pages from start to end.

>> Can you further optimize the prototype patch below? I think that it has
>> potential to fix your issue.
>
> Yes, thanks for you helper, but this will make process_huge_page() a
> little more complicated :)

IMHO, we should try to root cause it, then try to find the proper
solution and optimize (simplify) it.

--
Best Regards,
Huang, Ying

>>
>>> 2) case-anon-w-seq-hugetlb
>>> very similar 1), Change4 slightly better than Change3, but not big
>>> different.
>>>
>>> 3) hugetlbfs fallocate 20G
>>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) < Change2(0m1.275s) < base(0m3.016s)
>>>
>>> In summary, the Change3 is best and Change1 is good on my arm64 machine.
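To put rough numbers on the Change3 vs. Change4 question above: assuming
4 KiB base pages and a 2 MiB PMD folio (HPAGE_PMD_NR = 512), the two
configurations only differ in how coarsely the folio is chunked. The
little userspace sketch below just mirrors the MPAGE_* arithmetic of the
prototype (it is not kernel code, the values are illustrative):

    #include <stdio.h>

    int main(void)
    {
            unsigned long page_size = 4096;         /* assumed 4 KiB base pages */
            unsigned long hpage_pmd_nr = 512;       /* assumed 2 MiB PMD folio */
            unsigned long nrpages[] = { 16, 64 };   /* Change2/3 vs. Change4 */

            for (int i = 0; i < 2; i++) {
                    unsigned long mpage_size = page_size * nrpages[i];
                    unsigned long nr_mpages = hpage_pmd_nr / nrpages[i];

                    /* MPAGE_NRPAGES=16 -> 64 KiB chunks, 32 per folio;
                     * MPAGE_NRPAGES=64 -> 256 KiB chunks, 8 per folio. */
                    printf("MPAGE_NRPAGES=%lu: %lu KiB chunk, %lu chunks per 2M folio\n",
                           nrpages[i], mpage_size >> 10, nr_mpages);
            }
            return 0;
    }

So both configurations clear the same bytes in the same overall order;
only the chunk granularity changes, which is why the gap between them in
the seq results looks odd to me.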
>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>> -----------------------------------8<----------------------------------------
>>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>>> From: Huang Ying <ying.huang@xxxxxxxxx>
>>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>>> Subject: [PATCH] mpage clear
>>>>
>>>> ---
>>>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ccee51adfbb..1fdc548c4275 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>>          return 0;
>>>>  }
>>>>
>>>> +#define MPAGE_NRPAGES (1<<4)
>>>> +#define MPAGE_SIZE    (PAGE_SIZE * MPAGE_NRPAGES)
>>>> +static inline int clear_huge_page(
>>>> +        unsigned long addr_hint, unsigned int nr_pages,
>>>> +        int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>> +        void *arg)
>>>> +{
>>>> +        int i, n, base, l, ret;
>>>> +        unsigned long addr = addr_hint &
>>>> +                ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>> +        unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>>> +
>>>> +        /* Process target subpage last to keep its cache lines hot */
>>>> +        might_sleep();
>>>> +        n = (addr_hint - addr) / MPAGE_SIZE;
>>>> +        if (2 * n <= nr_mpages) {
>>>> +                /* If target subpage in first half of huge page */
>>>> +                base = 0;
>>>> +                l = n;
>>>> +                /* Process subpages at the end of huge page */
>>>> +                for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>>> +                        cond_resched();
>>>> +                        ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +                                              i * MPAGE_NRPAGES, arg);
>>>> +                        if (ret)
>>>> +                                return ret;
>>>> +                }
>>>> +        } else {
>>>> +                /* If target subpage in second half of huge page */
>>>> +                base = nr_mpages - 2 * (nr_mpages - n);
>>>> +                l = nr_mpages - n;
>>>> +                /* Process subpages at the begin of huge page */
>>>> +                for (i = 0; i < base; i++) {
>>>> +                        cond_resched();
>>>> +                        ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +                                              i * MPAGE_NRPAGES, arg);
>>>> +                        if (ret)
>>>> +                                return ret;
>>>> +                }
>>>> +        }
>>>> +        /*
>>>> +         * Process remaining subpages in left-right-left-right pattern
>>>> +         * towards the target subpage
>>>> +         */
>>>> +        for (i = 0; i < l; i++) {
>>>> +                int left_idx = base + i;
>>>> +                int right_idx = base + 2 * l - 1 - i;
>>>> +
>>>> +                cond_resched();
>>>> +                ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>>> +                                      left_idx * MPAGE_NRPAGES, arg);
>>>> +                if (ret)
>>>> +                        return ret;
>>>> +                cond_resched();
>>>> +                ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>>> +                                      right_idx * MPAGE_NRPAGES, arg);
>>>> +                if (ret)
>>>> +                        return ret;
>>>> +        }
>>>> +        return 0;
>>>> +}
>>>> +
>>>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>                                  unsigned int nr_pages)
>>>>  {
>>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>>  {
>>>>          struct folio *folio = arg;
>>>> +        int i;
>>>>
>>>> -        clear_user_highpage(folio_page(folio, idx), addr);
>>>> +        for (i = 0; i < MPAGE_NRPAGES; i++)
>>>> +                clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>>          return 0;
>>>>  }
>>>>
>>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>  {
>>>>          unsigned int nr_pages = folio_nr_pages(folio);
>>>>
>>>> -        if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> +        if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>>                  clear_gigantic_page(folio, addr_hint, nr_pages);
>>>>          else
>>>> -                process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>> +                clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>>  }
>>>>
>>>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>>
>>> [1] fix patch
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b22d4b83295b..aee99ede0c4f 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>>                  base = 0;
>>>                  l = n;
>>>                  /* Process subpages at the end of huge page */
>>> -                for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> +                for (i = 2 * n; i < nr_mpages; i++) {
>>>                          cond_resched();
>>>                          ret = process_subpage(addr + i * MPAGE_SIZE,
>>>                                                i * MPAGE_NRPAGES, arg);
>>
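As I read it, fix [1] only flips the walk over the tail mpages from
descending to ascending order; the set of mpages cleared and the
left-right convergence on the target are unchanged. A tiny userspace
sketch of that difference (my own illustration, not from the patch;
nr_mpages = 32 corresponds to MPAGE_NRPAGES=16 on a 2 MiB folio, and
n = 4 is an arbitrary target mpage in the first half):

    #include <stdio.h>

    int main(void)
    {
            int nr_mpages = 32, n = 4, i;

            /* Tail mpages processed when the target falls in the first half. */
            printf("before [1]:");
            for (i = nr_mpages - 1; i >= 2 * n; i--)        /* end -> start */
                    printf(" %d", i);
            printf("\nafter  [1]:");
            for (i = 2 * n; i < nr_mpages; i++)             /* start -> end */
                    printf(" %d", i);
            printf("\n");
            return 0;
    }

If the ascending order is really what ARM64 prefers, that would explain
why [1] recovers the regression without giving up the "target last"
ordering in the head of the folio.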