Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

> On 2024/10/31 16:39, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
>> [snip]
>>>
>>>>> 1) Will run some random-access tests to check the difference in
>>>>> performance, as David suggested.
>>>>>
>>>>> 2) Hope LKP can run more tests since it is very useful (more test
>>>>> sets and different machines).
>>>> I'm starting to use LKP to test.
>>>
>>> Great.
>
> Sorry for the late reply.
>
>> I have run some tests with LKP.
>>
>> Firstly, there's almost no measurable difference between clearing pages
>> from start to end or from end to start on an Intel server CPU. I guess
>> that there's a similar optimization for both directions.
>>
>> For the multi-process (process count equal to the logical CPU number)
>> vm-scalability/anon-w-seq test case, the benchmark score increases by
>> about 22.4%.
>
> So process_huge_page() is better than clear_gigantic_page() on Intel?

For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and the
user-space accesses get the cache-hot benefit.

> Could you test the following case on x86?
>
> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> mkdir -p /hugetlbfs/
> mount none /hugetlbfs/ -t hugetlbfs
> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && \
>   fallocate -d -l 20G /hugetlbfs/test && \
>   time taskset -c 10 fallocate -l 20G /hugetlbfs/test

It's not trivial for me to run this test, because 0day wraps test cases.
Do you know which existing test case provides this? For example, in
vm-scalability?

>> For the multi-process vm-scalability/anon-w-rand test case, there is no
>> measurable difference in the benchmark score. So, the optimization
>> mainly helps sequential workloads.
>>
>> In summary, on x86, process_huge_page() will not introduce any
>> regression, and it helps some workloads.
>>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved by doing more of the clearing from start to
>> end (but not all). For example, can you take a look at the patch below?
>> It uses a similar framework as before, but clears each small chunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression goes away.
>>
>> WARNING: the patch is only build tested.
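
To make the intended processing order concrete, here is a stand-alone
user-space sketch (illustration only, it prints chunk indices instead of
clearing memory) that mirrors the loop structure of the patch below.
NR_MPAGES = 32 corresponds to a 2MB huge page (512 pages) split into
MPAGE_NRPAGES = 16 page chunks; process() and chunk_order() are just
names for the illustration.

#include <stdio.h>

#define NR_MPAGES	32	/* 512 pages / 16 pages per chunk */

static void process(int idx)
{
	printf("%d ", idx);
}

static void chunk_order(int n)	/* n: chunk containing the fault address */
{
	int i, base, l;

	if (2 * n <= NR_MPAGES) {
		/* target in first half: walk the tail chunks (backward
		 * here; your fix [1] below turns this loop forward) */
		base = 0;
		l = n;
		for (i = NR_MPAGES - 1; i >= 2 * n; i--)
			process(i);
	} else {
		/* target in second half: walk the head chunks forward */
		base = NR_MPAGES - 2 * (NR_MPAGES - n);
		l = NR_MPAGES - n;
		for (i = 0; i < base; i++)
			process(i);
	}
	/* then converge on the target chunk left-right-left-right,
	 * so its cache lines are touched last */
	for (i = 0; i < l; i++) {
		process(base + i);
		process(base + 2 * l - 1 - i);
	}
	printf("\n");
}

int main(void)
{
	chunk_order(0);		/* fault in the first chunk */
	chunk_order(16);	/* fault in the middle */
	return 0;
}

Within each chunk, clear_subpage() in the patch then clears the
MPAGE_NRPAGES pages from start to end.
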
> Base: baseline
> Change1: using clear_gigantic_page() for 2M PMD
> Change2: your patch with MPAGE_NRPAGES=16
> Change3: Case3 + fix [1]

What is Case3?

> Change4: your patch with MPAGE_NRPAGES=64 + fix [1]
>
> 1. For rand write,
> case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>
> 2. For seq write,
>
> 1) case-anon-w-seq-mt:

Can you try case-anon-w-seq? That may be more stable.

> base:
> real	0m2.490s	0m2.254s	0m2.272s
> user	1m59.980s	2m23.431s	2m18.739s
> sys	1m3.675s	1m15.462s	1m15.030s
>
> Change1:
> real	0m2.234s	0m2.225s	0m2.159s
> user	2m56.105s	2m57.117s	3m0.489s
> sys	0m17.064s	0m17.564s	0m16.150s
>
> Change2:
> real	0m2.244s	0m2.384s	0m2.370s
> user	2m39.413s	2m41.990s	2m42.229s
> sys	0m19.826s	0m18.491s	0m18.053s

This looks strange. There's not much cache-hot benefit even though we
clear pages from end to beginning (with a larger chunk). However, the
sys time improves a lot. This shows that clearing pages in larger chunks
helps on ARM64.

> Change3: // best performance
> real	0m2.155s	0m2.204s	0m2.194s
> user	3m2.640s	2m55.837s	3m0.902s
> sys	0m17.346s	0m17.630s	0m18.197s
>
> Change4:
> real	0m2.287s	0m2.377s	0m2.284s
> user	2m37.030s	2m52.868s	3m17.593s
> sys	0m15.445s	0m34.430s	0m45.224s

Change4 is essentially the same as Change1. I don't know why they
differ. Is there some large variation from run to run?

Can you further optimize the prototype patch below? I think that it has
the potential to fix your issue.

> 2) case-anon-w-seq-hugetlb
> very similar to 1); Change4 is slightly better than Change3, but not by
> much.
>
> 3) hugetlbfs fallocate 20G
> Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) <
> Change2 (0m1.275s) < base (0m3.016s)
>
> In summary, Change3 is best and Change1 is good on my arm64 machine.
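
Btw., it may be worth checking how much of this is the CPU's direction
sensitivity alone, independent of the kernel path. Something like the
following user-space sketch (not tested; the buffer size is arbitrary,
it just needs to be much larger than the last-level cache) clears a
buffer page by page, ordering the pages forward vs. backward:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ	4096UL
#define BUF_SZ	(512UL << 20)	/* 512MB */

static double clear_by_page(char *buf, int backward)
{
	unsigned long i, nr = BUF_SZ / PAGE_SZ;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nr; i++) {
		unsigned long idx = backward ? nr - 1 - i : i;

		/* each page is cleared forward; only the page order flips */
		memset(buf + idx * PAGE_SZ, 0, PAGE_SZ);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	char *buf = aligned_alloc(PAGE_SZ, BUF_SZ);

	if (!buf)
		return 1;
	memset(buf, 1, BUF_SZ);	/* fault everything in first */
	printf("forward:  %.3f s\n", clear_by_page(buf, 0));
	printf("backward: %.3f s\n", clear_by_page(buf, 1));
	free(buf);
	return 0;
}

If backward comes out much slower here too, the regression is mostly the
hardware's preference for ascending streams, and the chunked-forward
approach in the patch should recover most of it.
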
>
>> Best Regards,
>> Huang, Ying
>>
>> -----------------------------------8<----------------------------------------
>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@xxxxxxxxx>
>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>> Subject: [PATCH] mpage clear
>>
>> ---
>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ccee51adfbb..1fdc548c4275 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>
>> +#define MPAGE_NRPAGES	(1<<4)
>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>> +static inline int clear_huge_page(
>> +	unsigned long addr_hint, unsigned int nr_pages,
>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>> +	void *arg)
>> +{
>> +	int i, n, base, l, ret;
>> +	unsigned long addr = addr_hint &
>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>> +
>> +	/* Process target subpage last to keep its cache lines hot */
>> +	might_sleep();
>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>> +	if (2 * n <= nr_mpages) {
>> +		/* If target subpage in first half of huge page */
>> +		base = 0;
>> +		l = n;
>> +		/* Process subpages at the end of huge page */
>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	} else {
>> +		/* If target subpage in second half of huge page */
>> +		base = nr_mpages - 2 * (nr_mpages - n);
>> +		l = nr_mpages - n;
>> +		/* Process subpages at the begin of huge page */
>> +		for (i = 0; i < base; i++) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	/*
>> +	 * Process remaining subpages in left-right-left-right pattern
>> +	 * towards the target subpage
>> +	 */
>> +	for (i = 0; i < l; i++) {
>> +		int left_idx = base + i;
>> +		int right_idx = base + 2 * l - 1 - i;
>> +
>> +		cond_resched();
>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>> +				      left_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +		cond_resched();
>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>> +				      right_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	return 0;
>> +}
>> +
>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  				unsigned int nr_pages)
>>  {
>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>  {
>>  	struct folio *folio = arg;
>> +	int i;
>>
>> -	clear_user_highpage(folio_page(folio, idx), addr);
>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>  	return 0;
>>  }
>>
>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>  {
>>  	unsigned int nr_pages = folio_nr_pages(folio);
>>
>> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>  		clear_gigantic_page(folio, addr_hint, nr_pages);
>>  	else
>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>  }
>>
>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,

> [1] fix patch
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b22d4b83295b..aee99ede0c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>  		base = 0;
>  		l = n;
>  		/* Process subpages at the end of huge page */
> -		for (i = nr_mpages - 1; i >= 2 * n; i--) {
> +		for (i = 2 * n; i < nr_mpages; i++) {
>  			cond_resched();
>  			ret = process_subpage(addr + i * MPAGE_SIZE,
>  					      i * MPAGE_NRPAGES, arg);
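
For clarity, this fix only changes the walk over the tail chunks from
descending to ascending; the converge-on-target part is unchanged. A
tiny stand-alone illustration (example values only: 32 chunks, fault in
chunk 0):

#include <stdio.h>

int main(void)
{
	int nr_mpages = 32, n = 0, i;

	printf("before fix:");		/* tail walked backward: 31 30 ... 0 */
	for (i = nr_mpages - 1; i >= 2 * n; i--)
		printf(" %d", i);

	printf("\nafter fix: ");	/* tail walked forward: 0 1 ... 31 */
	for (i = 2 * n; i < nr_mpages; i++)
		printf(" %d", i);
	printf("\n");
	return 0;
}

Note that for a fault in the first chunk this becomes a fully forward
clear of the whole huge page, trading the cache-hot target chunk for a
purely sequential store stream; given the Change3 numbers above, that
seems to be what your ARM64 machine prefers.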