On 2024/11/1 16:16, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/31 16:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
[snip]
1) Will run some random-access tests to check the difference in performance, as David suggested.
2) Hope LKP can run more tests, since it is very useful (more test sets and different machines).
I'm starting to use LKP to test.
Great.
Sorry for the late reply.
I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing pages from start to end or from end to start on an Intel server CPU. I guess there's some similar optimization for both directions.
For the multi-process (one process per logical CPU) vm-scalability/anon-w-seq test case, the benchmark score increases by about 22.4%.
So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is, because the performance of forward and backward clearing is almost the same, and the user-space accesses get the cache-hot benefit.
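For reference, here is a minimal user-space sketch (my own illustration, not vm-scalability itself; the 1GB size and the one-write-per-cache-line stride are arbitrary) of the kind of access pattern case-anon-w-seq exercises: fault in THP-backed anonymous memory and write it sequentially, so the cache lines that process_huge_page() clears last, around the faulting address, are still hot when the sequential writes continue.

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define SZ (1UL << 30)	/* 1GB of anonymous memory */

int main(void)
{
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, SZ, MADV_HUGEPAGE);	/* ask for THP */

	/*
	 * Sequential writes, one per 64-byte cache line; the first write
	 * to each huge-page region can fault in a THP and zero it via
	 * folio_zero_user().
	 */
	for (size_t i = 0; i < SZ; i += 64)
		buf[i] = 1;

	munmap(buf, SZ);
	return 0;
}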
Could you test the following case on x86?
echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
It's not trivial for me to do this test, because 0day wraps test cases. Do you know of an existing test case that provides this? For example, in vm-scalability?
I don't know of a public fallocate test; I will try to find an Intel machine to test this case.
For the multi-process vm-scalability/anon-w-rand test case, there is no measurable difference in the benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any regression, and it helps some workloads.
However, on ARM64, it does introduce some regression from clearing pages from end to start, and that needs to be addressed. I guess the regression can be resolved by doing more of the clearing from start to end (but not all of it). For example, can you take a look at the patch below? It uses a similar framework as before, but clears each small chunk (mpage) from start to end. You can adjust MPAGE_NRPAGES to check at which chunk size the regression goes away.
WARNING: the patch is only build tested.
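For reference, a quick back-of-envelope on the chunk geometry (assuming 4KB base pages and a 2MB PMD huge page):

MPAGE_SIZE = PAGE_SIZE * MPAGE_NRPAGES = 4KB * 16 = 64KB, so nr_mpages = 2MB / 64KB = 32 chunks per huge page
with MPAGE_NRPAGES = 64: MPAGE_SIZE = 4KB * 64 = 256KB, so nr_mpages = 2MB / 256KB = 8 chunks per huge page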
Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
What is case3?
Oh, it is Change2.
Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
1. For random write, case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference.
2. For sequential write,
1) case-anon-w-seq-mt:
Can you try case-anon-w-seq? That may be more stable.
base:
real 0m2.490s 0m2.254s 0m2.272s
user 1m59.980s 2m23.431s 2m18.739s
sys 1m3.675s 1m15.462s 1m15.030s
Change1:
real 0m2.234s 0m2.225s 0m2.159s
user 2m56.105s 2m57.117s 3m0.489s
sys 0m17.064s 0m17.564s 0m16.150s
Change2:
real 0m2.244s 0m2.384s 0m2.370s
user 2m39.413s 2m41.990s 2m42.229s
sys 0m19.826s 0m18.491s 0m18.053s
That appears strange: there's not much cache-hot benefit even if we clear pages from end to beginning (with a larger chunk). However, sys time improves a lot. This shows that clearing pages in larger chunks helps on ARM64.
Change3: // best performance
real 0m2.155s 0m2.204s 0m2.194s
user 3m2.640s 2m55.837s 3m0.902s
sys 0m17.346s 0m17.630s 0m18.197s
Change4:
real 0m2.287s 0m2.377s 0m2.284s
user 2m37.030s 2m52.868s 3m17.593s
sys 0m15.445s 0m34.430s 0m45.224s
Change4 is essentially the same as Change1. I don't know why they differ here. Is there large variation from run to run?
As shown above, I tested three times and the results are relatively stable, at least for real time. I will try case-anon-w-seq.
Can you further optimize the prototype patch below? I think that it has
potential to fix your issue.
Yes, thanks for your help, but this will make process_huge_page() a little more complicated :)
2) case-anon-w-seq-hugetlb
Very similar to 1); Change4 is slightly better than Change3, but the difference is not big.
3) hugetlbfs fallocate 20G
Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) < Change2 (0m1.275s) < base (0m3.016s)
In summary, Change3 is best and Change1 is good on my arm64 machine.
Best Regards,
Huang, Ying
-----------------------------------8<----------------------------------------
From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@xxxxxxxxx>
Date: Thu, 31 Oct 2024 11:13:57 +0800
Subject: [PATCH] mpage clear
---
mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 67 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3ccee51adfbb..1fdc548c4275 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6769,6 +6769,68 @@ static inline int process_huge_page(
return 0;
}
+#define MPAGE_NRPAGES (1<<4)
+#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
+static inline int clear_huge_page(
+ unsigned long addr_hint, unsigned int nr_pages,
+ int (*process_subpage)(unsigned long addr, int idx, void *arg),
+ void *arg)
+{
+ int i, n, base, l, ret;
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
+ unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
+
+ /* Process target subpage last to keep its cache lines hot */
+ might_sleep();
+ n = (addr_hint - addr) / MPAGE_SIZE;
+ if (2 * n <= nr_mpages) {
+ /* If target subpage in first half of huge page */
+ base = 0;
+ l = n;
+ /* Process subpages at the end of huge page */
+ for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ } else {
+ /* If target subpage in second half of huge page */
+ base = nr_mpages - 2 * (nr_mpages - n);
+ l = nr_mpages - n;
+ /* Process subpages at the begin of huge page */
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ }
+ /*
+ * Process remaining subpages in left-right-left-right pattern
+ * towards the target subpage
+ */
+ for (i = 0; i < l; i++) {
+ int left_idx = base + i;
+ int right_idx = base + 2 * l - 1 - i;
+
+ cond_resched();
+ ret = process_subpage(addr + left_idx * MPAGE_SIZE,
+ left_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ cond_resched();
+ ret = process_subpage(addr + right_idx * MPAGE_SIZE,
+ right_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
static void clear_gigantic_page(struct folio *folio, unsigned long addr,
unsigned int nr_pages)
{
@@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
static int clear_subpage(unsigned long addr, int idx, void *arg)
{
struct folio *folio = arg;
+ int i;
- clear_user_highpage(folio_page(folio, idx), addr);
+ for (i = 0; i < MPAGE_NRPAGES; i++)
+ clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
return 0;
}
@@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
unsigned long addr_hint)
{
unsigned int nr_pages = folio_nr_pages(folio);
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+ if (unlikely(nr_pages != HPAGE_PMD_NR))
clear_gigantic_page(folio, addr_hint, nr_pages);
else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
[1] fix patch
diff --git a/mm/memory.c b/mm/memory.c
index b22d4b83295b..aee99ede0c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
base = 0;
l = n;
/* Process subpages at the end of huge page */
- for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ for (i = 2 * n; i < nr_mpages; i++) {
cond_resched();
ret = process_subpage(addr + i * MPAGE_SIZE,
i * MPAGE_NRPAGES, arg);