On 2024/11/1 16:16, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/31 16:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
[snip]
1) Will run some random-access tests to check the difference in performance, as David suggested.
2) Hope LKP can run more tests, since it is very useful (more test sets and different machines).
I'm starting to use LKP to test.
Great.
Sorry for the late reply.
I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing pages from start to end or from end to start on an Intel server CPU. I guess there's some similar optimization for both directions.
For the multi-process (one process per logical CPU) vm-scalability/anon-w-seq test case, the benchmark score increases by about 22.4%.
So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is, because the performance of forward and backward clearing is almost the same, and the user-space accesses get the cache-hot benefit.
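For reference, here is a minimal user-space sketch (my own illustration, not vm-scalability itself; the 1GB size and the one-write-per-cache-line stride are arbitrary) of the kind of access pattern case-anon-w-seq exercises: fault in THP-backed anonymous memory and write it sequentially, so the cache lines that process_huge_page() clears last, around the faulting address, are still hot when the sequential writes continue.

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define SZ (1UL << 30)	/* 1GB of anonymous memory */

int main(void)
{
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, SZ, MADV_HUGEPAGE);	/* ask for THP */

	/*
	 * Sequential writes, one per 64-byte cache line; the first write
	 * to each huge-page region can fault in a THP and zero it via
	 * folio_zero_user().
	 */
	for (size_t i = 0; i < SZ; i += 64)
		buf[i] = 1;

	munmap(buf, SZ);
	return 0;
}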
Could you test the following case on x86?
echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
It's not trivial for me to do this test, because 0day wraps test cases. Do you know of an existing test case that provides this? For example, in vm-scalability?
I don't know of a public fallocate test; I will try to find an Intel machine to test this case.
For the multi-process vm-scalability/anon-w-rand test case, there is no measurable difference in the benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any regression, and it helps some workloads.
However, on ARM64, it does introduce some regression from clearing pages from end to start, and that needs to be addressed. I guess the regression can be resolved by doing more of the clearing from start to end (but not all of it). For example, can you take a look at the patch below? It uses a similar framework as before, but clears each small chunk (mpage) from start to end. You can adjust MPAGE_NRPAGES to check at which chunk size the regression goes away.
WARNING: the patch is only build tested.
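For reference, a quick back-of-envelope on the chunk geometry (assuming 4KB base pages and a 2MB PMD huge page):

MPAGE_SIZE = PAGE_SIZE * MPAGE_NRPAGES = 4KB * 16 = 64KB, so nr_mpages = 2MB / 64KB = 32 chunks per huge page
with MPAGE_NRPAGES = 64: MPAGE_SIZE = 4KB * 64 = 256KB, so nr_mpages = 2MB / 256KB = 8 chunks per huge page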
Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
What is case3?
Oh, it is Change2.
Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
1. For random write, case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference.
2. For sequential write,
1) case-anon-w-seq-mt:
Can you try case-anon-w-seq? That may be more stable.
base:
real 0m2.490s 0m2.254s 0m2.272s
user 1m59.980s 2m23.431s 2m18.739s
sys 1m3.675s 1m15.462s 1m15.030s
Change1:
real 0m2.234s 0m2.225s 0m2.159s
user 2m56.105s 2m57.117s 3m0.489s
sys 0m17.064s 0m17.564s 0m16.150s
Change2:
real 0m2.244s 0m2.384s 0m2.370s
user 2m39.413s 2m41.990s 2m42.229s
sys 0m19.826s 0m18.491s 0m18.053s
That appears strange: there's not much cache-hot benefit even if we clear pages from end to beginning (with a larger chunk). However, sys time improves a lot. This shows that clearing pages in larger chunks helps on ARM64.
Change3: // best performance
real 0m2.155s 0m2.204s 0m2.194s
user 3m2.640s 2m55.837s 3m0.902s
sys 0m17.346s 0m17.630s 0m18.197s
Change4:
real 0m2.287s 0m2.377s 0m2.284s
user 2m37.030s 2m52.868s 3m17.593s
sys 0m15.445s 0m34.430s 0m45.224s
Change4 is essentially the same as Change1. I don't know why they differ here. Is there large variation from run to run?
As shown above, I tested three times and the results are relatively stable, at least for real time. I will try case-anon-w-seq.
Can you further optimize the prototype patch below? I think that it has
potential to fix your issue.
Yes, thanks for your help, but this will make process_huge_page() a little more complicated :)
2) case-anon-w-seq-hugetlb
Very similar to 1); Change4 is slightly better than Change3, but the difference is not big.
3) hugetlbfs fallocate 20G
Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) < Change2 (0m1.275s) < base (0m3.016s)
In summary, Change3 is best and Change1 is good on my arm64 machine.
Best Regards,
Huang, Ying
-----------------------------------8<----------------------------------------
From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@xxxxxxxxx>
Date: Thu, 31 Oct 2024 11:13:57 +0800
Subject: [PATCH] mpage clear
---
mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 67 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3ccee51adfbb..1fdc548c4275 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6769,6 +6769,68 @@ static inline int process_huge_page(
return 0;
}
+#define MPAGE_NRPAGES (1<<4)
+#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
+static inline int clear_huge_page(
+ unsigned long addr_hint, unsigned int nr_pages,
+ int (*process_subpage)(unsigned long addr, int idx, void *arg),
+ void *arg)
+{
+ int i, n, base, l, ret;
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
+ unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
+
+ /* Process target subpage last to keep its cache lines hot */
+ might_sleep();
+ n = (addr_hint - addr) / MPAGE_SIZE;
+ if (2 * n <= nr_mpages) {
+ /* If target subpage in first half of huge page */
+ base = 0;
+ l = n;
+ /* Process subpages at the end of huge page */
+ for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ } else {
+ /* If target subpage in second half of huge page */
+ base = nr_mpages - 2 * (nr_mpages - n);
+ l = nr_mpages - n;
+ /* Process subpages at the begin of huge page */
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ ret = process_subpage(addr + i * MPAGE_SIZE,
+ i * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ }
+ /*
+ * Process remaining subpages in left-right-left-right pattern
+ * towards the target subpage
+ */
+ for (i = 0; i < l; i++) {
+ int left_idx = base + i;
+ int right_idx = base + 2 * l - 1 - i;
+
+ cond_resched();
+ ret = process_subpage(addr + left_idx * MPAGE_SIZE,
+ left_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ cond_resched();
+ ret = process_subpage(addr + right_idx * MPAGE_SIZE,
+ right_idx * MPAGE_NRPAGES, arg);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
static void clear_gigantic_page(struct folio *folio, unsigned long addr,
unsigned int nr_pages)
{
@@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
static int clear_subpage(unsigned long addr, int idx, void *arg)
{
struct folio *folio = arg;
+ int i;
- clear_user_highpage(folio_page(folio, idx), addr);
+ for (i = 0; i < MPAGE_NRPAGES; i++)
+ clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
return 0;
}
@@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio,
unsigned long addr_hint)
{
unsigned int nr_pages = folio_nr_pages(folio);
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+ if (unlikely(nr_pages != HPAGE_PMD_NR))
clear_gigantic_page(folio, addr_hint, nr_pages);
else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
[1] fix patch
diff --git a/mm/memory.c b/mm/memory.c
index b22d4b83295b..aee99ede0c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
base = 0;
l = n;
/* Process subpages at the end of huge page */
- for (i = nr_mpages - 1; i >= 2 * n; i--) {
+ for (i = 2 * n; i < nr_mpages; i++) {
cond_resched();
ret = process_subpage(addr + i * MPAGE_SIZE,
i * MPAGE_NRPAGES, arg);