On 2024/10/28 14:37, Kefeng Wang wrote:
On 2024/10/28 10:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/25 20:21, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/25 15:47, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/25 10:59, Huang, Ying wrote:
Hi, Kefeng,
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
+CC Huang Ying,
On 2024/10/23 6:56, Barry Song wrote:
On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
...
On 2024/10/17 23:09, Matthew Wilcox wrote:
On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
Directly use folio_zero_range() to clean up the code.
Are you sure there's no performance regression introduced by this?
clear_highpage() is often optimised in ways that we can't optimise for
a plain memset(). On the other hand, if the folio is large, maybe a
modern CPU will be able to do better than clear-one-page-at-a-time.
Right, I missed this; clear_page() might be better than memset(). I
changed this while looking at shmem_writepage(), which was already
converted from clear_highpage() to folio_zero_range(). I also grepped
for folio_zero_range() and there are some other callers that zero a
whole folio with it:
fs/bcachefs/fs-io-buffered.c:   folio_zero_range(folio, 0, folio_size(folio));
fs/bcachefs/fs-io-buffered.c:   folio_zero_range(f, 0, folio_size(f));
fs/bcachefs/fs-io-buffered.c:   folio_zero_range(f, 0, folio_size(f));
fs/libfs.c:                     folio_zero_range(folio, 0, folio_size(folio));
fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
mm/page_io.c:                   folio_zero_range(folio, 0, folio_size(folio));
mm/shmem.c:                     folio_zero_range(folio, 0, folio_size(folio));
IOW, what performance testing have you done with this patch?
No performance test before, but I wrote a test case:

1) allocate N large folios (folio_alloc(PMD_ORDER))
2) measure the time (us) to clear all N folios with
   clear_highpage/folio_zero_range/folio_zero_user
3) release the N folios

The results (5 runs) on my machine are shown below.
N=1
      clear_highpage   folio_zero_range   folio_zero_user
1     69               74                 177
2     57               62                 168
3     54               58                 234
4     54               58                 157
5     56               62                 148
avg   58               62.8               176.8
N=100
      clear_highpage   folio_zero_range   folio_zero_user
1     11015            11309              32833
2     10385            11110              49751
3     10369            11056              33095
4     10332            11017              33106
5     10483            11000              49032
avg   10516.8          11098.4            39563.4
N=512
      clear_highpage   folio_zero_range   folio_zero_user
1     55560            60055              156876
2     55485            60024              157132
3     55474            60129              156658
4     55555            59867              157259
5     55528            59932              157108
avg   55520.4          60001.4            157006.6
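(For reference, the timing loop behind these numbers is roughly the
following; this is only a sketch of the methodology described above,
not the actual test code, and folio allocation error handling is
omitted.)

#include <linux/highmem.h>
#include <linux/ktime.h>
#include <linux/mm.h>

/* Time how long it takes to zero nr large folios of the given order. */
static s64 time_zero_folios(unsigned int nr, unsigned int order)
{
	struct folio **folios;
	unsigned int i;
	ktime_t start;
	s64 us;
	long j;

	folios = kvmalloc_array(nr, sizeof(*folios), GFP_KERNEL);
	if (!folios)
		return 0;

	for (i = 0; i < nr; i++)
		folios[i] = folio_alloc(GFP_KERNEL, order);	/* 1) allocate */

	start = ktime_get();
	for (i = 0; i < nr; i++) {
		/*
		 * 2) method under test: swap in folio_zero_range() or
		 *    folio_zero_user() here for the other columns.
		 */
		for (j = 0; j < folio_nr_pages(folios[i]); j++)
			clear_highpage(folio_page(folios[i], j));
	}
	us = ktime_us_delta(ktime_get(), start);

	for (i = 0; i < nr; i++)
		folio_put(folios[i]);	/* 3) release */
	kvfree(folios);

	return us;
}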
folio_zero_user() calls cond_resched() many times, so its timing
fluctuates a lot, and clear_highpage() is better than
folio_zero_range(), as you said.

Maybe add a new helper to convert all folio_zero_range(folio, 0,
folio_size(folio)) callers to use clear_highpage() + flush_dcache_folio()?
If this also improves performance for other existing callers of
folio_zero_range(), then that's a positive outcome.
...
hi Kefeng, what's your point? providing a helper like clear_highfolio() or similar?
Yes, from the test above, using clear_highpage()/flush_dcache_folio()
is better than using folio_zero_range() to zero a folio (especially a
large folio), so I'd like to add a new helper, maybe named folio_zero()
since it zeroes the whole folio.
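A minimal sketch of what such a helper might look like (the name
folio_zero() is only the proposal above, and whether the dcache flush
belongs inside the helper or stays with the callers is an open
question):

#include <linux/highmem.h>
#include <linux/mm.h>

/*
 * Zero an entire folio page by page, using the architecture's
 * optimised clear_page() via clear_highpage(), then flush the
 * dcache once for the whole folio.
 */
static inline void folio_zero(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		clear_highpage(folio_page(folio, i));
	flush_dcache_folio(folio);
}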
We already have a helper like folio_zero_user()? Is it not good enough?
Since it calls cond_resched() many times, its performance is the worst...
Not exactly? It should have zero cost for a preemptible kernel. For a
non-preemptible kernel, it helps keep the folio clearing from occupying
the CPU and starving other processes, right?
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		 * it now, lest undo on failure cancel our earlier guarantee.
 		 */
 		if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
-			long i, n = folio_nr_pages(folio);
-
-			for (i = 0; i < n; i++)
-				clear_highpage(folio_page(folio, i));
+			folio_zero_user(folio, vmf->address);
 			flush_dcache_folio(folio);
 			folio_mark_uptodate(folio);
 		}
Do we perform better or worse with the following?
Here it is for SGP_FALLOC, where vmf is NULL, so we could use
folio_zero_user(folio, 0). I think the performance would be worse; I
will retest once I can access the hardware.
Perhaps, since the current code uses clear_highpage(). Does using
index << PAGE_SHIFT as the addr_hint offer any benefit?
When using folio_zero_user(), the performance is very bad with the
above fallocate test (mount huge=always):

       folio_zero_range   clear_highpage   folio_zero_user
real   0m1.214s           0m1.111s         0m3.159s
user   0m0.000s           0m0.000s         0m0.000s
sys    0m1.210s           0m1.109s         0m3.152s

I tried with addr_hint = 0 and index << PAGE_SHIFT, no obvious
difference.
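(The two hint variants tried here would look roughly like this in the
SGP_FALLOC path, where there is no faulting user address; sketch only.)

	/* pass 0 as the hint, so the first subpage ends up as the target */
	folio_zero_user(folio, 0);

	/* or derive a hint from the file offset of the folio */
	folio_zero_user(folio, (unsigned long)index << PAGE_SHIFT);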
Interesting. Does your kernel have preemption disabled or preemption_debug enabled?
ARM64 server, CONFIG_PREEMPT_NONE=y
This explains why the performance is much worse.
If not, it makes me wonder whether folio_zero_user() in
alloc_anon_folio() is actually improving performance as expected,
compared to the simpler folio_zero() you plan to implement. :-)
Yes, maybe. folio_zero_user() (previously clear_huge_page()) dates back
to commit 47ad8475c000 ("thp: clear_copy_huge_page"), so the original
clear_huge_page() was used for HugeTLB, where clearing a PUD-sized page
can take a long time. But for PMD-sized or other large folios, the
cond_resched() is not necessary, since we already have several
folio_zero_range() users clearing large folios and no issue has been
reported.
Probably worth an optimization; calling cond_resched() for each page
seems too aggressive and useless.
After some tests, I think cond_resched() is not the root cause: there
is no performance gain with a batched cond_resched(), and even if I
remove cond_resched() from process_huge_page() entirely, there is no
improvement.
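("Batched cond_resched()" here means something like the following
sketch; the batch size of 16 pages is arbitrary.)

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>

static void clear_folio_batched_resched(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++) {
		clear_highpage(folio_page(folio, i));
		/* reschedule every 16 pages instead of after every page */
		if ((i & 15) == 15)
			cond_resched();
	}
}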
But when I unconditionally use clear_gigantic_page() in
folio_zero_user() (patched), there is a big improvement with the above
fallocate test on tmpfs (mount huge=always). I also tested some other
test cases:
1) case-anon-w-seq-mt: (2M PMD THP)

base:
  real  0m2.490s    0m2.254s    0m2.272s
  user  1m59.980s   2m23.431s   2m18.739s
  sys   1m3.675s    1m15.462s   1m15.030s
patched:
  real  0m2.234s    0m2.225s    0m2.159s
  user  2m56.105s   2m57.117s   3m0.489s
  sys   0m17.064s   0m17.564s   0m16.150s

The patched kernel wins on sys time and loses on user time, but real
time is almost the same, maybe a little better than base.
We can see the user time difference. That means the original cache-hot
behavior still applies on your system. However, it appears that
clearing pages from the end to the beginning performs really badly on
your system. So, I suggest revising the current implementation to use
sequential clearing as much as possible.
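(One way to read that suggestion, as a sketch only and not a tested
patch: clear the subpages in increasing address order and only keep the
target subpage, derived from addr_hint, for last so that its cache
lines stay hot. base_addr here is assumed to be the user address of the
first subpage.)

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>

static void clear_folio_mostly_sequential(struct folio *folio,
					  unsigned long addr_hint,
					  unsigned long base_addr)
{
	long i, nr = folio_nr_pages(folio);
	long target = (addr_hint - base_addr) / PAGE_SIZE;

	for (i = 0; i < nr; i++) {
		if (i == target)
			continue;	/* cleared last, below */
		clear_user_highpage(folio_page(folio, i),
				    base_addr + i * PAGE_SIZE);
		cond_resched();
	}
	clear_user_highpage(folio_page(folio, target),
			    base_addr + target * PAGE_SIZE);
}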
I tested case-anon-cow-seq-hugetlb for copy_user_large_folio():

base:
  real  0m6.259s    0m6.197s    0m6.316s
  user  1m31.176s   1m27.195s   1m29.594s
  sys   7m44.199s   7m51.490s   8m21.149s
patched (use copy_user_gigantic_page() for 2M hugetlb too):
  real  0m3.182s    0m3.002s    0m2.963s
  user  1m19.456s   1m3.107s    1m6.447s
  sys   2m59.222s   3m10.899s   3m1.027s

Sequential copying is better than the current implementation, so I will
use sequential clear and copy.
Sorry, it appears that you misunderstood my suggestion. I suggest
revising process_huge_page() to use more sequential memory clearing and
copying to improve its performance on your platform.
--
Best Regards,
Huang, Ying
2) case-anon-w-seq-hugetlb: (2M PMD HugeTLB)

base:
  real  0m5.175s    0m5.117s    0m4.856s
  user  5m15.943s   5m7.567s    4m29.273s
  sys   2m38.503s   2m21.949s   2m21.252s
patched:
  real  0m4.966s    0m4.841s    0m4.561s
  user  6m30.123s   6m9.516s    5m49.733s
  sys   0m58.503s   0m47.847s   0m46.785s

This case is similar to case 1.
3) fallocate 20G on hugetlbfs (2M PMD HugeTLB)

base:
  real  0m3.016s   0m3.019s   0m3.018s
  user  0m0.000s   0m0.000s   0m0.000s
  sys   0m3.009s   0m3.012s   0m3.010s
patched:
  real  0m1.136s   0m1.136s   0m1.136s
  user  0m0.000s   0m0.000s   0m0.004s
  sys   0m1.133s   0m1.133s   0m1.129s

There is a big win on the patched kernel, and it is similar to the
tmpfs test above, so maybe we could revert commit c79b57e462b5 ("mm:
hugetlb: clear target sub-page last when clearing huge page").
I tried the following changes,
diff --git a/mm/memory.c b/mm/memory.c
index 66cf855dee3f..e5cc75adfa10 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6777,7 +6777,7 @@ static inline int process_huge_page(
 	base = 0;
 	l = n;
 	/* Process subpages at the end of huge page */
-	for (i = nr_pages - 1; i >= 2 * n; i--) {
+	for (i = 2 * n; i < nr_pages; i++) {
 		cond_resched();
 		ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
 		if (ret)
Since n = 0, the copying is now from start to end, but there is no
improvement for case-anon-cow-seq-hugetlb; and if I use
copy_user_gigantic_page(), the time is reduced from 6s to 3s.
diff --git a/mm/memory.c b/mm/memory.c
index fe21bd3beff5..2c6532d21d84 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst, struct folio *src,
 		.vma = vma,
 	};
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
-
-	return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
+	return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
 }
It appears that we have a code generation issue here. Can you check it?
Is the code inlined in the same way?
No difference, and I checked the asm: both process_huge_page() and
copy_user_gigantic_page() are inlined. It is strange...
It's not inlined in my configuration, and the __always_inline below
changes that for me. If it's already inlined and the code is actually
almost the same, why is there a difference? Is it possible for you to
do some profiling or further analysis?
Yes, will continue to debug this.
My bad, I had some refactoring patches applied before using
copy_user_large_folio():

ba3fda2a7b08 mm: use copy_user_large_page  // good performance
a88666ae8f4d mm: call might_sleep() in folio_zero/copy_user()
3ab7d4d405e9 mm: calculate the base address in the folio_zero/copy_user()
7b240664c07d mm: convert to folio_copy_user()  // I made a mistake and used dst instead of src in copy_user_gigantic_page()
1a951e310aa9 mm: use aligned address in copy_user_gigantic_page()
e095ce052607 mm: use aligned address in clear_gigantic_page()

so please ignore the copy test results (case-anon-cow-seq-hugetlb).
In summary:

1) for copying, there is no obvious difference between
   copy_user_large_folio()/process_huge_page() (copying from last to
   first or from first to last);
2) for clearing, clear_gigantic_page() is better than
   process_huge_page() on my machine, and after clearing pages from
   first to last (currently it processes pages from last to first), the
   performance is the same as clear_gigantic_page().
Maybe we can start with
modified mm/memory.c
@@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
* operation. The target subpage will be processed last to keep its
* cache lines hot.
*/
-static inline int process_huge_page(
+static __always_inline int process_huge_page(
unsigned long addr_hint, unsigned int nr_pages,
int (*process_subpage)(unsigned long addr, int idx, void *arg),
void *arg)
--
Best Regards,
Huang, Ying