On 2024/10/18 13:23, Barry Song wrote:
On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
On 2024/10/17 23:09, Matthew Wilcox wrote:
On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
Directly use folio_zero_range() to clean up the code.
Are you sure there's no performance regression introduced by this?
clear_highpage() is often optimised in ways that we can't optimise for
a plain memset(). On the other hand, if the folio is large, maybe a
modern CPU will be able to do better than clear-one-page-at-a-time.
Right, I missed this; clear_page() might be better than memset(). I changed
this one while looking at shmem_writepage(), which was already converted
from clear_highpage() to folio_zero_range(). I also grepped for
folio_zero_range(), and there are some other callers that zero a whole
folio with it:
fs/bcachefs/fs-io-buffered.c:	folio_zero_range(folio, 0, folio_size(folio));
fs/bcachefs/fs-io-buffered.c: folio_zero_range(f, 0, folio_size(f));
fs/bcachefs/fs-io-buffered.c: folio_zero_range(f, 0, folio_size(f));
fs/libfs.c: folio_zero_range(folio, 0, folio_size(folio));
fs/ntfs3/frecord.c: folio_zero_range(folio, 0, folio_size(folio));
mm/page_io.c: folio_zero_range(folio, 0, folio_size(folio));
mm/shmem.c: folio_zero_range(folio, 0, folio_size(folio));
IOW, what performance testing have you done with this patch?
I had not done performance testing before, but I wrote a testcase that does
the following (a minimal sketch of the clear_highpage() variant follows the
steps):
1) allocate N large folios (folio_alloc(PMD_ORDER))
2) measure the elapsed time (in us) to clear all N folios with each of
   clear_highpage()/folio_zero_range()/folio_zero_user()
3) release the N folios
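
The original test code was not posted, so the following is only a minimal
sketch of the clear_highpage() case, assuming a kernel-module context (the
function name bench_clear_highpage() and its structure are illustrative):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/ktime.h>
#include <linux/mm.h>
#include <linux/slab.h>

static s64 bench_clear_highpage(unsigned int nr)
{
	struct folio **folios;
	ktime_t start;
	unsigned int i;
	long j;
	s64 us;

	folios = kcalloc(nr, sizeof(*folios), GFP_KERNEL);
	if (!folios)
		return 0;

	/* 1) allocate N large (PMD-order) folios */
	for (i = 0; i < nr; i++)
		folios[i] = folio_alloc(GFP_KERNEL, PMD_ORDER);

	/* 2) time clearing all N folios, one page at a time */
	start = ktime_get();
	for (i = 0; i < nr; i++) {
		if (!folios[i])
			continue;
		for (j = 0; j < folio_nr_pages(folios[i]); j++)
			clear_highpage(folio_page(folios[i], j));
		flush_dcache_folio(folios[i]);
	}
	us = ktime_us_delta(ktime_get(), start);

	/* 3) release the N folios */
	for (i = 0; i < nr; i++)
		if (folios[i])
			folio_put(folios[i]);
	kfree(folios);

	return us;
}

The folio_zero_range() and folio_zero_user() variants would differ only in
the body of the timed loop.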
The results (5 runs each) on my machine are shown below.
N=1 (times in us):
      clear_highpage  folio_zero_range  folio_zero_user
1                 69                74              177
2                 57                62              168
3                 54                58              234
4                 54                58              157
5                 56                62              148
avg               58              62.8            176.8

N=100:
      clear_highpage  folio_zero_range  folio_zero_user
1              11015             11309            32833
2              10385             11110            49751
3              10369             11056            33095
4              10332             11017            33106
5              10483             11000            49032
avg          10516.8           11098.4          39563.4

N=512:
      clear_highpage  folio_zero_range  folio_zero_user
1              55560             60055           156876
2              55485             60024           157132
3              55474             60129           156658
4              55555             59867           157259
5              55528             59932           157108
avg          55520.4           60001.4         157006.6
folio_zero_user() makes many cond_resched() calls, so its times fluctuate a
lot; clear_highpage() is better than folio_zero_range(), as you said.
Maybe we should add a new helper to convert all
folio_zero_range(folio, 0, folio_size(folio)) callers to use
clear_highpage() + flush_dcache_folio()?
If this also improves performance for other existing callers of
folio_zero_range(), then that's a positive outcome.
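
For reference, a minimal sketch of such a helper, mirroring the loop the
patch below removes (the name folio_zero() is hypothetical, not an
existing API):

/* Hypothetical helper: zero a whole folio via the per-page
 * clear_highpage() fast path rather than the plain memset() behind
 * folio_zero_range().
 */
static inline void folio_zero(struct folio *folio)
{
	long i, n = folio_nr_pages(folio);

	for (i = 0; i < n; i++)
		clear_highpage(folio_page(folio, i));
	flush_dcache_folio(folio);
}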
The fallocate test:

rm -f /tmp/test && fallocate -l 20G /tmp/test && \
	fallocate -d -l 20G /tmp/test && time fallocate -l 20G /tmp/test
1) mounted with huge=always (2M folios):

        with patch    without patch
real    0m1.214s      0m1.111s
user    0m0.000s      0m0.000s
sys     0m1.210s      0m1.109s
With this patch there is a performance regression: folio_zero_range() is
worse than clear_highpage() + flush_dcache_folio().
with patch:
  99.89%  0.00%  fallocate  [kernel.vmlinux]  [k] shmem_fallocate
     shmem_fallocate
   - shmem_get_folio_gfp
        90.12% __memset
      - 9.42% zero_user_segments.constprop.0
           8.16% flush_dcache_page
           1.03% flush_dcache_folio

without patch:
  99.95%  0.00%  fallocate  [kernel.vmlinux]  [k] vfs_fallocate
     vfs_fallocate
   - shmem_fallocate
        98.54% __pi_clear_page
      - 1.38% shmem_get_folio_gfp
           filemap_get_entry
2) mounted with huge=never (4K folios):

        with patch    without patch
real    0m3.159s      0m3.176s
user    0m0.000s      0m0.000s
sys     0m3.150s      0m3.169s
But here the patch improves performance a little: folio_zero_range() is
better than clear_highpage() + flush_dcache_folio().
with patch:
  97.77%  3.37%  fallocate  [kernel.vmlinux]  [k] shmem_fallocate
   - 94.40% shmem_fallocate
      - 93.70% shmem_get_folio_gfp
           66.60% __memset
         - 7.43% filemap_get_entry
              3.49% xas_load
           1.32% zero_user_segments.constprop.0

without patch:
  97.82%  3.22%  fallocate  [kernel.vmlinux]  [k] shmem_fallocate
   - 94.61% shmem_fallocate
        68.18% __pi_clear_page
      - 25.60% shmem_get_folio_gfp
         - 7.64% filemap_get_entry
              3.51% xas_load
 	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
-		long i, n = folio_nr_pages(folio);
-
-		for (i = 0; i < n; i++)
-			clear_highpage(folio_page(folio, i));
-		flush_dcache_folio(folio);
+		folio_zero_range(folio, 0, folio_size(folio));
 		folio_mark_uptodate(folio);
 	}
Thanks
Barry