Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Mon, 17 Jun 2024, Chris Li wrote:
> On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > operate OK?
> > > >
> > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > into 4K pages before the swap out, aka the fallback. The swap out and
> > > > swap in can then proceed as 4K pages, not as the mTHP. Due to the
> > > > fallback, the mTHP-based zsmalloc compression with a 64K buffer will
> > > > not happen. That is the effect of the fallback. But mTHP swap out and
> > > > swap in is relatively new, so this is not really a regression.
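
[A sketch for illustration, not the series' code: the fallback Chris
describes is roughly what mm/vmscan.c:shrink_folio_list() already does
when swap allocation fails for a large folio (labels simplified here):

	if (!add_to_swap(folio)) {
		if (!folio_test_large(folio))
			goto activate_locked;
		/* fall back: split the mTHP, retry as 4K base pages */
		if (split_folio_to_list(folio, folio_list))
			goto activate_locked;
		if (!add_to_swap(folio))
			goto activate_locked;
	}

so reclaim still makes progress, just without the large swap entry.]
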
> > >
> > > Sure, but it's pretty bad to merge a new feature only to have it
> > > ineffective after a few hours use.
....
> > >
> > $ /home/barry/develop/linux/mthp_swpout_test -s
> >
> > [ 1013.535798] ------------[ cut here ]------------
> > [ 1013.538886] expecting order 4 got 0
> 
> This warning means there is a bug somewhere in this series that I need
> to hunt down. V1 has the same warning, but I never heard of it being
> triggered there: it is something new in V2.
> 
> Andrew, please consider removing the series from mm-unstable until I
> resolve this warning.

Agreed: I was glad to see it go into mm-unstable last week, since that
made it easier to include in testing (or harder to avoid!), but my
conclusion is that it's not ready yet (and certainly not suitable as a
6.10 hotfix).

I too saw this "expecting order 4 got 0" warning once per boot (from
ordinary page reclaim rather than from the madvise_pageout path shown
below), shortly after starting my tmpfs swapping load. But I never saw
any bad effect immediately after it: the actual crashes came a few
minutes later.

(And I'm not seeing the warning at all now, with the change I made: that
doesn't tell us much, since what I have leaves out 2/2 entirely; but it
does suggest that it's more important to follow up the crashes, and
maybe when they are satisfactorily fixed, the warning will be fixed too.)

Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
in alloc_cluster(). And when I poked around, it was usually (always?)
the case that si->free_clusters was empty, so list_first_entry() not
good at all. A few other crashes were GPFs, but I didn't pay much
attention to them, thinking the alloc_cluster() one best to pursue.
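
(To spell out why that is so bad: list_first_entry() on an empty list
hands back the list head itself, cast to struct swap_cluster_info, so
ci points into struct swap_info_struct and (ci - si->cluster_info) is
nonsense. A guarded form, for illustration only:

	struct swap_cluster_info *ci;

	ci = list_first_entry_or_null(&si->free_clusters,
				      struct swap_cluster_info, list);
	if (!ci)
		return NULL;

would only paper over it: the real question is how free_clusters came
to be empty there at all.)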

I reverted both patches from mm-everything, and had no problem.
I added back 1/2, expecting it to be harmless ("no real function
change in this patch"), but was surprised to get those same
"expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
so I have spent most time trying to get 1/2 working.

This patch on top of 1/2, restoring when cluster_is_free(ci) can
be seen to change, appears to have eliminated all those problems:

--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
 
 	VM_BUG_ON(ci - si->cluster_info != idx);
 	list_del(&ci->list);
+	ci->state = CLUSTER_STATE_PER_CPU;
 	ci->count = 0;
 	return ci;
 }
@@ -543,10 +544,6 @@ new_cluster:
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-			list_del(&ci->list);
-			spin_lock(&ci->lock);
-			ci->state = CLUSTER_STATE_PER_CPU;
-			spin_unlock(&ci->lock);
 			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
 		} else if (!list_empty(&si->discard_clusters)) {
 			/*
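
(My reading of what this restores, hedged, since Chris knows the intent
better: the per-cpu path goes back to only peeking at the head of
free_clusters, leaving both the list_del() and the state change to
alloc_cluster(), in one si->lock section, when a slot is actually
allocated. The resulting alloc_cluster(), with its signature as I infer
it from the hunk above:

	static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si,
						       unsigned long idx)
	{
		struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
						struct swap_cluster_info, list);

		VM_BUG_ON(ci - si->cluster_info != idx);
		list_del(&ci->list);
		ci->state = CLUSTER_STATE_PER_CPU;	/* flips with the list_del */
		ci->count = 0;
		return ci;
	}

so cluster_is_free(ci) changes at exactly the moment the cluster leaves
the free list, never before.)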

Delighted to have made progress after many attempts, I went to apply 2/2
on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
changes I've undone. I gave up at that point and hand back to you, Chris,
hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
much better than I ever shall!

Clarifications on my load: all swapping to SSD, but discard not enabled;
/sys/kernel/mm/transparent_hugepage/enabled always, shmem_enabled force,
hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
swapoff between iterations (which did not appear relevant to the
problems); x86_64.

Hugh

> 
> > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.544460] Modules linked in:
> > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.548263] sp : ffff8000863e32e0
> > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > [ 1013.557627] Call trace:
> > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > [ 1013.559544]  add_to_swap+0x40/0xf0
> > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > [ 1013.565333]  invoke_syscall+0x50/0x128
> > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > [ 1013.566285]  do_el0_svc+0x28/0x40
> > [ 1013.566667]  el0_svc+0x50/0x150
> > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > [ 1013.568058] irq event stamp: 0
> > [ 1013.568661] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570167] softirqs last  enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [ 1013.571330] ---[ end trace 0000000000000000 ]---
