On Mon, Jun 17, 2024 at 4:00 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Mon, 17 Jun 2024, Chris Li wrote:
> > On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@xxxxxxxxxx> wrote:
> > > > >
> > > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > > operate OK?
> > > > >
> > > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > > in is relatively new, it is not really a regression.
> > > > Sure, but it's pretty bad to merge a new feature only to have it
> > > > ineffective after a few hours use.
> ....
> > >
> > > $ /home/barry/develop/linux/mthp_swpout_test -s
> > >
> > > [ 1013.535798] ------------[ cut here ]------------
> > > [ 1013.538886] expecting order 4 got 0
> > This warning means there is a bug in this series somewhere I need to hunt down.
> > The V1 has the same warning but I haven't heard it get triggered in
> > V1, it is something new in V2.
> >
> > Andrew, please consider removing the series from mm-unstable until I
> > resolve this warning assert.
>
> Agreed: I was glad to see it go into mm-unstable last week, that made
> it easier to include in testing (or harder to avoid!), but my conclusion
> is that it's not ready yet (and certainly not suitable for 6.10 hotfix).
>
> I too saw this "expecting order 4 got 0" once-warning every boot (from
> ordinary page reclaim rather than from madvise_pageout shown below),
> shortly after starting my tmpfs swapping load. But I never saw any bad
> effect immediately after it: actual crashes came a few minutes later.
>
> (And I'm not seeing the warning at all now, with the change I made: that
> doesn't tell us much, since what I have leaves out 2/2 entirely; but it
> does suggest that it's more important to follow up the crashes, and
> maybe when they are satisfactorily fixed, the warning will be fixed too.)
>
> Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
> in alloc_cluster(). And when I poked around, it was usually (always?)
> the case that si->free_clusters was empty, so list_first_entry() not
> good at all. A few other crashes were GPFs, but I didn't pay much
> attention to them, thinking the alloc_cluster() one best to pursue.
>
> I reverted both patches from mm-everything, and had no problem.
> I added back 1/2 expecting it to be harmless ("no real function
> change in this patch"), but was surprised to get those same
> "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
> so have spent most time trying to get 1/2 working.
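To make the failure mode described above concrete, here is a minimal,
self-contained userspace model of the double removal. It is an illustration
only, not kernel code: the list helpers imitate <linux/list.h>, and the two
steps stand in for scan_swap_map_try_ssd_cluster() and alloc_cluster(). Once
the try_ssd-style step has already unlinked a cluster from free_clusters, an
alloc_cluster-style list_first_entry() for the same idx either picks up a
different cluster (tripping the idx check), or, if that was the last free
cluster, hands back the list head itself, which is the GPF flavour.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the <linux/list.h> doubly linked list helpers. */
struct list_head { struct list_head *prev, *next; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Toy version of swap_cluster_info: just the list linkage. */
struct cluster { struct list_head list; };

#define NR_CLUSTERS 2
static struct cluster cluster_info[NR_CLUSTERS];
static struct list_head free_clusters;

int main(void)
{
	struct cluster *ci, *again;
	long idx;
	int i;

	INIT_LIST_HEAD(&free_clusters);
	for (i = 0; i < NR_CLUSTERS; i++)
		list_add_tail(&cluster_info[i].list, &free_clusters);

	/* "try_ssd" step: take the first free cluster and unlink it already. */
	ci = list_first_entry(&free_clusters, struct cluster, list);
	idx = ci - cluster_info;
	list_del(&ci->list);

	/*
	 * Conflict fallback, "alloc_cluster" step for the same idx: the first
	 * free entry is now a different cluster (or, had NR_CLUSTERS been 1,
	 * the list would be empty and list_first_entry() would return the
	 * list head itself, i.e. not a cluster at all).
	 */
	again = list_first_entry(&free_clusters, struct cluster, list);
	printf("expected idx %ld, first free entry is now idx %ld, empty=%d\n",
	       idx, (long)(again - cluster_info), list_empty(&free_clusters));

	assert(again - cluster_info == idx);	/* fires, like the VM_BUG_ON */
	return 0;
}

Built and run as an ordinary C program, the assert aborts, mirroring the
VM_BUG_ON; with a single cluster the same sequence reaches the empty-list
case Hugh observed.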
>
> This patch on top of 1/2, restoring when cluster_is_free(ci) can
> be seen to change, appears to have eliminated all those problems:
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
>  
>  	VM_BUG_ON(ci - si->cluster_info != idx);
>  	list_del(&ci->list);
> +	ci->state = CLUSTER_STATE_PER_CPU;
>  	ci->count = 0;
>  	return ci;
>  }
> @@ -543,10 +544,6 @@ new_cluster:
>  	if (tmp == SWAP_NEXT_INVALID) {
>  		if (!list_empty(&si->free_clusters)) {
>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -			list_del(&ci->list);
> -			spin_lock(&ci->lock);
> -			ci->state = CLUSTER_STATE_PER_CPU;
> -			spin_unlock(&ci->lock);
>  			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>  		} else if (!list_empty(&si->discard_clusters)) {
>  			/*
>

Thanks for the nice bug report. That is my bad. Both you and Ying pointed
out the critical bug here: the cluster is already removed from the free list
inside try_ssd(), and in the conflict() failure case that is followed by
alloc_cluster(), which allocates from the cluster and can remove the same
cluster from the list again. That is the path I had not considered carefully
enough.

All of this attempting the allocation in try_ssd(), possibly hitting a
conflict, and then redoing the dance in alloc_cluster() makes things very
complicated. In try_ssd(), when we already hold the cluster lock, can we just
perform the actual allocation with the lock held? There should be no conflict
under the cluster lock protection, right? (A rough sketch of that direction
is included after Hugh's notes below.)

Chris

> Delighted to have made progress after many attempts, I went to apply 2/2
> on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
> changes I've undone. I gave up at that point and hand back to you, Chris,
> hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
> much better than I ever shall!
>
> Clarifications on my load: all swapping to SSD, but discard not enabled;
> /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
> hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
> swapoff between iterations, did not appear relevant to problems; x86_64.
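Picking up Chris's question above (whether try_ssd() could simply do the real
allocation while it already holds the cluster lock), here is a rough,
hypothetical userspace sketch of that shape. cluster_claim_and_alloc() is an
invented name, a pthread mutex stands in for ci->lock, and a boolean stands in
for the free-list linkage, so this only illustrates the locking idea, not the
kernel implementation: the unlink from the free list, the transition to
CLUSTER_STATE_PER_CPU and the count update all happen in one critical section,
so no later or concurrent path can see the cluster as free and unlink it a
second time.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum cluster_state { CLUSTER_STATE_FREE, CLUSTER_STATE_PER_CPU };

/* Toy cluster: the bool stands in for being linked on si->free_clusters. */
struct cluster {
	pthread_mutex_t lock;		/* stands in for ci->lock */
	enum cluster_state state;
	int count;			/* swap entries handed out */
	bool on_free_list;
};

/*
 * Claim a free cluster and allocate nr entries from it, all under ci->lock.
 * Because the free-list removal and the state change happen together, a
 * second caller simply sees "not free" and backs off; nothing ever unlinks
 * the same cluster twice.
 */
static bool cluster_claim_and_alloc(struct cluster *ci, int nr)
{
	bool ok = false;

	pthread_mutex_lock(&ci->lock);
	if (ci->state == CLUSTER_STATE_FREE) {
		ci->on_free_list = false;	/* the only unlink point */
		ci->state = CLUSTER_STATE_PER_CPU;
		ci->count += nr;
		ok = true;
	}
	pthread_mutex_unlock(&ci->lock);
	return ok;
}

int main(void)
{
	struct cluster c = { .state = CLUSTER_STATE_FREE, .on_free_list = true };

	pthread_mutex_init(&c.lock, NULL);

	printf("first claim:  %s\n", cluster_claim_and_alloc(&c, 16) ? "ok" : "refused");
	printf("second claim: %s\n", cluster_claim_and_alloc(&c, 16) ? "ok" : "refused");
	printf("still on free list: %d\n", c.on_free_list);
	return 0;
}

This only shows the direction Chris is asking about; whether it fits the real
scan_swap_map_try_ssd_cluster() / scan_swap_map_ssd_cluster_conflict() call
structure is exactly the open question in the thread.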
>
> Hugh
> >
> > > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.544460] Modules linked in:
> > > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.548263] sp : ffff8000863e32e0
> > > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > > [ 1013.557627] Call trace:
> > > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > > [ 1013.559544]  add_to_swap+0x40/0xf0
> > > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > > [ 1013.565333]  invoke_syscall+0x50/0x128
> > > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > > [ 1013.566285]  do_el0_svc+0x28/0x40
> > > [ 1013.566667]  el0_svc+0x50/0x150
> > > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > > [ 1013.568058] irq event stamp: 0
> > > [ 1013.568661] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570167] softirqs last enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.571330] ---[ end trace 0000000000000000 ]---