On Mon, Jun 17, 2024 at 4:00 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Mon, 17 Jun 2024, Chris Li wrote:
> > On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@xxxxxxxxxx> wrote:
> > > > >
> > > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > > operate OK?
> > > > >
> > > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > > in is relatively new, it is not really a regression.
> > > > Sure, but it's pretty bad to merge a new feature only to have it
> > > > ineffective after a few hours use.
> ....
> > >
> > > $ /home/barry/develop/linux/mthp_swpout_test -s
> > >
> > > [ 1013.535798] ------------[ cut here ]------------
> > > [ 1013.538886] expecting order 4 got 0
> > This warning means there is a bug in this series somewhere I need to hunt down.
> > The V1 has the same warning but I haven't heard it get triggered in
> > V1, it is something new in V2.
> >
> > Andrew, please consider removing the series from mm-unstable until I
> > resolve this warning assert.
>
> Agreed: I was glad to see it go into mm-unstable last week, that made
> it easier to include in testing (or harder to avoid!), but my conclusion
> is that it's not ready yet (and certainly not suitable for 6.10 hotfix).
>
> I too saw this "expecting order 4 got 0" once-warning every boot (from
> ordinary page reclaim rather than from madvise_pageout shown below),
> shortly after starting my tmpfs swapping load. But I never saw any bad
> effect immediately after it: actual crashes came a few minutes later.
>
> (And I'm not seeing the warning at all now, with the change I made: that
> doesn't tell us much, since what I have leaves out 2/2 entirely; but it
> does suggest that it's more important to follow up the crashes, and
> maybe when they are satisfactorily fixed, the warning will be fixed too.)
>
> Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
> in alloc_cluster(). And when I poked around, it was usually (always?)
> the case that si->free_clusters was empty, so list_first_entry() not
> good at all. A few other crashes were GPFs, but I didn't pay much
> attention to them, thinking the alloc_cluster() one best to pursue.
>
> I reverted both patches from mm-everything, and had no problem.
> I added back 1/2 expecting it to be harmless ("no real function
> change in this patch"), but was surprised to get those same
> "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
> so have spent most time trying to get 1/2 working.
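To make the failure mode described above concrete, here is a minimal,
self-contained userspace model of the double removal. It is an illustration
only, not kernel code: the list helpers imitate <linux/list.h>, and the two
steps stand in for scan_swap_map_try_ssd_cluster() and alloc_cluster(). Once
the try_ssd-style step has already unlinked a cluster from free_clusters, an
alloc_cluster-style list_first_entry() for the same idx either picks up a
different cluster (tripping the idx check), or, if that was the last free
cluster, hands back the list head itself, which is the GPF flavour.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the <linux/list.h> doubly linked list helpers. */
struct list_head { struct list_head *prev, *next; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Toy version of swap_cluster_info: just the list linkage. */
struct cluster { struct list_head list; };

#define NR_CLUSTERS 2
static struct cluster cluster_info[NR_CLUSTERS];
static struct list_head free_clusters;

int main(void)
{
	struct cluster *ci, *again;
	long idx;
	int i;

	INIT_LIST_HEAD(&free_clusters);
	for (i = 0; i < NR_CLUSTERS; i++)
		list_add_tail(&cluster_info[i].list, &free_clusters);

	/* "try_ssd" step: take the first free cluster and unlink it already. */
	ci = list_first_entry(&free_clusters, struct cluster, list);
	idx = ci - cluster_info;
	list_del(&ci->list);

	/*
	 * Conflict fallback, "alloc_cluster" step for the same idx: the first
	 * free entry is now a different cluster (or, had NR_CLUSTERS been 1,
	 * the list would be empty and list_first_entry() would return the
	 * list head itself, i.e. not a cluster at all).
	 */
	again = list_first_entry(&free_clusters, struct cluster, list);
	printf("expected idx %ld, first free entry is now idx %ld, empty=%d\n",
	       idx, (long)(again - cluster_info), list_empty(&free_clusters));

	assert(again - cluster_info == idx);	/* fires, like the VM_BUG_ON */
	return 0;
}

Built and run as an ordinary C program, the assert aborts, mirroring the
VM_BUG_ON; with a single cluster the same sequence reaches the empty-list
case Hugh observed.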
>
> This patch on top of 1/2, restoring when cluster_is_free(ci) can
> be seen to change, appears to have eliminated all those problems:
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
>  
>  	VM_BUG_ON(ci - si->cluster_info != idx);
>  	list_del(&ci->list);
> +	ci->state = CLUSTER_STATE_PER_CPU;
>  	ci->count = 0;
>  	return ci;
>  }
> @@ -543,10 +544,6 @@ new_cluster:
>  	if (tmp == SWAP_NEXT_INVALID) {
>  		if (!list_empty(&si->free_clusters)) {
>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -			list_del(&ci->list);
> -			spin_lock(&ci->lock);
> -			ci->state = CLUSTER_STATE_PER_CPU;
> -			spin_unlock(&ci->lock);
>  			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>  		} else if (!list_empty(&si->discard_clusters)) {
>  			/*
>

Thanks for the nice bug report. That is my bad. Both you and Ying pointed
out the critical bug here: the cluster is already removed from the free list
inside try_ssd(), and in the conflict() failure case that is followed by
alloc_cluster(), which allocates from the cluster and can remove the same
cluster from the list again. That is the path I had not considered carefully
enough.

All of this attempting the allocation in try_ssd(), possibly hitting a
conflict, and then redoing the dance in alloc_cluster() makes things very
complicated. In try_ssd(), when we already hold the cluster lock, can we just
perform the actual allocation with the lock held? There should be no conflict
under the cluster lock protection, right? (A rough sketch of that direction
is included after Hugh's notes below.)

Chris

> Delighted to have made progress after many attempts, I went to apply 2/2
> on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
> changes I've undone. I gave up at that point and hand back to you, Chris,
> hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
> much better than I ever shall!
>
> Clarifications on my load: all swapping to SSD, but discard not enabled;
> /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
> hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
> swapoff between iterations, did not appear relevant to problems; x86_64.
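Picking up Chris's question above (whether try_ssd() could simply do the real
allocation while it already holds the cluster lock), here is a rough,
hypothetical userspace sketch of that shape. cluster_claim_and_alloc() is an
invented name, a pthread mutex stands in for ci->lock, and a boolean stands in
for the free-list linkage, so this only illustrates the locking idea, not the
kernel implementation: the unlink from the free list, the transition to
CLUSTER_STATE_PER_CPU and the count update all happen in one critical section,
so no later or concurrent path can see the cluster as free and unlink it a
second time.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum cluster_state { CLUSTER_STATE_FREE, CLUSTER_STATE_PER_CPU };

/* Toy cluster: the bool stands in for being linked on si->free_clusters. */
struct cluster {
	pthread_mutex_t lock;		/* stands in for ci->lock */
	enum cluster_state state;
	int count;			/* swap entries handed out */
	bool on_free_list;
};

/*
 * Claim a free cluster and allocate nr entries from it, all under ci->lock.
 * Because the free-list removal and the state change happen together, a
 * second caller simply sees "not free" and backs off; nothing ever unlinks
 * the same cluster twice.
 */
static bool cluster_claim_and_alloc(struct cluster *ci, int nr)
{
	bool ok = false;

	pthread_mutex_lock(&ci->lock);
	if (ci->state == CLUSTER_STATE_FREE) {
		ci->on_free_list = false;	/* the only unlink point */
		ci->state = CLUSTER_STATE_PER_CPU;
		ci->count += nr;
		ok = true;
	}
	pthread_mutex_unlock(&ci->lock);
	return ok;
}

int main(void)
{
	struct cluster c = { .state = CLUSTER_STATE_FREE, .on_free_list = true };

	pthread_mutex_init(&c.lock, NULL);

	printf("first claim:  %s\n", cluster_claim_and_alloc(&c, 16) ? "ok" : "refused");
	printf("second claim: %s\n", cluster_claim_and_alloc(&c, 16) ? "ok" : "refused");
	printf("still on free list: %d\n", c.on_free_list);
	return 0;
}

This only shows the direction Chris is asking about; whether it fits the real
scan_swap_map_try_ssd_cluster() / scan_swap_map_ssd_cluster_conflict() call
structure is exactly the open question in the thread.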
>
> Hugh
> >
> > > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.544460] Modules linked in:
> > > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.548263] sp : ffff8000863e32e0
> > > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > > [ 1013.557627] Call trace:
> > > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > > [ 1013.559544]  add_to_swap+0x40/0xf0
> > > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > > [ 1013.565333]  invoke_syscall+0x50/0x128
> > > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > > [ 1013.566285]  do_el0_svc+0x28/0x40
> > > [ 1013.566667]  el0_svc+0x50/0x150
> > > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > > [ 1013.568058] irq event stamp: 0
> > > [ 1013.568661] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570167] softirqs last enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.571330] ---[ end trace 0000000000000000 ]---