On Mon, 17 Jun 2024, Chris Li wrote:
> On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > operate OK?
> > > >
> > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > in is relatively new, it is not really a regression.
> > >
> > > Sure, but it's pretty bad to merge a new feature only to have it
> > > ineffective after a few hours use.
....
> > >
> > $ /home/barry/develop/linux/mthp_swpout_test -s
> >
> > [ 1013.535798] ------------[ cut here ]------------
> > [ 1013.538886] expecting order 4 got 0
>
> This warning means there is a bug in this series somewhere I need to hunt down.
> The V1 has the same warning but I haven't heard it get triggered in
> V1, it is something new in V2.
>
> Andrew, please consider removing the series from mm-unstable until I
> resolve this warning assert.

Agreed: I was glad to see it go into mm-unstable last week, since that made it easier to include in testing (or harder to avoid!), but my conclusion is that it's not ready yet (and certainly not suitable for a 6.10 hotfix).

I too saw this "expecting order 4 got 0" once-warning every boot (from ordinary page reclaim rather than from the madvise_pageout shown below), shortly after starting my tmpfs swapping load. But I never saw any bad effect immediately after it: actual crashes came a few minutes later. (And I'm not seeing the warning at all now, with the change I made: that doesn't tell us much, since what I have leaves out 2/2 entirely; but it does suggest that it's more important to follow up the crashes, and maybe when they are satisfactorily fixed, the warning will be fixed too.)

Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx) in alloc_cluster(). And when I poked around, it was usually (always?) the case that si->free_clusters was empty, so the list_first_entry() there was not good at all (a small illustration below). A few other crashes were GPFs, but I didn't pay much attention to them, thinking the alloc_cluster() one best to pursue.

I reverted both patches from mm-everything, and had no problem. I added back 1/2 expecting it to be harmless ("no real function change in this patch"), but was surprised to get those same "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs: so have spent most time trying to get 1/2 working.
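To spell out that list_first_entry() point, here is a tiny userspace sketch (my own illustration, not kernel code: "struct cluster" is a hypothetical stand-in for swap_cluster_info, and the macros are pared-down copies of the list.h ones) of why taking the "first entry" of an empty list hands back a bogus pointer:

/* list_first_entry_demo.c: illustration only, not kernel code */
#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

struct cluster {		/* hypothetical stand-in for swap_cluster_info */
	unsigned int count;
	struct list_head list;
};

int main(void)
{
	struct list_head free_clusters = { &free_clusters, &free_clusters };
	struct cluster *ci;

	/*
	 * Empty list: head->next points back at the head itself, so the
	 * computed "first entry" is just the head minus
	 * offsetof(struct cluster, list) - a pointer into whatever
	 * happens to sit in front of the head, not a cluster at all.
	 */
	ci = list_first_entry(&free_clusters, struct cluster, list);
	printf("head at %p, bogus first entry at %p\n",
	       (void *)&free_clusters, (void *)ci);

	/* Only safe after checking emptiness (under the lock, in the kernel). */
	if (free_clusters.next != &free_clusters)
		printf("non-empty: ci would be valid\n");
	else
		printf("empty: ci must not be dereferenced\n");
	return 0;
}

Which is why list_first_entry() on free_clusters is only tolerable behind a !list_empty() check, and why alloc_cluster() finding free_clusters already empty can only end badly.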
This patch on top of 1/2, restoring when cluster_is_free(ci) can be seen to change, appears to have eliminated all those problems:

--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
 
 	VM_BUG_ON(ci - si->cluster_info != idx);
 	list_del(&ci->list);
+	ci->state = CLUSTER_STATE_PER_CPU;
 	ci->count = 0;
 	return ci;
 }
@@ -543,10 +544,6 @@ new_cluster:
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-			list_del(&ci->list);
-			spin_lock(&ci->lock);
-			ci->state = CLUSTER_STATE_PER_CPU;
-			spin_unlock(&ci->lock);
 			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
 		} else if (!list_empty(&si->discard_clusters)) {
 			/*

Delighted to have made progress after many attempts, I went to apply 2/2 on top, but found that it builds upon those scan_swap_map_try_ssd_cluster() changes I've undone. I gave up at that point and hand it back to you, Chris, hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc much better than I ever shall!

Clarifications on my load: all swapping to SSD, but discard not enabled; /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force, hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always; swapoff between iterations, which did not appear relevant to the problems; x86_64.
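In case it helps anyone reproduce, those sysfs settings amount to roughly the following (a sketch of equivalent knob-writes, not the script I actually run; needs root):

/* thp_knobs.c: illustration of the settings listed above */
#include <stdio.h>

static void set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	set_knob("/sys/kernel/mm/transparent_hugepage/enabled", "always");
	set_knob("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "force");
	set_knob("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled", "always");
	set_knob("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled", "always");
	return 0;
}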
Hugh

> > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.544460] Modules linked in:
> > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.548263] sp : ffff8000863e32e0
> > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > [ 1013.557627] Call trace:
> > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > [ 1013.559544]  add_to_swap+0x40/0xf0
> > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > [ 1013.565333]  invoke_syscall+0x50/0x128
> > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > [ 1013.566285]  do_el0_svc+0x28/0x40
> > [ 1013.566667]  el0_svc+0x50/0x150
> > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > [ 1013.568058] irq event stamp: 0
> > [ 1013.568661] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570167] softirqs last enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [ 1013.571330] ---[ end trace 0000000000000000 ]---