Hi Willy,

I'm seeing a soft lockup issue in the madvise() pageout -> reclaim
codepath that turned into the VM_BUG_ON() splat below[1] with debug
enabled. The bug corresponds to the following code in
__delete_from_swap_cache():

	...
	for (i = 0; i < nr; i++) {
		void *entry = xas_store(&xas, shadow);
		VM_BUG_ON_FOLIO(entry != folio, folio);
		set_page_private(folio_page(folio, i), 0);
		xas_next(&xas);
	}
	...

The immediate cause of the failure is that the swap entry is zero, so
the entry passed in from the caller (via folio->private) looks bogus.
This page was originally added to the swapcache as a 2MB hugepage and is
being split here, with each subpage removed/freed via this split call.
The splat occurs when attempting to remove the first subpage.

It looks like the swap entry is lost because page->private is cleared a
bit earlier in __split_huge_page_tail(). This was added via commit
b653db77350c7 ("mm: Clear page->private when splitting or migrating a
page"). I don't have context for the problem fixed by that patch, but
(so far) the following tweak seems to address both issues I've seen. (I
don't have a detailed root cause for the soft lockup variant, but from
testing it appears to be a side effect of this problem.)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..c2ddbb81a743 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2445,7 +2445,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
-	page_tail->private = 0;
+	if (!PageSwapCache(page_tail))
+		page_tail->private = 0;
 
 	/* Page flags must be visible before we make the page non-compound. */
 	smp_wmb();

Thoughts? If this makes sense I can send it as a proper patch.

Brian

[1] bug splat:

page:000000001c1895ba refcount:2 mapcount:0 mapping:00000000164a725a index:0x7f6441401 pfn:0x1d07a01
memcg:ff4ec5f22893a000
anon flags: 0x17ffffc008043d(locked|uptodate|dirty|lru|active|owner_priv_1|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc008043d ffd64cba341e8008 ffd64cba341e8088 ff4ec5f20ea6e791
raw: 00000007f6441401 0000000000000000 00000002ffffffff ff4ec5f22893a000
page dumped because: VM_BUG_ON_FOLIO(entry != folio)
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:154!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 34 PID: 12321 Comm: stress-ng Kdump: loaded Tainted: G E 6.0.0-rc3+ #4
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
RIP: 0010:__delete_from_swap_cache+0x21c/0x250
Code: 04 25 28 00 00 00 75 46 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c6 f8 ad 59 96 4c 89 f7 e8 14 e3 fb ff <0f> 0b 48 c7 c6 f0 34 59 96 4c 89 f7 e8 03 e3 fb ff 0f 0b 48 c7 c6
RSP: 0018:ff8dce04e9267878 EFLAGS: 00010046
RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff4ec5f0bfc5f860
RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ff8dce04e9267708 R11: ffffffff96fe7368 R12: ff4ec5b2a1e88000
R13: 0000000000000001 R14: ffd64cba341e8040 R15: 0000000000000000
FS:  00007f660b421740(0000) GS:ff4ec5f0bfc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6498aa6000 CR3: 00000001764ba006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 delete_from_swap_cache+0x4c/0xc0
 try_to_free_swap+0x115/0x160
 free_swap_cache+0x7f/0xc0
 free_page_and_swap_cache+0xf/0xd0
 __split_huge_page+0x4b5/0x780
 split_huge_page_to_list+0x6f9/0xa80
 madvise_cold_or_pageout_pte_range+0x433/0xd90
 ? sysvec_call_function_single+0x41/0x90
 walk_pmd_range.isra.0+0xc3/0x320
 walk_pud_range.isra.0+0x137/0x250
 walk_p4d_range+0x10b/0x170
 walk_pgd_range+0x11e/0x180
 __walk_page_range+0x56/0x1a0
 walk_page_range+0xaa/0x130
 madvise_pageout+0xf6/0x170
 ? rseq_get_rseq_cs.isra.0+0x16/0x220
 madvise_vma_behavior+0x44d/0x6c0
 ? find_vma+0x20/0x80
 do_madvise.part.0+0x1a7/0x330
 __x64_sys_madvise+0x5a/0x70
 do_syscall_64+0x59/0x90
 ? ktime_get+0x35/0xa0
 ? clockevents_program_event+0x92/0x100
 ? hrtimer_interrupt+0x126/0x210
 ? sched_clock_cpu+0x9/0xb0
 ? irqtime_account_irq+0x3c/0xb0
 ? __irq_exit_rcu+0x46/0xe0
 ? sysvec_apic_timer_interrupt+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f660b23eeeb
Code: 73 01 c3 48 8b 0d 35 af 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 05 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd3bfcdee8 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007ffd3bfce0d0 RCX: 00007f660b23eeeb
RDX: 0000000000000015 RSI: 0000000257b3f000 RDI: 00007f63b1082000
RBP: 00007f63b1082000 R08: 0000000000000000 R09: 00000000000000cb
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000257b3f000
R13: 00007f6608bc1000 R14: 00007ffd3bfcdff0 R15: 0000000000000000
 </TASK>
Modules linked in: rfkill(E) sunrpc(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) ipmi_ssif(E) coretemp(E) kvm_intel(E) mgag200(E) i2c_algo_bit(E) drm_shmem_helper(E) mlx5_ib(E) kvm(E) dcdbas(E) irqbypass(E) drm_kms_helper(E) ib_uverbs(E) rapl(E) acpi_ipmi(E) intel_cstate(E) ipmi_si(E) syscopyarea(E) mei_me(E) dell_smbios(E) ib_core(E) sysfillrect(E) ipmi_devintf(E) intel_uncore(E) nd_pmem(E) wmi_bmof(E) pcspkr(E) dell_wmi_descriptor(E) sysimgblt(E) i2c_i801(E) isst_if_mbox_pci(E) isst_if_mmio(E) intel_vsec(E) isst_if_common(E) fb_sys_fops(E) i2c_smbus(E) mei(E) ipmi_msghandler(E) nd_btt(E) intel_pch_thermal(E) dax_pmem(E) acpi_power_meter(E) fuse(E) drm(E) xfs(E) libcrc32c(E) sd_mod(E) sg(E) lpfc(E) nvmet_fc(E) mlx5_core(E) nvmet(E) mlxfw(E) nvme_fc(E) nvme_fabrics(E) crct10dif_pclmul(E) crc32_pclmul(E) tls(E) crc32c_intel(E) nvme_core(E) ahci(E) t10_pi(E) psample(E) ghash_clmulni_intel(E) libahci(E) pci_hyperv_intf(E) megaraid_sas(E) scsi_transport_fc(E) bnxt_en(E) tg3(E) nfit(E) libata(E) wmi(E) libnvdimm(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
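
For anyone who wants the failure sequence in one place, here is a rough
user-space toy model of it. This is only a sketch under stated
assumptions, not kernel code: every toy_* name, the 4-subpage "huge
page" and the array standing in for the swap cache are hypothetical, and
it assumes the subpage that trips the check is the first tail page (the
pfn in the dump ends in ...01, which seems consistent with that). It
models the swap cache add setting each subpage's private to entry + i,
the split clearing private on tail pages, and the delete path looking
the page up via private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SUBPAGES 4	/* toy value; a real 2MB THP has 512 4K subpages */
#define CACHE_SLOTS 16	/* toy swap cache size */

/* Hypothetical, heavily simplified stand-in for struct page. */
struct toy_page {
	unsigned long private;	/* swap entry value while in the swap cache */
	bool swapcache;		/* stands in for PageSwapCache() */
};

/* Toy swap cache: the slot index is the swap entry value. */
static struct toy_page *toy_cache[CACHE_SLOTS];

/* Add a huge page: subpage i is stored at, and remembers, entry + i. */
static void toy_add_to_swap_cache(struct toy_page *pages, unsigned long entry)
{
	assert(entry > 0 && entry + NR_SUBPAGES <= CACHE_SLOTS);
	for (size_t i = 0; i < NR_SUBPAGES; i++) {
		pages[i].private = entry + i;
		pages[i].swapcache = true;
		toy_cache[entry + i] = &pages[i];
	}
}

/*
 * Split: without the tweak every tail page has private cleared; with
 * the tweak private is cleared only if the page is not in the swap cache.
 */
static void toy_split(struct toy_page *pages, bool with_tweak)
{
	for (size_t i = 1; i < NR_SUBPAGES; i++) {
		if (!with_tweak || !pages[i].swapcache)
			pages[i].private = 0;
	}
}

/*
 * Delete one page, locating its slot via page->private. Returns false
 * where the real code would hit VM_BUG_ON_FOLIO(entry != folio).
 */
static bool toy_delete_from_swap_cache(struct toy_page *page)
{
	unsigned long entry = page->private;

	if (entry >= CACHE_SLOTS || toy_cache[entry] != page)
		return false;
	toy_cache[entry] = NULL;
	page->private = 0;
	page->swapcache = false;
	return true;
}

/* Run add -> split -> delete of the first tail page; report the outcome. */
static bool toy_scenario(bool with_tweak)
{
	struct toy_page pages[NR_SUBPAGES] = {0};

	for (size_t i = 0; i < CACHE_SLOTS; i++)
		toy_cache[i] = NULL;
	toy_add_to_swap_cache(pages, 8);
	toy_split(pages, with_tweak);
	return toy_delete_from_swap_cache(&pages[1]);
}
```

In this model toy_scenario(false) reports the mismatch, because the tail
page's private was zeroed and slot 0 does not hold it, while
toy_scenario(true) finds the page in its slot and removes it. It says
nothing about whatever problem b653db77350c7 was originally fixing.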