hugepage/swap: kernel BUG at mm/swap_state.c:154!

Hi Willy,

I'm seeing a soft lockup in the madvise() pageout -> reclaim
codepath that turns into the VM_BUG_ON() splat below[1] with debug
enabled. The bug corresponds to the following code in
__delete_from_swap_cache():

	...
	for (i = 0; i < nr; i++) {
		void *entry = xas_store(&xas, shadow);
		VM_BUG_ON_FOLIO(entry != folio, folio);
		set_page_private(folio_page(folio, i), 0);
		xas_next(&xas);
	}
	...

The immediate cause of the failure is that the swap entry is zero, so
the entry the caller reads from folio->private is bogus. This page was
originally added to the swap cache as part of a 2MB hugepage, which is
now being split, with each subpage removed from the swap cache and
freed by the split path. The splat occurs when attempting to remove the
first subpage.

It looks like the swap entry is lost because page->private is cleared
a bit earlier in __split_huge_page_tail(). That clearing was added by
commit b653db77350c7 ("mm: Clear page->private when splitting or
migrating a page"). I don't have context for the problem that patch
fixed, but so far the following tweak seems to address both issues
I've seen. I don't have a detailed root cause for the soft lockup
variant, but from testing it appears to be a side effect of the same
problem:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..c2ddbb81a743 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2445,7 +2445,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
-	page_tail->private = 0;
+	if (!PageSwapCache(page_tail))
+		page_tail->private = 0;
 
 	/* Page flags must be visible before we make the page non-compound. */
 	smp_wmb();
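
To make the failure mode concrete, here is a rough userspace model of
the sequence described above. It is only a sketch, not kernel code:
struct model_page, the flat cache[] array, and all function names are
hypothetical stand-ins for struct page, the swap cache xarray, and the
real add/split/delete paths, and it assumes each subpage's private
field holds its own swap entry value. The keep_swap_private flag
mirrors the PageSwapCache() check in the tweak above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_SUBPAGES 4   /* small stand-in for the subpages of a 2MB hugepage */
#define CACHE_SLOTS 16  /* size of the toy swap cache */

/* Hypothetical stand-in for struct page; only the fields the model needs. */
struct model_page {
	unsigned long private;  /* models page->private (swap entry value) */
	bool swapcache;         /* models PageSwapCache() */
};

/* Toy swap cache: slot index is the swap entry value. */
static struct model_page *cache[CACHE_SLOTS];

/* Add a huge page: each subpage records its entry, each slot holds the head. */
static void add_huge_to_cache(struct model_page *pages, unsigned long first_entry)
{
	assert(first_entry + NR_SUBPAGES <= CACHE_SLOTS);
	for (int i = 0; i < NR_SUBPAGES; i++) {
		pages[i].private = first_entry + i;
		pages[i].swapcache = true;
		cache[first_entry + i] = &pages[0];
	}
}

/* Split: tails become independent pages with their own cache slots. */
static void split_huge(struct model_page *pages, unsigned long first_entry,
		       bool keep_swap_private)
{
	for (int i = NR_SUBPAGES - 1; i >= 1; i--) {
		/* Without the tweak, private is cleared unconditionally. */
		if (!(keep_swap_private && pages[i].swapcache))
			pages[i].private = 0;
		cache[first_entry + i] = &pages[i];
	}
}

/* Delete: look the page up by the entry stored in its private field. */
static bool delete_from_cache(struct model_page *page)
{
	unsigned long idx = page->private;

	if (cache[idx] != page)
		return false;  /* the model's analogue of the VM_BUG_ON_FOLIO() */
	cache[idx] = NULL;
	page->private = 0;
	page->swapcache = false;
	return true;
}

/* Run add -> split -> delete of the first tail; report whether delete worked. */
static bool tail_delete_succeeds(bool keep_swap_private)
{
	struct model_page pages[NR_SUBPAGES] = {0};
	const unsigned long first_entry = 8;  /* arbitrary nonzero entry */

	memset(cache, 0, sizeof(cache));
	add_huge_to_cache(pages, first_entry);
	split_huge(pages, first_entry, keep_swap_private);
	return delete_from_cache(&pages[1]);
}
```

In this model, tail_delete_succeeds(false) fails the lookup because the
tail's entry has been zeroed and slot 0 does not hold that page, while
tail_delete_succeeds(true) finds the page in its own slot. Whether
preserving private is safe for the case the original commit addressed
is not something the model can answer.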

Thoughts? If this makes sense I can send it as a proper patch.

Brian

[1] bug splat:

page:000000001c1895ba refcount:2 mapcount:0 mapping:00000000164a725a index:0x7f6441401 pfn:0x1d07a01
memcg:ff4ec5f22893a000
anon flags: 0x17ffffc008043d(locked|uptodate|dirty|lru|active|owner_priv_1|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc008043d ffd64cba341e8008 ffd64cba341e8088 ff4ec5f20ea6e791
raw: 00000007f6441401 0000000000000000 00000002ffffffff ff4ec5f22893a000
page dumped because: VM_BUG_ON_FOLIO(entry != folio)
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:154!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 34 PID: 12321 Comm: stress-ng Kdump: loaded Tainted: G            E      6.0.0-rc3+ #4
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
RIP: 0010:__delete_from_swap_cache+0x21c/0x250
Code: 04 25 28 00 00 00 75 46 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c6 f8 ad 59 96 4c 89 f7 e8 14 e3 fb ff <0f> 0b 48 c7 c6 f0 34 59 96 4c 89 f7 e8 03 e3 fb ff 0f 0b 48 c7 c6
RSP: 0018:ff8dce04e9267878 EFLAGS: 00010046
RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff4ec5f0bfc5f860
RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ff8dce04e9267708 R11: ffffffff96fe7368 R12: ff4ec5b2a1e88000
R13: 0000000000000001 R14: ffd64cba341e8040 R15: 0000000000000000
FS:  00007f660b421740(0000) GS:ff4ec5f0bfc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6498aa6000 CR3: 00000001764ba006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 delete_from_swap_cache+0x4c/0xc0
 try_to_free_swap+0x115/0x160
 free_swap_cache+0x7f/0xc0
 free_page_and_swap_cache+0xf/0xd0
 __split_huge_page+0x4b5/0x780
 split_huge_page_to_list+0x6f9/0xa80
 madvise_cold_or_pageout_pte_range+0x433/0xd90
 ? sysvec_call_function_single+0x41/0x90
 walk_pmd_range.isra.0+0xc3/0x320
 walk_pud_range.isra.0+0x137/0x250
 walk_p4d_range+0x10b/0x170
 walk_pgd_range+0x11e/0x180
 __walk_page_range+0x56/0x1a0
 walk_page_range+0xaa/0x130
 madvise_pageout+0xf6/0x170
 ? rseq_get_rseq_cs.isra.0+0x16/0x220
 madvise_vma_behavior+0x44d/0x6c0
 ? find_vma+0x20/0x80
 do_madvise.part.0+0x1a7/0x330
 __x64_sys_madvise+0x5a/0x70
 do_syscall_64+0x59/0x90
 ? ktime_get+0x35/0xa0
 ? clockevents_program_event+0x92/0x100
 ? hrtimer_interrupt+0x126/0x210
 ? sched_clock_cpu+0x9/0xb0
 ? irqtime_account_irq+0x3c/0xb0
 ? __irq_exit_rcu+0x46/0xe0
 ? sysvec_apic_timer_interrupt+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f660b23eeeb
Code: 73 01 c3 48 8b 0d 35 af 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 05 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd3bfcdee8 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007ffd3bfce0d0 RCX: 00007f660b23eeeb
RDX: 0000000000000015 RSI: 0000000257b3f000 RDI: 00007f63b1082000
RBP: 00007f63b1082000 R08: 0000000000000000 R09: 00000000000000cb
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000257b3f000
R13: 00007f6608bc1000 R14: 00007ffd3bfcdff0 R15: 0000000000000000
 </TASK>
Modules linked in: rfkill(E) sunrpc(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) ipmi_ssif(E) coretemp(E) kvm_intel(E) mgag200(E) i2c_algo_bit(E) drm_shmem_helper(E) mlx5_ib(E) kvm(E) dcdbas(E) irqbypass(E) drm_kms_helper(E) ib_uverbs(E) rapl(E) acpi_ipmi(E) intel_cstate(E) ipmi_si(E) syscopyarea(E) mei_me(E) dell_smbios(E) ib_core(E) sysfillrect(E) ipmi_devintf(E) intel_uncore(E) nd_pmem(E) wmi_bmof(E) pcspkr(E) dell_wmi_descriptor(E) sysimgblt(E) i2c_i801(E) isst_if_mbox_pci(E) isst_if_mmio(E) intel_vsec(E) isst_if_common(E) fb_sys_fops(E) i2c_smbus(E) mei(E) ipmi_msghandler(E) nd_btt(E) intel_pch_thermal(E) dax_pmem(E) acpi_power_meter(E) fuse(E) drm(E) xfs(E) libcrc32c(E) sd_mod(E) sg(E) lpfc(E) nvmet_fc(E) mlx5_core(E) nvmet(E) mlxfw(E) nvme_fc(E) nvme_fabrics(E) crct10dif_pclmul(E) crc32_pclmul(E) tls(E) crc32c_intel(E) nvme_core(E) ahci(E) t10_pi(E)
 psample(E) ghash_clmulni_intel(E) libahci(E) pci_hyperv_intf(E) megaraid_sas(E) scsi_transport_fc(E) bnxt_en(E) tg3(E) nfit(E) libata(E) wmi(E) libnvdimm(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)




