Hi again,

We have another example of xarray corruption, this time involving xfs and
zsmalloc. We are running zram as swap. Two tasks are deadlocked, each waiting
for a page to be unlocked.

The following backtrace is from the zsmalloc task:

#0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2)
#1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8)
#2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3)
#3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2)
#4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4)
#5 folio_wait_locked (/cfsetup_build/build/linux/include/linux/pagemap.h:1022:3)
#6 wait_on_page_locked (/cfsetup_build/build/linux/include/linux/pagemap.h:1034:2)
#7 lock_zspage (/cfsetup_build/build/linux/mm/zsmalloc.c:1736:3)
#8 async_free_zspage (/cfsetup_build/build/linux/mm/zsmalloc.c:1974:3)
#9 process_one_work (/cfsetup_build/build/linux/kernel/workqueue.c:2289:2)
#10 worker_thread (/cfsetup_build/build/linux/kernel/workqueue.c:2436:4)
#11 kthread (/cfsetup_build/build/linux/kernel/kthread.c:376:9)
#12 ret_from_fork+0x22/0x2d (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:306)

The following backtrace is from a userspace task:

#0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2)
#1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8)
#2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3)
#3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2)
#4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4)
#5 folio_put_wait_locked (/cfsetup_build/build/linux/mm/filemap.c:1465:9)
#6 filemap_update_page (/cfsetup_build/build/linux/mm/filemap.c:2472:4)
#7 filemap_get_pages (/cfsetup_build/build/linux/mm/filemap.c:2606:9)
#8 filemap_read (/cfsetup_build/build/linux/mm/filemap.c:2676:11)
#9 xfs_file_buffered_read (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:277:8)
#10 xfs_file_read_iter (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:302:9)
#11 call_read_iter (/cfsetup_build/build/linux/include/linux/fs.h:2199:9)
#12 new_sync_read (/cfsetup_build/build/linux/fs/read_write.c:389:8)
#13 vfs_read (/cfsetup_build/build/linux/fs/read_write.c:470:9)
#14 ksys_read (/cfsetup_build/build/linux/fs/read_write.c:613:9)
#15 do_syscall_x64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:50:14)
#16 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:80:7)
#17 entry_SYSCALL_64+0x83/0x164 (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:120)

The folio in question has
.mapping = (struct address_space *)zsmalloc_mops+0x2 = 0xffffffffc1a9f332
and flags 'PG_locked|PG_waiters|PG_private|PG_slob_free'. In fact, the file's
i_pages mapping has a node full of these pages. The following entries come from
the mapping in frame #6 at 0xffffffffa4e1c586 (filemap_get_pages+0x5d6/0x624)
in filemap_update_page at /cfsetup_build/build/linux/mm/filemap.c:2472:4 (inlined):

> for index, entry in xa_for_each(trace[6]['mapping'].i_pages.address_of_()):
      print(index, entry, cast('struct folio *', entry).page.mapping.address_of_())
2936 (void *)0xffffe53ab6454f00 *(struct address_space **)0xffffe53ab6454f18 = 0xffff9ffc9ded16b0
2940 (void *)0xffffe53ab6454300 *(struct address_space **)0xffffe53ab6454318 = 0xffff9ffc9ded16b0
2944 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332 <== index
2945 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
2946 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
...
2976 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332 <== last_index
...
3006 (void *)0xffffe53a02696000 *(struct address_space **)0xffffe53a02696018 = zsmalloc_mops+0x2 = 0xffffffffc1a9f332
3007 (void *)0xffffe53ad71c37c0 *(struct address_space **)0xffffe53ad71c37d8 = 0xffff9ffc9ded16b0

On Fri, Jul 21, 2023 at 11:49 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> In the past, we reported some corruptions on xfs/iomap/xarray combinations on
> kernel 6.1. This happened very rarely (once a week for every 10000 hosts), and
> the host exhibited symptoms such as: rcu_preempt self-detected stalls,
> NULL pointer dereferences or deadlock when reading a particular file.
>
> We do not have a reproducer yet, but we now have more debugging data
> which hopefully should help narrow this down. Details as follows:
>
> 1. Kernel NULL pointer dereferences in __filemap_get_folio
>
> This happened on a few different hosts, with a few different repeated addresses.
> The addresses are 0000000000000036, 0000000000000076, 00000000000000f6.
> This looks like the xarray is corrupted and we were trying to do some work
> on a sibling entry.
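A rough illustration of the sibling-entry reading: the trapping instruction in the oopses is `mov 0x34(%rax),%eax`, so each fault address should be the loaded xarray entry plus 0x34. The sketch below assumes the entry encoding from include/linux/xarray.h (internal entries are `(v << 2) | 2`, and sibling entries are internal entries below `xa_mk_sibling(XA_CHUNK_SIZE - 1)`) and a 64-slot node. The Python helper names mirror the kernel macros but are otherwise hypothetical, and `decode_fault` is purely illustrative.

```python
# Sketch: map the reported fault addresses back to xarray entries.
# The encoding is assumed from include/linux/xarray.h; helper names mirror
# the kernel macros but this is illustrative code, not kernel or drgn API.

XA_CHUNK_SIZE = 64      # slots per xa_node (assumes XA_CHUNK_SHIFT == 6)
REFCOUNT_DISP = 0x34    # displacement in the trapping `mov 0x34(%rax),%eax`


def xa_mk_internal(v: int) -> int:
    """Encode an internal entry: value shifted left by 2, low bits 0b10."""
    return (v << 2) | 2


def xa_is_internal(entry: int) -> bool:
    return (entry & 3) == 2


def xa_is_sibling(entry: int) -> bool:
    """Sibling entries are internal entries that encode a slot offset."""
    return xa_is_internal(entry) and entry < xa_mk_internal(XA_CHUNK_SIZE - 1)


def decode_fault(addr: int):
    """Return (entry, sibling_offset) for a fault address.

    sibling_offset is None when the entry is not a sibling entry.
    """
    entry = addr - REFCOUNT_DISP
    offset = entry >> 2 if xa_is_sibling(entry) else None
    return entry, offset


for addr in (0x36, 0x76, 0xF6):
    entry, offset = decode_fault(addr)
    print(f"fault {addr:#x}: entry {entry:#x}, sibling of slot {offset}")
```

Under these assumptions the three addresses decode to entries 0x2, 0x42 and 0xc2, that is, sibling entries pointing at slots 0, 16 and 48. The first two match the RAX values in the register dumps for the 0x36 and 0x76 oopses; no register dump for 0xf6 is included here, so that one is inferred only.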
> > BUG: kernel NULL pointer dereference, address: 0000000000000036 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 18806c5067 P4D 18806c5067 PUD 188ed48067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 73 PID: 3579408 Comm: prometheus Tainted: G O > 6.1.34-cloudflare-2023.6.7 #1 > Hardware name: GIGABYTE R162-Z12-CD1/MZ12-HD4-CD, BIOS M03 11/19/2021 > RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29 > include/linux/atomic/atomic-arch-fallback.h:1242 > include/linux/atomic/atomic-arch-fallback.h:1267 > include/linux/atomic/atomic-instrumented.h:608 > include/linux/page_ref.h:238 include/linux/page_ref.h:247 > include/linux/page_ref.h:280 include/linux/page_ref.h:313 > mm/filemap.c:1863 mm/filemap.c:1915) > Code: 10 e8 99 ac 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02 > 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> > 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34 > All code > ======== > 0: 10 e8 adc %ch,%al > 2: 99 cltd > 3: ac lods %ds:(%rsi),%al > 4: 84 00 test %al,(%rax) > 6: 48 3d 06 04 00 00 cmp $0x406,%rax > c: 49 89 c4 mov %rax,%r12 > f: 74 e2 je 0xfffffffffffffff3 > 11: 48 3d 02 04 00 00 cmp $0x402,%rax > 17: 74 da je 0xfffffffffffffff3 > 19: 48 85 c0 test %rax,%rax > 1c: 0f 84 2e 02 00 00 je 0x250 > 22: a8 01 test $0x1,%al > 24: 0f 85 e3 00 00 00 jne 0x10d > 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction > 2d: 85 c0 test %eax,%eax > 2f: 74 c2 je 0xfffffffffffffff3 > 31: 8d 50 01 lea 0x1(%rax),%edx > 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > > Code starting with the faulting instruction > =========================================== > 0: 8b 40 34 mov 0x34(%rax),%eax > 3: 85 c0 test %eax,%eax > 5: 74 c2 je 0xffffffffffffffc9 > 7: 8d 50 01 lea 0x1(%rax),%edx > a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > RSP: 0000:ffffaf5587cdfc60 
EFLAGS: 00010246 > RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000002 > RDX: 0000000000000008 RSI: ffffa45181fa8000 RDI: ffffaf5587cdfc70 > RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000006e44f > R10: 000000000006e450 R11: 000000000006e448 R12: 0000000000000002 > R13: ffffa3fff6fdfeb0 R14: 000000000006e44a R15: 00000000000000d1 > FS: 000000c9e385ac90(0000) GS:ffffa4153fc40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000036 CR3: 000000296a1bc002 CR4: 0000000000770ee0 > PKRU: 55555554 > Call Trace: > <TASK> > ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 > arch/x86/kernel/dumpstack.c:465 arch/x86/kernel/dumpstack.c:420) > ? page_fault_oops (arch/x86/mm/fault.c:727) > ? migrate_task_rq_fair (include/linux/sched.h:1921 > kernel/sched/fair.c:3932 kernel/sched/fair.c:7497) > ? do_user_addr_fault (include/linux/kprobes.h:404 > include/linux/kprobes.h:597 arch/x86/mm/fault.c:1280) > ? ttwu_queue_wakelist (kernel/sched/core.c:3880) > ? exc_page_fault (arch/x86/include/asm/irqflags.h:40 > arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527 > arch/x86/mm/fault.c:1575) > ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570) > ? __filemap_get_folio (arch/x86/include/asm/atomic.h:29 > include/linux/atomic/atomic-arch-fallback.h:1242 > include/linux/atomic/atomic-arch-fallback.h:1267 > include/linux/atomic/atomic-instrumented.h:608 > include/linux/page_ref.h:238 include/linux/page_ref.h:247 > include/linux/page_ref.h:280 include/linux/page_ref.h:313 > mm/filemap.c:1863 mm/filemap.c:1915) > filemap_fault (mm/filemap.c:3120) > ? 
preempt_count_add (include/linux/ftrace.h:950 > kernel/sched/core.c:5685 kernel/sched/core.c:5682 > kernel/sched/core.c:5710) > __do_fault (mm/memory.c:4234) > do_fault (mm/memory.c:4564 mm/memory.c:4692) > __handle_mm_fault (mm/memory.c:4964 mm/memory.c:5106) > handle_mm_fault (mm/memory.c:5227) > do_user_addr_fault (include/linux/sched/signal.h:433 > arch/x86/mm/fault.c:1430) > exc_page_fault (arch/x86/include/asm/irqflags.h:40 > arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527 > arch/x86/mm/fault.c:1575) > asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570) > RIP: 0033:0x268b8b9 > Code: 70 48 89 4c 24 78 48 8b 94 24 b8 00 00 00 0f 1f 00 48 85 d2 > 74 3f 48 89 ce 48 29 d9 4c 8d 49 04 49 f7 d9 49 c1 f9 3f 49 21 f9 <46> > 8b 0c 08 44 89 4c 24 34 90 90 48 89 d3 48 89 c1 41 b8 01 00 00 > All code > ======== > 0: 70 48 jo 0x4a > 2: 89 4c 24 78 mov %ecx,0x78(%rsp) > 6: 48 8b 94 24 b8 00 00 mov 0xb8(%rsp),%rdx > d: 00 > e: 0f 1f 00 nopl (%rax) > 11: 48 85 d2 test %rdx,%rdx > 14: 74 3f je 0x55 > 16: 48 89 ce mov %rcx,%rsi > 19: 48 29 d9 sub %rbx,%rcx > 1c: 4c 8d 49 04 lea 0x4(%rcx),%r9 > 20: 49 f7 d9 neg %r9 > 23: 49 c1 f9 3f sar $0x3f,%r9 > 27: 49 21 f9 and %rdi,%r9 > 2a:* 46 8b 0c 08 mov (%rax,%r9,1),%r9d <-- trapping > instruction > 2e: 44 89 4c 24 34 mov %r9d,0x34(%rsp) > 33: 90 nop > 34: 90 nop > 35: 48 89 d3 mov %rdx,%rbx > 38: 48 89 c1 mov %rax,%rcx > 3b: 41 rex.B > 3c: b8 .byte 0xb8 > 3d: 01 00 add %eax,(%rax) > ... > > Code starting with the faulting instruction > =========================================== > 0: 46 8b 0c 08 mov (%rax,%r9,1),%r9d > 4: 44 89 4c 24 34 mov %r9d,0x34(%rsp) > 9: 90 nop > a: 90 nop > b: 48 89 d3 mov %rdx,%rbx > e: 48 89 c1 mov %rax,%rcx > 11: 41 rex.B > 12: b8 .byte 0xb8 > 13: 01 00 add %eax,(%rax) > ... 
> RSP: 002b:000000cbc509f520 EFLAGS: 00010202 > RAX: 00007e81cf427e0c RBX: 00000000000222cc RCX: 00000000123817b2 > RDX: 000000c00001ac00 RSI: 00000000123a3a7e RDI: 00000000000222c8 > RBP: 000000cbc509f5b0 R08: 0000000003cb5910 R09: 00000000000222c8 > R10: 000000c4de3dea00 R11: 0000000000000123 R12: 0000000000000000 > R13: 0000000000000005 R14: 000000c83bad2340 R15: 0000010000000000 > </TASK> > Modules linked in: xt_connlabel xt_MASQUERADE nf_conntrack_netlink > xfrm_user xfrm_algo xt_addrtype br_netfilter bridge overlay zstd > zstd_compress zram zsmalloc tun tcp_diag inet_diag raid0 md_mod essiv > dm_crypt trusted asn1_encoder tee ip6table_filter ip6table_mangle > ip6table_raw ip6table_security ip6table_nat ip6_tables xt_bpf > xt_conntrack xt_multiport xt_set iptable_filter xt_NFLOG nfnetlink_log > xt_connbytes xt_comment xt_connmark xt_statistic iptable_mangle xt_nat > xt_tcpudp iptable_nat nf_nat xt_CT iptable_raw ip_set_hash_ip > ip_set_hash_net ip_set nfnetlink sch_fq nf_conntrack nf_defrag_ipv6 > nf_defrag_ipv4 8021q garp mrp stp llc bonding nvme_fabrics amd64_edac > kvm_amd ipmi_ssif kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 > acpi_ipmi mlx5_core aesni_intel ipmi_si mlxfw rapl xhci_pci nvme tls > ipmi_devintf tiny_power_button psample nvme_core xhci_hcd i2c_piix4 > ccp ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables > bcmcrypt(O) > crypto_simd cryptd > CR2: 0000000000000036 > ---[ end trace 0000000000000000 ]--- > RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29 > include/linux/atomic/atomic-arch-fallback.h:1242 > include/linux/atomic/atomic-arch-fallback.h:1267 > include/linux/atomic/atomic-instrumented.h:608 > include/linux/page_ref.h:238 include/linux/page_ref.h:247 > include/linux/page_ref.h:280 include/linux/page_ref.h:313 > mm/filemap.c:1863 mm/filemap.c:1915) > Code: 10 e8 99 ac 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02 > 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> > 40 34 
85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34 > All code > ======== > 0: 10 e8 adc %ch,%al > 2: 99 cltd > 3: ac lods %ds:(%rsi),%al > 4: 84 00 test %al,(%rax) > 6: 48 3d 06 04 00 00 cmp $0x406,%rax > c: 49 89 c4 mov %rax,%r12 > f: 74 e2 je 0xfffffffffffffff3 > 11: 48 3d 02 04 00 00 cmp $0x402,%rax > 17: 74 da je 0xfffffffffffffff3 > 19: 48 85 c0 test %rax,%rax > 1c: 0f 84 2e 02 00 00 je 0x250 > 22: a8 01 test $0x1,%al > 24: 0f 85 e3 00 00 00 jne 0x10d > 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction > 2d: 85 c0 test %eax,%eax > 2f: 74 c2 je 0xfffffffffffffff3 > 31: 8d 50 01 lea 0x1(%rax),%edx > 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > > Code starting with the faulting instruction > =========================================== > 0: 8b 40 34 mov 0x34(%rax),%eax > 3: 85 c0 test %eax,%eax > 5: 74 c2 je 0xffffffffffffffc9 > 7: 8d 50 01 lea 0x1(%rax),%edx > a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > RSP: 0000:ffffaf5587cdfc60 EFLAGS: 00010246 > RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000002 > RDX: 0000000000000008 RSI: ffffa45181fa8000 RDI: ffffaf5587cdfc70 > RBP: 0000000000000000 R08: 0000000000000402 R09: 000000000006e44f > R10: 000000000006e450 R11: 000000000006e448 R12: 0000000000000002 > R13: ffffa3fff6fdfeb0 R14: 000000000006e44a R15: 00000000000000d1 > FS: 000000c9e385ac90(0000) GS:ffffa4153fc40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000036 CR3: 000000296a1bc002 CR4: 0000000000770ee0 > PKRU: 55555554 > > BUG: kernel NULL pointer dereference, address: 0000000000000076 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 7acd78067 P4D 7acd78067 PUD 7acd79067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 93 PID: 3784417 Comm: prometheus Tainted: G O > 6.1.20-cloudflare-2023.3.18 #1 > Hardware name: GIGABYTE 
R162-Z13-CD/MZ12-HD2-CD, BIOS R13 07/17/2020 > RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29 > include/linux/atomic/atomic-arch-fallback.h:1242 > include/linux/atomic/atomic-arch-fallback.h:1267 > include/linux/atomic/atomic-instrumented.h:608 > include/linux/page_ref.h:238 include/linux/page_ref.h:247 > include/linux/page_ref.h:280 include/linux/page_ref.h:313 > mm/filemap.c:1863 mm/filemap.c:1915) > Code: 10 e8 b9 a4 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02 > 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> > 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34 > All code > ======== > 0: 10 e8 adc %ch,%al > 2: b9 a4 84 00 48 mov $0x480084a4,%ecx > 7: 3d 06 04 00 00 cmp $0x406,%eax > c: 49 89 c4 mov %rax,%r12 > f: 74 e2 je 0xfffffffffffffff3 > 11: 48 3d 02 04 00 00 cmp $0x402,%rax > 17: 74 da je 0xfffffffffffffff3 > 19: 48 85 c0 test %rax,%rax > 1c: 0f 84 2e 02 00 00 je 0x250 > 22: a8 01 test $0x1,%al > 24: 0f 85 e3 00 00 00 jne 0x10d > 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction > 2d: 85 c0 test %eax,%eax > 2f: 74 c2 je 0xfffffffffffffff3 > 31: 8d 50 01 lea 0x1(%rax),%edx > 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > > Code starting with the faulting instruction > =========================================== > 0: 8b 40 34 mov 0x34(%rax),%eax > 3: 85 c0 test %eax,%eax > 5: 74 c2 je 0xffffffffffffffc9 > 7: 8d 50 01 lea 0x1(%rax),%edx > a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > RSP: 0000:ffffb15106683c60 EFLAGS: 00010246 > RAX: 0000000000000042 RBX: 0000000000000000 RCX: 0000000000000002 > RDX: 0000000000000018 RSI: ffff934b0029efc8 RDI: ffffb15106683c70 > RBP: 0000000000000000 R08: 0000000000000402 R09: 00000000000cbe5f > R10: 00000000000cbe60 R11: 00000000000cbe5c R12: 0000000000000042 > R13: ffff93449c251eb0 R14: 00000000000cbe59 R15: 00000000000000d1 > FS: 000000c000300090(0000) 
GS:ffff937e6ed40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000076 CR3: 0000000a6528e000 CR4: 0000000000350ee0 > Call Trace: > <TASK> > filemap_fault (mm/filemap.c:3120) > ? preempt_count_add (include/linux/ftrace.h:950 > kernel/sched/core.c:5685 kernel/sched/core.c:5682 > kernel/sched/core.c:5710) > __do_fault (mm/memory.c:4234) > do_fault (mm/memory.c:4564 mm/memory.c:4692) > __handle_mm_fault (mm/memory.c:4964 mm/memory.c:5106) > handle_mm_fault (mm/memory.c:5227) > do_user_addr_fault (include/linux/sched/signal.h:433 > arch/x86/mm/fault.c:1430) > exc_page_fault (arch/x86/include/asm/irqflags.h:40 > arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1527 > arch/x86/mm/fault.c:1575) > asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570) > RIP: 0033:0x268b8b9 > Code: 70 48 89 4c 24 78 48 8b 94 24 b8 00 00 00 0f 1f 00 48 85 d2 > 74 3f 48 89 ce 48 29 d9 4c 8d 49 04 49 f7 d9 49 c1 f9 3f 49 21 f9 <46> > 8b 0c 08 44 89 4c 24 34 90 90 48 89 d3 48 89 c1 41 b8 01 00 00 > All code > ======== > 0: 70 48 jo 0x4a > 2: 89 4c 24 78 mov %ecx,0x78(%rsp) > 6: 48 8b 94 24 b8 00 00 mov 0xb8(%rsp),%rdx > d: 00 > e: 0f 1f 00 nopl (%rax) > 11: 48 85 d2 test %rdx,%rdx > 14: 74 3f je 0x55 > 16: 48 89 ce mov %rcx,%rsi > 19: 48 29 d9 sub %rbx,%rcx > 1c: 4c 8d 49 04 lea 0x4(%rcx),%r9 > 20: 49 f7 d9 neg %r9 > 23: 49 c1 f9 3f sar $0x3f,%r9 > 27: 49 21 f9 and %rdi,%r9 > 2a:* 46 8b 0c 08 mov (%rax,%r9,1),%r9d <-- trapping > instruction > 2e: 44 89 4c 24 34 mov %r9d,0x34(%rsp) > 33: 90 nop > 34: 90 nop > 35: 48 89 d3 mov %rdx,%rbx > 38: 48 89 c1 mov %rax,%rcx > 3b: 41 rex.B > 3c: b8 .byte 0xb8 > 3d: 01 00 add %eax,(%rax) > ... 
> > Code starting with the faulting instruction > =========================================== > 0: 46 8b 0c 08 mov (%rax,%r9,1),%r9d > 4: 44 89 4c 24 34 mov %r9d,0x34(%rsp) > 9: 90 nop > a: 90 nop > b: 48 89 d3 mov %rdx,%rbx > e: 48 89 c1 mov %rax,%rcx > 11: 41 rex.B > 12: b8 .byte 0xb8 > 13: 01 00 add %eax,(%rax) > ... > RSP: 002b:000000d735bb3558 EFLAGS: 00010206 > RAX: 00007c018402dad8 RBX: 000000000002c3d8 RCX: 0000000037f9be1c > RDX: 000000c000222c00 RSI: 0000000037fc81f4 RDI: 000000000002c3d4 > RBP: 000000d735bb35e8 R08: 0000000003cb5910 R09: 000000000002c3d4 > R10: 000000c385d2a000 R11: 0000000000000021 R12: 0000000000000000 > R13: 000000000000000b R14: 000000d1bb70e340 R15: 0000000001000000 > </TASK> > Modules linked in: veth xt_MASQUERADE nf_conntrack_netlink > xfrm_user xfrm_algo xt_addrtype br_netfilter bridge overlay raid1 > md_mod essiv dm_crypt trusted tee asn1_encoder xt_hl ip6table_filter > ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables > xt_tcpudp xt_conntrack xt_comment xt_multiport xt_set iptable_filter > iptable_mangle iptable_nat nf_nat xt_CT iptable_raw ip_set_hash_ip > ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack > nf_defrag_ipv6 nf_defrag_ipv4 8021q mrp garp stp llc bonding > amd64_edac kvm_amd ipmi_ssif kvm irqbypass crc32_pclmul crc32c_intel > mlx5_core sha512_ssse3 psample acpi_ipmi aesni_intel xhci_pci nvme > ipmi_si rapl tls ipmi_devintf tiny_power_button nvme_core mlxfw > xhci_hcd i2c_piix4 ccp ipmi_msghandler button fuse dm_mod dax efivarfs > ip_tables x_tables bcmcrypt(O) crypto_simd cryptd > CR2: 0000000000000076 > ---[ end trace 0000000000000000 ]--- > RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29 > include/linux/atomic/atomic-arch-fallback.h:1242 > include/linux/atomic/atomic-arch-fallback.h:1267 > include/linux/atomic/atomic-instrumented.h:608 > include/linux/page_ref.h:238 include/linux/page_ref.h:247 > include/linux/page_ref.h:280 include/linux/page_ref.h:313 > 
mm/filemap.c:1863 mm/filemap.c:1915) > Code: 10 e8 b9 a4 84 00 48 3d 06 04 00 00 49 89 c4 74 e2 48 3d 02 > 04 00 00 74 da 48 85 c0 0f 84 2e 02 00 00 a8 01 0f 85 e3 00 00 00 <8b> > 40 34 85 c0 74 c2 8d 50 01 4d 8d 7c 24 34 f0 41 0f b1 54 24 34 > All code > ======== > 0: 10 e8 adc %ch,%al > 2: b9 a4 84 00 48 mov $0x480084a4,%ecx > 7: 3d 06 04 00 00 cmp $0x406,%eax > c: 49 89 c4 mov %rax,%r12 > f: 74 e2 je 0xfffffffffffffff3 > 11: 48 3d 02 04 00 00 cmp $0x402,%rax > 17: 74 da je 0xfffffffffffffff3 > 19: 48 85 c0 test %rax,%rax > 1c: 0f 84 2e 02 00 00 je 0x250 > 22: a8 01 test $0x1,%al > 24: 0f 85 e3 00 00 00 jne 0x10d > 2a:* 8b 40 34 mov 0x34(%rax),%eax <-- trapping instruction > 2d: 85 c0 test %eax,%eax > 2f: 74 c2 je 0xfffffffffffffff3 > 31: 8d 50 01 lea 0x1(%rax),%edx > 34: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > 39: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > > Code starting with the faulting instruction > =========================================== > 0: 8b 40 34 mov 0x34(%rax),%eax > 3: 85 c0 test %eax,%eax > 5: 74 c2 je 0xffffffffffffffc9 > 7: 8d 50 01 lea 0x1(%rax),%edx > a: 4d 8d 7c 24 34 lea 0x34(%r12),%r15 > f: f0 41 0f b1 54 24 34 lock cmpxchg %edx,0x34(%r12) > RSP: 0000:ffffb15106683c60 EFLAGS: 00010246 > RAX: 0000000000000042 RBX: 0000000000000000 RCX: 0000000000000002 > RDX: 0000000000000018 RSI: ffff934b0029efc8 RDI: ffffb15106683c70 > RBP: 0000000000000000 R08: 0000000000000402 R09: 00000000000cbe5f > R10: 00000000000cbe60 R11: 00000000000cbe5c R12: 0000000000000042 > R13: ffff93449c251eb0 R14: 00000000000cbe59 R15: 00000000000000d1 > FS: 000000c000300090(0000) GS:ffff937e6ed40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000076 CR3: 0000000a6528e000 CR4: 0000000000350ee0 > note: prometheus[3784417] exited with irqs disabled > > 2. 
Kernel NULL pointer dereferences in xfs_read_iomap_begin
>
> BUG: unable to handle page fault for address: 0000000000034668
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 11cfd37067 P4D 11cfd37067 PUD b88086067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 124 PID: 3831226 Comm: rocksdb:low Kdump: loaded Tainted: G
> W O L 6.1.27-cloudflare-2023.5.0 #1
> Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> RIP: 0010:xfs_read_iomap_begin (fs/xfs/xfs_iomap.c:1200)
> Code: 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 50 48
> 89 14 24 4c 89 44 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 48 <48>
> 8b 87
>
> All code
> ========
>    0:   0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
>    5:   41 57                   push %r15
>    7:   41 56                   push %r14
>    9:   41 55                   push %r13
>    b:   41 54                   push %r12
>    d:   55                      push %rbp
>    e:   53                      push %rbx
>    f:   48 83 ec 50             sub $0x50,%rsp
>   13:   48 89 14 24             mov %rdx,(%rsp)
>   17:   4c 89 44 24 08          mov %r8,0x8(%rsp)
>   1c:   65 48 8b 04 25 28 00    mov %gs:0x28,%rax
>   23:   00 00
>   25:   48 89 44 24 48          mov %rax,0x48(%rsp)
>   2a:*  48                      rex.W <-- trapping instruction
>   2b:   8b                      .byte 0x8b
>   2c:   87 00                   xchg %eax,(%rax)
>
> Code starting with the faulting instruction
> ===========================================
>    0:   48                      rex.W
>    1:   8b                      .byte 0x8b
>    2:   87 00                   xchg %eax,(%rax)
> RSP: 0018:ffffa63810733a70 EFLAGS: 00010282
> RAX: 78ac714f0997e100 RBX: ffffa63810733b40 RCX: 0000000000000000
> RDX: 0000000000004000 RSI: 0000000000000000 RDI: 00000000000347a0
> RBP: ffffffff8664d950 R08: ffffa63810733b68 R09: ffffa63810733bb0
> R10: 000000000001f627 R11: 0000000000000000 R12: ffffa63810733b68
> R13: ffffa63810733bb0 R14: 00000000000019c1 R15: 00000000fffffff5
> FS: 00007f48d8504700(0000) GS:ffffa2fe5ef00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000034668 CR3: 00000013037ec001 CR4: 0000000000770ee0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ?
__mod_memcg_lruvec_state (mm/memcontrol.c:613 mm/memcontrol.c:799) > iomap_iter (fs/iomap/iter.c:76) > iomap_read_folio (fs/iomap/buffered-io.c:342) > ? xfs_end_bio (fs/xfs/xfs_aops.c:542) > filemap_read_folio (mm/filemap.c:2407) > filemap_get_pages (mm/filemap.c:2492 mm/filemap.c:2606) > filemap_read (mm/filemap.c:2677) > xfs_file_buffered_read (fs/xfs/xfs_file.c:278) > xfs_file_read_iter (fs/xfs/xfs_file.c:304) > vfs_read (fs/read_write.c:390 fs/read_write.c:470) > __x64_sys_pread64 (include/linux/file.h:44 fs/read_write.c:666 > fs/read_write.c:675 fs/read_write.c:672 fs/read_write.c:672) > do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) > RIP: 0033:0x7f49061ca917 > Code: 08 89 3c 24 48 89 4c 24 18 e8 05 f4 ff ff 4c 8b 54 24 18 48 > 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> > 3d 00 > > All code > ======== > 0: 08 89 3c 24 48 89 or %cl,-0x76b7dbc4(%rcx) > 6: 4c 24 18 rex.WR and $0x18,%al > 9: e8 05 f4 ff ff call 0xfffffffffffff413 > e: 4c 8b 54 24 18 mov 0x18(%rsp),%r10 > 13: 48 8b 54 24 10 mov 0x10(%rsp),%rdx > 18: 41 89 c0 mov %eax,%r8d > 1b: 48 8b 74 24 08 mov 0x8(%rsp),%rsi > 20: 8b 3c 24 mov (%rsp),%edi > 23: b8 11 00 00 00 mov $0x11,%eax > 28: 0f 05 syscall > 2a:* 48 rex.W <-- trapping instruction > 2b: 3d .byte 0x3d > ... > > Code starting with the faulting instruction > =========================================== > 0: 48 rex.W > 1: 3d .byte 0x3d > ... 
> RSP: 002b:00007f48d84ffc70 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 > RAX: ffffffffffffffda RBX: 00000000018a0c90 RCX: 00007f49061ca917 > RDX: 00000000000c294f RSI: 000000002265e000 RDI: 000000000000003c > RBP: 00007f48d84ffda0 R08: 0000000000000000 R09: 00007f48d84ffe60 > R10: 000000000191dcd8 R11: 0000000000000293 R12: 0000000007c3c6c0 > R13: 00000000000c294f R14: 00000000000c294f R15: 000000000191dcd8 > </TASK> > Modules linked in: xt_connlabel overlay nft_compat esp4 > xt_hashlimit ip_set_hash_netport xt_length nf_conntrack_netlink > mpls_gso mpls_iptunnel > > tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 8021q > garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass > crc32_pclmul crc32> > CR2: 0000000000034668 > ---[ end trace 0000000000000000 ]--- > > We also have a deadlock reading a very specific file on this host. We managed to > do a kdump on this host and extracted out the state of the mapping. > > > >>> trace > #0 context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5241:2) > #1 __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6554:8) > #2 schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6630:3) > #3 io_schedule (/cfsetup_build/build/linux/kernel/sched/core.c:8774:2) > #4 folio_wait_bit_common (/cfsetup_build/build/linux/mm/filemap.c:1296:4) > #5 folio_put_wait_locked (/cfsetup_build/build/linux/mm/filemap.c:1465:9) > #6 filemap_update_page (/cfsetup_build/build/linux/mm/filemap.c:2472:4) > #7 filemap_get_pages (/cfsetup_build/build/linux/mm/filemap.c:2606:9) > #8 filemap_read (/cfsetup_build/build/linux/mm/filemap.c:2676:11) > #9 xfs_file_buffered_read > (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:277:8) > #10 xfs_file_read_iter (/cfsetup_build/build/linux/fs/xfs/xfs_file.c:302:9) > #11 call_read_iter (/cfsetup_build/build/linux/include/linux/fs.h:2199:9) > #12 new_sync_read (/cfsetup_build/build/linux/fs/read_write.c:389:8) > #13 vfs_read (/cfsetup_build/build/linux/fs/read_write.c:470:9) > #14 
ksys_read (/cfsetup_build/build/linux/fs/read_write.c:613:9) > #15 do_syscall_x64 > (/cfsetup_build/build/linux/arch/x86/entry/common.c:50:14) > #16 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:80:7) > #17 entry_SYSCALL_64+0x83/0x164 > (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:120) > #18 0x7f05f0b093ce > >>> folio = trace[6]['folio'] > >>> decode_page_flags(folio) > 'PG_locked|PG_waiters|PG_head' > >>> folio > *(struct folio *)0xffffd67406346000 = { > .flags = (unsigned long)13510764522438785, > .lru = (struct list_head){ > .next = (struct list_head *)0xdead000000000100, > .prev = (struct list_head *)0xdead000000000122, > }, > .__filler = (void *)0xdead000000000100, > .mlock_count = (unsigned int)290, > .mapping = (struct address_space *)0x0, > .index = (unsigned long)18446641474676726016, > .private = (void *)0x400000, > ._mapcount = (atomic_t){ > .counter = (int)-1, > }, > ._refcount = (atomic_t){ > .counter = (int)1, > }, > .memcg_data = (unsigned long)0, > .page = (struct page){ > .flags = (unsigned long)13510764522438785, > .lru = (struct list_head){ > .next = (struct list_head *)0xdead000000000100, > .prev = (struct list_head *)0xdead000000000122, > }, > .__filler = (void *)0xdead000000000100, > .mlock_count = (unsigned int)290, > .buddy_list = (struct list_head){ > .next = (struct list_head *)0xdead000000000100, > .prev = (struct list_head *)0xdead000000000122, > }, > .pcp_list = (struct list_head){ > .next = (struct list_head *)0xdead000000000100, > .prev = (struct list_head *)0xdead000000000122, > }, > .mapping = (struct address_space *)0x0, > .index = (unsigned long)18446641474676726016, > .private = (unsigned long)4194304, > .pp_magic = (unsigned long)16045481047390945536, > .pp = (struct page_pool *)0xdead000000000122, > ._pp_mapping_pad = (unsigned long)0, > .dma_addr = (unsigned long)18446641474676726016, > .dma_addr_upper = (unsigned long)4194304, > .pp_frag_count = (atomic_long_t){ > .counter = (s64)4194304, > }, > 
.compound_head = (unsigned long)16045481047390945536, > .compound_dtor = (unsigned char)34, > .compound_order = (unsigned char)1, > .compound_mapcount = (atomic_t){ > .counter = (int)-559087616, > }, > .compound_pincount = (atomic_t){ > .counter = (int)0, > }, > .compound_nr = (unsigned int)0, > ._compound_pad_1 = (unsigned long)16045481047390945536, > ._compound_pad_2 = (unsigned long)16045481047390945570, > .deferred_list = (struct list_head){ > .next = (struct list_head *)0x0, > .prev = (struct list_head *)0xffffa2afcd181900, > }, > ._pt_pad_1 = (unsigned long)16045481047390945536, > .pmd_huge_pte = (pgtable_t)0xdead000000000122, > ._pt_pad_2 = (unsigned long)0, > .pt_mm = (struct mm_struct *)0xffffa2afcd181900, > .pt_frag_refcount = (atomic_t){ > .counter = (int)-854058752, > }, > .ptl = (spinlock_t){ > .rlock = (struct raw_spinlock){ > .raw_lock = (arch_spinlock_t){ > .val = (atomic_t){ > .counter = (int)4194304, > }, > .locked = (u8)0, > .pending = (u8)0, > .locked_pending = (u16)0, > .tail = (u16)64, > }, > }, > }, > .pgmap = (struct dev_pagemap *)0xdead000000000100, > .zone_device_data = (void *)0xdead000000000122, > .callback_head = (struct callback_head){ > .next = (struct callback_head *)0xdead000000000100, > .func = (void (*)(struct callback_head > *))0xdead000000000122, > }, > ._mapcount = (atomic_t){ > .counter = (int)-1, > }, > .page_type = (unsigned int)4294967295, > ._refcount = (atomic_t){ > .counter = (int)1, > }, > .memcg_data = (unsigned long)0, > }, > ._flags_1 = (unsigned long)13510764522373120, > .__head = (unsigned long)18446698392541487105, > ._folio_dtor = (unsigned char)1, > ._folio_order = (unsigned char)2, > ._total_mapcount = (atomic_t){ > .counter = (int)-1, > }, > ._pincount = (atomic_t){ > .counter = (int)0, > }, > ._folio_nr_pages = (unsigned int)4, > } > >>> for index, entry in > xa_for_each(trace[7]['mapping'].i_pages.address_of_()): > print(index, entry, cast('struct folio *', > entry).page.mapping.address_of_()) > .... 
>
> 6464 (void *)0xffffd674c130a000 *(struct address_space **)0xffffd674c130a018 = 0xffffa2b30e93b2b0
> 6528 (void *)0xffffd674beb22000 *(struct address_space **)0xffffd674beb22018 = 0xffffa2b30e93b2b0
> 6592 (void *)0xffffd67406346000 *(struct address_space **)0xffffd67406346018 = 0x0 <===== our folio
> 6624 (void *)0x7037e8d8000100d (struct address_space **)0x7037e8d80001025
> 6625 (void *)0x7037e047000100d (struct address_space **)0x7037e0470001025
> ....
>
> This looks like the xarray is corrupted, and for some reason we have a
> locked folio in the mapping whose page has no mapping.
>
> Any suggestions on narrowing this down to a hypothesis to try to reproduce this,
> or potential fixes, are very much appreciated. We are also trying some
> different kernel configurations on different sets of hosts to see if the
> problems go away for them, such as:
> - 6.1.36 without "xfs: Support large folios"
>   (6795801366da0cd3d99e27c37f020a8f16714886)
> - 6.1.36 without THP
> - 6.1.37 with the following series backported:
>   "xfs, iomap: fix data corruption due to stale cached iomaps"
>   https://lore.kernel.org/linux-fsdevel/20221129001632.GX3600936@xxxxxxxxxxxxxxxxxxx/
>
> Best,
> Daniel.
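The mapping checks done by hand in both dumps can be sketched as a small helper. This is a minimal sketch in plain Python, not drgn: it takes (index, folio address, folio->mapping) tuples already extracted from a dump and reports the entries whose mapping is not the file's address_space. The PAGE_MAPPING_* values are assumed from include/linux/page-flags.h; `classify_mapping`, `stale_entries` and the tuple layout are hypothetical names chosen for illustration, and the sample tuples are copied from the listings in this thread.

```python
# Sketch: flag page-cache entries whose folio->mapping is not the file's
# address_space. Low-bit tags assumed from include/linux/page-flags.h.

PAGE_MAPPING_ANON = 0x1
PAGE_MAPPING_MOVABLE = 0x2
PAGE_MAPPING_FLAGS = 0x3


def classify_mapping(mapping: int) -> str:
    """Describe a raw page->mapping value by its low tag bits."""
    if mapping == 0:
        return "none"
    tag = mapping & PAGE_MAPPING_FLAGS
    if tag == PAGE_MAPPING_MOVABLE:
        return "movable"   # e.g. zsmalloc_mops+0x2
    if tag == PAGE_MAPPING_ANON:
        return "anon"
    if tag == PAGE_MAPPING_FLAGS:
        return "ksm"
    return "file"


def stale_entries(entries, expected_mapping: int):
    """Return (index, folio, kind) for entries not owned by expected_mapping."""
    return [
        (index, folio, classify_mapping(mapping))
        for index, folio, mapping in entries
        if mapping != expected_mapping
    ]


# Sample rows taken from the two listings in this thread.
FILE_A = 0xFFFF9FFC9DED16B0
ZSMALLOC_MOPS_TAGGED = 0xFFFFFFFFC1A9F332
listing_a = [
    (2940, 0xFFFFE53AB6454300, FILE_A),
    (2944, 0xFFFFE53A02696000, ZSMALLOC_MOPS_TAGGED),
    (3007, 0xFFFFE53AD71C37C0, FILE_A),
]

FILE_B = 0xFFFFA2B30E93B2B0
listing_b = [
    (6528, 0xFFFFD674BEB22000, FILE_B),
    (6592, 0xFFFFD67406346000, 0x0),
]

for index, folio, kind in stale_entries(listing_a, FILE_A) + stale_entries(listing_b, FILE_B):
    print(f"index {index}: folio {folio:#x} has a {kind} mapping")
```

With drgn, the tuples could be produced by the same `xa_for_each` loop shown in the dumps. The helper only makes the pattern explicit: in the newer dump the stale folio now belongs to a movable (zsmalloc) owner, while in the earlier one it has no mapping at all.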