On Fri, Jul 21, 2023 at 11:49:04AM +0100, Daniel Dao wrote: > We do not have a reproducer yet, but we now have more debugging data > which hopefully > should help narrow this down. Details as followed: > > 1. Kernel NULL pointer deferencences in __filemap_get_folio > > This happened on a few different hosts, with a few different repeated addresses. > The addresses are 0000000000000036, 0000000000000076, > 00000000000000f6. This looks > like the xarray is corrupted and we were trying to do some work on a > sibling entry. I think I have a fix for this one. Please try the attached. > 2. Kernel NULL pointer deferencences in xfs_read_iomap_begin > > BUG: unable to handle page fault for address: 0000000000034668 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 11cfd37067 P4D 11cfd37067 PUD b88086067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 124 PID: 3831226 Comm: rocksdb:low Kdump: loaded Tainted: G > W O L 6.1.27-cloudflare-2023.5.0 #1 > Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022 > RIP: 0010:xfs_read_iomap_begin (fs/xfs/xfs_iomap.c:1200) > Code: 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 50 48 > 89 14 24 4c 89 44 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 48 <48> > 8b 87 > > All code > ======== > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > 5: 41 57 push %r15 > 7: 41 56 push %r14 > 9: 41 55 push %r13 > b: 41 54 push %r12 > d: 55 push %rbp > e: 53 push %rbx > f: 48 83 ec 50 sub $0x50,%rsp > 13: 48 89 14 24 mov %rdx,(%rsp) > 17: 4c 89 44 24 08 mov %r8,0x8(%rsp) > 1c: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax > 23: 00 00 > 25: 48 89 44 24 48 mov %rax,0x48(%rsp) > 2a:* 48 rex.W <-- trapping instruction > 2b: 8b .byte 0x8b > 2c: 87 00 xchg %eax,(%rax) > > Code starting with the faulting instruction > =========================================== > 0: 48 rex.W > 1: 8b .byte 0x8b > 2: 87 00 xchg %eax,(%rax) This one is hard to understand because the decoding of the instruction got cut off. But ... > RSP: 0018:ffffa63810733a70 EFLAGS: 00010282 > RAX: 78ac714f0997e100 RBX: ffffa63810733b40 RCX: 0000000000000000 > RDX: 0000000000004000 RSI: 0000000000000000 RDI: 00000000000347a0 RDI is kind of close to the fault address ... RDI is used as the first argument in the x86-64 SYSV ABI, and the first parameter to xfs_read_iomap_begin() is supposed to be a struct inode pointer. I don't think this is related. > We also have a deadlock reading a very specific file on this host. We managed to > do a kdump on this host and extracted out the state of the mapping. This is almost certainly a different bug, but alos XArray related, so I'll keep looking at this one.