Hi,

> On 13. Sep 2024, at 05:44, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> My current best guess is that we have an xarray node with a stray pointer
> in it; that the node is freed from one xarray, allocated to a different
> xarray, but not properly cleared. But I can't reproduce the problem,
> so that's pure speculation on my part.

I'd love to help with the reproduction. I understand that BZ is unloved, and I guess putting everything I've seen so far from various sources into a single spot might help - unfortunately that makes for a pretty long mail. I selectively didn't inline some of the more far-fetched things.

A tiny bit of context about me: I'm a seasoned developer, but not a kernel developer. I don't know the subsystems from a code perspective and stare at kernel code (or C code in general) mostly only when things go wrong. I've done my share of debugging hard problems over the last 25 years and I'm good at attacking things from multiple angles.

I have 9 non-production VMs that exhibited the issue last year. I can put those on custom-compiled kernels and instrument them as needed. Feel free to use me as a resource here.

Rabbit hole 1: the stalls and stack traces
==========================================

I've reviewed all of the stall messages (see below) that I could find and noticed:

- All of the affected VMs have at least 2 CPUs. I haven't seen this on any single-CPU VM AFAICT, but I wouldn't fully rule out that it can happen there as well. OTOH (obviously?) race conditions are easier to trigger on a multiprocessing machine than on a single core ... ;)

- I've only ever seen it on virtual machines. However, there's a Red Hat bug report from 2023 that shows data pointing to an Asus board, i.e. apparently a physical machine, so I guess the physical/virtual machine distinction is not relevant.
- Most call stacks come from the VFS, but I've seen two that originate from a page fault (if I'm reading things correctly) - so trying to swap a page in? That's interesting because it would hint at a reproducer that doesn't need any FS code to be involved.

Here are the stalls that I could recover from my efforts last year:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-....: (1 GPs behind) idle=bcec/1/0x4000000000000000 softirq=32229387/32229388 fqs=3407711 (t=6825807 jiffies g=51307757 q=12582143 ncpus=2)
CPU: 1 PID: 135430 Comm: systemd-journal Not tainted 6.1.57 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__rcu_read_unlock+0x1d/0x30
Code: ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 3c 25 c0 0b 02 00 83 af 64 04 00 00 01 75 0a 8b 87 68 04 00 00 <85> c0 75 05 c3 cc cc cc cc e9 45 fe ff ff 0f 1f 44 00 00 0f 1f 44
RSP: 0018:ffffa9c442887c78 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffca97c0ed4000 RCX: 0000000000000000
RDX: ffff88a1919bb6d0 RSI: ffff88a1919bb6d0 RDI: ffff88a187480000
RBP: 0000000000000044 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000100cca
R13: ffff88a2a48836b0 R14: 0000000000001be0 R15: ffffca97c0ed4000
FS: 00007fa45ec86c40(0000) GS:ffff88a2fad00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa45f436985 CR3: 000000010b4f8000 CR4: 00000000000006e0
Call Trace:
 <IRQ>
 ? rcu_dump_cpu_stacks+0xc8/0x100
 ? rcu_sched_clock_irq.cold+0x15b/0x2fb
 ? sched_slice+0x87/0x140
 ? perf_event_task_tick+0x64/0x370
 ? __cgroup_account_cputime_field+0x5b/0xa0
 ? update_process_times+0x77/0xb0
 ? tick_sched_handle+0x34/0x50
 ? tick_sched_timer+0x6f/0x80
 ? tick_sched_do_timer+0xa0/0xa0
 ? __hrtimer_run_queues+0x112/0x2b0
 ? hrtimer_interrupt+0xfe/0x220
 ? __sysvec_apic_timer_interrupt+0x7f/0x170
 ? sysvec_apic_timer_interrupt+0x99/0xc0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? __rcu_read_unlock+0x1d/0x30
 ? xas_load+0x30/0x40
 __filemap_get_folio+0x10a/0x370
 filemap_fault+0x139/0x910
 ? preempt_count_add+0x47/0xa0
 __do_fault+0x31/0x80
 do_fault+0x299/0x410
 __handle_mm_fault+0x623/0xb80
 handle_mm_fault+0xdb/0x2d0
 do_user_addr_fault+0x19c/0x560
 exc_page_fault+0x66/0x150
 asm_exc_page_fault+0x22/0x30
RIP: 0033:0x7fa45f4369af
Code: Unable to access opcode bytes at 0x7fa45f436985.
RSP: 002b:00007fff3ec0a580 EFLAGS: 00010246
RAX: 0000002537ea8ea4 RBX: 00007fff3ec0aab0 RCX: 0000000000000000
RDX: 00007fa45a3dffd0 RSI: 00007fa45a3e0010 RDI: 000055e348682520
RBP: 0000000000000015 R08: 000055e34862fd00 R09: 00007fff3ec0b1b0
R10: 0000000000000000 R11: 0000000000000000 R12: 00007fff3ec0a820
R13: 00007fff3ec0a640 R14: 2f4f057952ecadbd R15: 0000000000000000
 </TASK>

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-....: (21000 ticks this GP) idle=d1e4/1/0x4000000000000000 softirq=87308049/87308049 fqs=5541 (t=21002 jiffies g=363533457 q=100563 ncpus=5)
rcu: rcu_preempt kthread starved for 8417 jiffies! g363533457 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=4
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:0 pid:15 ppid:2 flags:0x00004000
Call Trace:
 <TASK>
 ? rcu_gp_cleanup+0x570/0x570
 __schedule+0x35d/0x1370
 ? get_nohz_timer_target+0x18/0x190
 ? _raw_spin_unlock_irqrestore+0x23/0x40
 ? __mod_timer+0x281/0x3d0
 ? rcu_gp_cleanup+0x570/0x570
 schedule+0x5d/0xe0
 schedule_timeout+0x94/0x150
 ? __bpf_trace_tick_stop+0x10/0x10
 rcu_gp_fqs_loop+0x15b/0x650
 rcu_gp_kthread+0x1a9/0x280
 kthread+0xe9/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 1 to CPUs 4:
NMI backtrace for cpu 4
CPU: 4 PID: 529675 Comm: connection Not tainted 6.1.57 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:xas_descend+0x3/0x90
Code: 48 8b 57 08 48 89 57 10 e9 3a c6 2c 00 48 8b 57 10 48 89 07 48 c1 e8 20 48 89 57 08 e9 26 c6 2c 00 cc cc cc cc cc cc 0f b6 0e <48> 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48
RSP: 0018:ffffa37c47ccfbf8 EFLAGS: 00000246
RAX: ffff92832453e912 RBX: ffffa37c47ccfd78 RCX: 0000000000000000
RDX: 0000000000000002 RSI: ffff92832453e910 RDI: ffffa37c47ccfc08
RBP: 0000000000006305 R08: ffffa37c47ccfe70 R09: ffff92830f538138
R10: ffffa37c47ccfe68 R11: ffff92830f538138 R12: 0000000000006305
R13: ffff92832b518900 R14: 0000000000006305 R15: ffffa37c47ccfe98
FS: 00007fcbee42b6c0(0000) GS:ffff9287a9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc0b10d0d8 CR3: 0000000107632000 CR4: 00000000003506e0
Call Trace:
 <NMI>
 ? nmi_cpu_backtrace.cold+0x1b/0x76
 ? nmi_cpu_backtrace_handler+0xd/0x20
 ? nmi_handle+0x5d/0x120
 ? xas_descend+0x3/0x90
 ? default_do_nmi+0x69/0x170
 ? exc_nmi+0x13c/0x170
 ? end_repeat_nmi+0x16/0x67
 ? xas_descend+0x3/0x90
 ? xas_descend+0x3/0x90
 ? xas_descend+0x3/0x90
 </NMI>
 <TASK>
 xas_load+0x30/0x40
 filemap_get_read_batch+0x16e/0x250
 filemap_get_pages+0xa9/0x630
 ? current_time+0x3c/0x100
 ? atime_needs_update+0x104/0x180
 ? touch_atime+0x46/0x1f0
 filemap_read+0xd2/0x340
 xfs_file_buffered_read+0x4f/0xd0 [xfs]
 xfs_file_read_iter+0x6a/0xd0 [xfs]
 vfs_read+0x23c/0x310
 ksys_read+0x6b/0xf0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x64/0xce
RIP: 0033:0x7fd0ccf0f78c
Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 a9 bb f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff bb f8 ff 48
RSP: 002b:00007fcbee427320 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd0ccf0f78c
RDX: 0000000000000014 RSI: 00007fcbee427500 RDI: 0000000000000129
RBP: 00007fcbee427430 R08: 0000000000000000 R09: 00a9b630ab4578b9
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fcbee42a9f8
R13: 0000000000000014 R14: 00000000040ef680 R15: 0000000000000129
 </TASK>
CPU: 1 PID: 529591 Comm: connection Not tainted 6.1.57 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:xas_descend+0x18/0x90
Code: c1 e8 20 48 89 57 08 e9 26 c6 2c 00 cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 <48> 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d fd 00 00 00
RSP: 0018:ffffa37c47b7fbf8 EFLAGS: 00000216
RAX: fffff6e88448e000 RBX: ffffa37c47b7fd78 RCX: 0000000000000000
RDX: 000000000000000d RSI: ffff92832453e910 RDI: ffffa37c47b7fc08
RBP: 000000000000630d R08: ffffa37c47b7fe70 R09: ffff92830f538138
R10: ffffa37c47b7fe68 R11: ffff92830f538138 R12: 000000000000630d
R13: ffff92830b9a3b00 R14: 000000000000630d R15: ffffa37c47b7fe98
FS: 00007fcbf07bb6c0(0000) GS:ffff9287a9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc0ac8e360 CR3: 0000000107632000 CR4: 00000000003506e0
Call Trace:
 <IRQ>
 ? rcu_dump_cpu_stacks+0xc8/0x100
 ? rcu_sched_clock_irq.cold+0x15b/0x2fb
 ? sched_slice+0x87/0x140
 ? perf_event_task_tick+0x64/0x370
 ? __cgroup_account_cputime_field+0x5b/0xa0
 ? update_process_times+0x77/0xb0
 ? tick_sched_handle+0x34/0x50
 ? tick_sched_timer+0x6f/0x80
 ? tick_sched_do_timer+0xa0/0xa0
 ? __hrtimer_run_queues+0x112/0x2b0
 ? hrtimer_interrupt+0xfe/0x220
 ? __sysvec_apic_timer_interrupt+0x7f/0x170
 ? sysvec_apic_timer_interrupt+0x99/0xc0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? xas_descend+0x18/0x90
 xas_load+0x30/0x40
 filemap_get_read_batch+0x16e/0x250
 filemap_get_pages+0xa9/0x630
 ? current_time+0x3c/0x100
 ? atime_needs_update+0x104/0x180
 ? touch_atime+0x46/0x1f0
 filemap_read+0xd2/0x340
 xfs_file_buffered_read+0x4f/0xd0 [xfs]
 xfs_file_read_iter+0x6a/0xd0 [xfs]
 vfs_read+0x23c/0x310
 ksys_read+0x6b/0xf0
 do_syscall_64+0x3a/0x90
 </TASK>

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (21000 ticks this GP) idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704 (t=21002 jiffies g=167843445 q=13889 ncpus=3)
CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
RIP: 0010:xas_descend+0x26/0x70
Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48 3d fd 00 00 0>
RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX: 0000000000000002
RDX: 0000000000000005 RSI: ffff988876d3c000 RDI: ffffb427c4917c00
RBP: 000000000000f177 R08: ffffb427c4917e68 R09: ffff988846485d38
R10: ffffb427c4917e60 R11: ffff988846485d38 R12: 000000000000f177
R13: ffff988827b4ae00 R14: 000000000000f176 R15: ffffb427c4917e90
FS: 00007ff8de817800(0000) GS:ffff98887ac00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4: 00000000000006f0
(stack trace is missing in this one)

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-....: (20915 ticks this GP) idle=4b8c/1/0x4000000000000000 softirq=138338523/138338526 fqs=6063 (t=21000 jiffies g=180955121 q=35490 ncpus=2)
CPU: 1 PID: 1415835 Comm: .postgres-wrapp Not tainted 6.1.57 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:filemap_get_read_batch+0x16e/0x250
Code: 85 ff 00 00 00 48 83 c4 40 5b 5d c3 cc cc cc cc f0 ff 0e 0f 84 e1 00 00 00 48 c7 44 24 18 03 00 00 00 48 89 e7 e8 42 ab 6d 00 <48> 89 c7 48 85 ff 74 ba 48 81 ff 06 04 00 00 0f 85 fe fe ff ff 48
RSP: 0018:ffffac01c6887c00 EFLAGS: 00000246
RAX: ffffe5a104574000 RBX: ffffac01c6887d70 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff96db861bcb68 RDI: ffffac01c6887c00
RBP: 0000000000014781 R08: ffffac01c6887e68 R09: ffff96dad46fad38
R10: ffffac01c6887e60 R11: ffff96dad46fad38 R12: 0000000000014781
R13: ffff96db86f47000 R14: 0000000000014780 R15: ffffac01c6887e90
FS: 00007f9ba0a12800(0000) GS:ffff96dbbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9b5050a018 CR3: 000000010ac82000 CR4: 00000000000006e0
Call Trace:
 <IRQ>
 ? rcu_dump_cpu_stacks+0xc8/0x100
 ? rcu_sched_clock_irq.cold+0x15b/0x2fb
 ? sched_slice+0x87/0x140
 ? timekeeping_update+0xdd/0x130
 ? __cgroup_account_cputime_field+0x5b/0xa0
 ? update_process_times+0x77/0xb0
 ? update_wall_time+0xc/0x20
 ? tick_sched_handle+0x34/0x50
 ? tick_sched_timer+0x6f/0x80
 ? tick_sched_do_timer+0xa0/0xa0
 ? __hrtimer_run_queues+0x112/0x2b0
 ? hrtimer_interrupt+0xfe/0x220
 ? __sysvec_apic_timer_interrupt+0x7f/0x170
 ? sysvec_apic_timer_interrupt+0x99/0xc0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? filemap_get_read_batch+0x16e/0x250
 filemap_get_pages+0xa9/0x630
 ? iomap_iter+0x78/0x310
 ? iomap_file_buffered_write+0x8f/0x2f0
 filemap_read+0xd2/0x340
 xfs_file_buffered_read+0x4f/0xd0 [xfs]
 xfs_file_read_iter+0x6a/0xd0 [xfs]
 vfs_read+0x23c/0x310
 __x64_sys_pread64+0x94/0xc0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x64/0xce
RIP: 0033:0x7f9ba0b0d787
Code: 48 e8 5d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90 f3 0f 1e fa 80 3d 7d bc 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
RSP: 002b:00007ffe56bb0878 EFLAGS: 00000202 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ba0b0d787
RDX: 0000000000002000 RSI: 00007f9b5c85ce80 RDI: 000000000000003a
RBP: 0000000000000001 R08: 000000000a00000d R09: 0000000000000000
R10: 0000000014780000 R11: 0000000000000202 R12: 00007f9b90052ab0
R13: 00005566dc227f75 R14: 00005566dc22c510 R15: 00005566de3cf0c0
 </TASK>

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-....: (21000 ticks this GP) idle=947c/1/0x4000000000000000 softirq=299845076/299845076 fqs=5249 (t=21002 jiffies g=500931101 q=17117 ncpus=2)
CPU: 1 PID: 1660396 Comm: nix-collect-gar Not tainted 6.1.55 #1-NixOS
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__xas_next+0x0/0xe0
Code: 48 3d 00 10 00 00 77 c8 48 89 c8 c3 cc cc cc cc e9 f5 fe ff ff 48 c7 47 18 01 00 00 00 31 c9 48 89 c8 c3 cc cc cc cc 0f 1f 00 <48> 8b 47 18 a8 02 75 0e 48 83 47 08 01 48 85 c0 0f 84 b5 00 00 00
RSP: 0018:ffffb170866f7bf8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffb170866f7d70 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff8ab97d5306d8 RDI: ffffb170866f7c00
RBP: 00000000000011e4 R08: 0000000000000000 R09: ffff8ab9a4dc3d38
R10: ffffb170866f7e60 R11: ffff8ab9a4dc3d38 R12: 00000000000011e4
R13: ffff8ab946fda400 R14: 00000000000011e4 R15: ffffb170866f7e90
FS: 00007f17d22e3f80(0000) GS:ffff8ab9bdd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000013279e8 CR3: 00000000137c8000 CR4: 00000000000006e0
Call Trace:
 <IRQ>
 ? rcu_dump_cpu_stacks+0xc8/0x100
 ? rcu_sched_clock_irq.cold+0x15b/0x2fb
 ? sched_slice+0x87/0x140
 ? perf_event_task_tick+0x64/0x370
 ? __cgroup_account_cputime_field+0x5b/0xa0
 ? update_process_times+0x77/0xb0
 ? tick_sched_handle+0x34/0x50
 ? tick_sched_timer+0x6f/0x80
 ? tick_sched_do_timer+0xa0/0xa0
 ? __hrtimer_run_queues+0x112/0x2b0
 ? hrtimer_interrupt+0xfe/0x220
 ? __sysvec_apic_timer_interrupt+0x7f/0x170
 ? sysvec_apic_timer_interrupt+0x99/0xc0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? __xas_prev+0xe0/0xe0
 ? xas_load+0x30/0x40
 filemap_get_read_batch+0x16e/0x250
 filemap_get_pages+0xa9/0x630
 ? atime_needs_update+0x104/0x180
 ? touch_atime+0x46/0x1f0
 filemap_read+0xd2/0x340
 xfs_file_buffered_read+0x4f/0xd0 [xfs]
 xfs_file_read_iter+0x6a/0xd0 [xfs]
 vfs_read+0x23c/0x310
 __x64_sys_pread64+0x94/0xc0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x64/0xce
RIP: 0033:0x7f17d3a2d7c7
Code: 08 89 3c 24 48 89 4c 24 18 e8 75 db f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 c5 db f8 ff 48 8b
RSP: 002b:00007ffffd9d0fb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f17d3a2d7c7
RDX: 0000000000001000 RSI: 000056435be0ccf8 RDI: 0000000000000006
RBP: 00000000011e4000 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000011e4000 R11: 0000000000000293 R12: 0000000000001000
R13: 000056435be0ccf8 R14: 0000000000001000 R15: 000056435bdea370
 </TASK>

I've pulled together the various states the stall was detected in, in a more compact form:

 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? __rcu_read_unlock+0x1d/0x30
 ? xas_load+0x30/0x40
 __filemap_get_folio+0x10a/0x370
 filemap_fault+0x139/0x910

 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? __xas_prev+0xe0/0xe0
 ? xas_load+0x30/0x40
 filemap_get_read_batch+0x16e/0x250

 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? filemap_get_read_batch+0x16e/0x250
 xas_load+0x30/0x40
 filemap_get_read_batch+0x16e/0x250

 RIP: 0010:xas_descend+0x26/0x70 (this one was missing the stack trace)

I tried reading through the xarray code, but my C and kernel knowledge is stretched thin trying to understand some of the internals: I couldn't figure out how __rcu_read_unlock can appear from within xas_load, and similarly __xas_prev. I stopped diving deeper at that point.

My original bug report also includes an initial capture of multiple stall reports over time on a single machine, where the situation unfolded with different stack traces over many hours. It's a bit long, so I'm opting to provide the link: https://bugzilla.kernel.org/show_bug.cgi?id=217572#c0

I was also wondering whether the stall is stuck or spinning; one of my early comments noted that with 3 CPUs a total of 60% was spent in system time, so this sounds like it might be spinning between xas_load and xas_descend. I see there is some kind of retry mechanism in there, plus while-loops that might get stuck if the data structures are borked. I think it's alternating between xas_load and xas_descend, though, so not stuck in xas_descend's loop itself.

The Red Hat report "large folio related page cache iteration hang" (https://bugzilla.redhat.com/show_bug.cgi?id=2213967) also shows a "kernel BUG" message in addition to the known stack around xas_load:

kernel: watchdog: BUG: soft lockup - CPU#28 stuck for 26s!
[rocksdb:low:2195]
kernel: Modules linked in: tls nf_conntrack_netbios_ns nf_conntrack_broadcast nft_masq nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel tcp_bbr rfkill ip_set nf_tables nfnetlink nct6775 nct6775_core tun hwmon_vid jc42 vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core kvm snd_hwdep snd_pcm snd_timer cdc_ether irqbypass acpi_ipmi snd usbnet wmi_bmof rapl ipmi_si k10temp soundcore i2c_piix4 joydev mii ipmi_devintf ipmi_msghandler fuse loop xfs uas usb_storage raid1 hid_cp2112 igb crct10dif_pclmul ast crc32_pclmul nvme crc32c_intel polyval_clmulni dca polyval_generic i2c_algo_bit nvme_core ghash_clmulni_intel ccp sha512_ssse3 wmi sp5100_tco nvme_common
kernel: CPU: 28 PID: 2195 Comm: rocksdb:low Not tainted 6.3.5-100.fc37.x86_64 #1
kernel: Hardware name: To Be Filled By O.E.M. X570D4U/X570D4U, BIOS T1.29b 05/17/2022
kernel: RIP: 0010:xas_load+0x45/0x50
kernel: Code: 3d 00 10 00 00 77 07 5b 5d c3 cc cc cc cc 0f b6 4b 10 48 8d 68 fe 38 48 fe 72 ec 48 89 ee 48 89 df e8 cf fd ff ff 80 7d 00 00 <75> c7 eb d9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
kernel: RSP: 0018:ffffaab80392fb40 EFLAGS: 00000246
kernel: RAX: fffff69f82a7c000 RBX: ffffaab80392fb58 RCX: 0000000000000000
kernel: RDX: 0000000000000010 RSI: ffff94a4268a6480 RDI: ffffaab80392fb58
kernel: RBP: ffff94a4268a6480 R08: 0000000000000000 R09: 000000000000424a
kernel: R10: ffff94af1ec69ab0 R11: 0000000000000000 R12: 0000000000001610
kernel: R13: 000000000000160c R14: 000000000000160c R15: ffffaab80392fdf0
kernel: FS: 00007f49f7bfe6c0(0000) GS:ffff94b63f100000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f01446e9000 CR3: 000000014a4be000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <IRQ>
kernel:  ? watchdog_timer_fn+0x1a8/0x210
kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
kernel:  ? __hrtimer_run_queues+0x112/0x2b0
kernel:  ? hrtimer_interrupt+0xf8/0x230
kernel:  ? __sysvec_apic_timer_interrupt+0x61/0x130
kernel:  ? sysvec_apic_timer_interrupt+0x6d/0x90
kernel:  </IRQ>
kernel:  <TASK>
kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
kernel:  ? xas_load+0x45/0x50
kernel:  filemap_get_read_batch+0x179/0x270
kernel:  filemap_get_pages+0xab/0x6a0
kernel:  ? touch_atime+0x48/0x1b0
kernel:  ? filemap_read+0x33f/0x350
kernel:  filemap_read+0xdf/0x350
kernel:  xfs_file_buffered_read+0x4f/0xd0 [xfs]
kernel:  xfs_file_read_iter+0x74/0xe0 [xfs]
kernel:  vfs_read+0x240/0x310
kernel:  __x64_sys_pread64+0x98/0xd0
kernel:  do_syscall_64+0x5f/0x90
kernel:  ? native_flush_tlb_local+0x34/0x40
kernel:  ? flush_tlb_func+0x10d/0x240
kernel:  ? do_syscall_64+0x6b/0x90
kernel:  ? sched_clock_cpu+0xf/0x190
kernel:  ? irqtime_account_irq+0x40/0xc0
kernel:  ? __irq_exit_rcu+0x4b/0xf0
kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
kernel: RIP: 0033:0x7f4a0c23c227
kernel: Code: 08 89 3c 24 48 89 4c 24 18 e8 b5 e3 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 05 e4 f8 ff 48 8b
kernel: RSP: 002b:00007f49f7bf8310 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
kernel: RAX: ffffffffffffffda RBX: 000000000000424a RCX: 00007f4a0c23c227
kernel: RDX: 000000000000424a RSI: 00007f04294a35c0 RDI: 00000000000004be
kernel: RBP: 00007f49f7bf8460 R08: 0000000000000000 R09: 00007f49f7bf84a0
kernel: R10: 000000000160c718 R11: 0000000000000293 R12: 000000000000424a
kernel: R13: 000000000160c718 R14: 00007f04294a35c0 R15: 0000000000000000
kernel: </TASK>
...
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/inode.c:612!
kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 2195 Comm: rocksdb:low Tainted: G L 6.3.5-100.fc37.x86_64 #1
kernel: Hardware name: To Be Filled By O.E.M. X570D4U/X570D4U, BIOS T1.29b 05/17/2022
kernel: RIP: 0010:clear_inode+0x76/0x80
kernel: Code: 2d a8 40 75 2b 48 8b 93 28 01 00 00 48 8d 83 28 01 00 00 48 39 c2 75 1a 48 c7 83 98 00 00 00 60 00 00 00 5b 5d c3 cc cc cc cc <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 90 90 90 90 90 90 90 90 90 90 90 90
kernel: RSP: 0018:ffffaab80392fe58 EFLAGS: 00010002
kernel: RAX: 0000000000000000 RBX: ffff94af1ec69938 RCX: 0000000000000000
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff94af1ec69ab8
kernel: RBP: ffff94af1ec69ab8 R08: ffffaab80392fd38 R09: 0000000000000002
kernel: R10: 0000000000000001 R11: 0000000000000005 R12: ffffffffc08b9860
kernel: R13: ffff94af1ec69938 R14: 00000000ffffff9c R15: ffff94979dd5da40
kernel: FS: 00007f49f7bfe6c0(0000) GS:ffff94b63ef40000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007eefca8e2000 CR3: 000000014a4be000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? die+0x36/0x90
kernel:  ? do_trap+0xda/0x100
kernel:  ? clear_inode+0x76/0x80
kernel:  ? do_error_trap+0x6a/0x90
kernel:  ? clear_inode+0x76/0x80
kernel:  ? exc_invalid_op+0x50/0x70
kernel:  ? clear_inode+0x76/0x80
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? clear_inode+0x76/0x80
kernel:  ? clear_inode+0x1d/0x80
kernel:  evict+0x1b8/0x1d0
kernel:  do_unlinkat+0x174/0x320
kernel:  __x64_sys_unlink+0x42/0x70
kernel:  do_syscall_64+0x5f/0x90
kernel:  ? __irq_exit_rcu+0x4b/0xf0
kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
kernel: RIP: 0033:0x7f4a0c23faab
kernel: Code: f0 ff ff 73 01 c3 48 8b 0d 82 63 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 63 0d 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007f49f7bfab58 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
kernel: RAX: ffffffffffffffda RBX: 00007f49f7bfac38 RCX: 00007f4a0c23faab
kernel: RDX: 00007f49f7bfadd0 RSI: 00007f4a0bc2fd30 RDI: 00007f49dd3c32d0
kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: ffffffffffffdf58 R11: 0000000000000206 R12: 0000000000280bc0
kernel: R13: 00007f4a0bca77b8 R14: 00007f49f7bfadd0 R15: 00007f49f7bfadd0
kernel: </TASK>
kernel: Modules linked in: tls nf_conntrack_netbios_ns nf_conntrack_broadcast nft_masq nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel tcp_bbr rfkill ip_set nf_tables nfnetlink nct6775 nct6775_core tun hwmon_vid jc42 vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core kvm snd_hwdep snd_pcm snd_timer cdc_ether irqbypass acpi_ipmi snd usbnet wmi_bmof rapl ipmi_si k10temp soundcore i2c_piix4 joydev mii ipmi_devintf ipmi_msghandler fuse loop xfs uas usb_storage raid1 hid_cp2112 igb crct10dif_pclmul ast crc32_pclmul nvme crc32c_intel polyval_clmulni dca polyval_generic i2c_algo_bit nvme_core ghash_clmulni_intel ccp sha512_ssse3 wmi sp5100_tco nvme_common
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:clear_inode+0x76/0x80
kernel: Code: 2d a8 40 75 2b 48 8b 93 28 01 00 00 48 8d 83 28 01 00 00 48 39 c2 75 1a 48 c7 83 98 00 00 00 60 00 00 00 5b 5d c3 cc cc cc cc <0f> 0b 0f 0b 0f 0b 0f 0b 90 90 90 90 90 90 90 90 90 90 90 90
kernel: RSP: 0018:ffffaab80392fe58 EFLAGS: 00010002
kernel: RAX: 0000000000000000 RBX: ffff94af1ec69938 RCX: 0000000000000000
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff94af1ec69ab8
kernel: RBP: ffff94af1ec69ab8 R08: ffffaab80392fd38 R09: 0000000000000002
kernel: R10: 0000000000000001 R11: 0000000000000005 R12: ffffffffc08b9860
kernel: R13: ffff94af1ec69938 R14: 00000000ffffff9c R15: ffff94979dd5da40
kernel: FS: 00007f49f7bfe6c0(0000) GS:ffff94b63ef40000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007eefca8e2000 CR3: 000000014a4be000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: note: rocksdb:low[2195] exited with irqs disabled
kernel: note: rocksdb:low[2195] exited with preempt_count 1

The above report shows rocksdb as the workload, with relatively short uptimes of around 30 minutes. Maybe there's a reproducer around there somewhere? I've CCed the reporter to maybe get some insight into his workload.

Rabbit hole 2: things that were already considered
==================================================

There were a number of potential vectors and bugfixes that were discussed/referenced but haven't turned out to fix the issue overall. Some of them might be obvious red herrings by now, but I'm not sure which:

* [GIT PULL] xfs, iomap: fix data corruption due to stale cached iomaps (https://lore.kernel.org/linux-fsdevel/20221129001632.GX3600936@xxxxxxxxxxxxxxxxxxx/)

* cbc02854331e ("XArray: Do not return sibling entries from xa_load()") did not help here.

* I think I've seen that the affected data on disk ended up being null bytes, but I can't verify that.
* There was a fix close to this in "__filemap_get_folio and NULL pointer dereference" (https://bugzilla.kernel.org/show_bug.cgi?id=217441) and in "having TRANSPARENT_HUGEPAGE enabled hangs some applications (supervisor read access in kernel mode)" (https://bugzilla.kernel.org/show_bug.cgi?id=216646), but their traces looked slightly different from the ones discussed here, as did their outcomes. Interestingly, those are also on the page fault path, not an FS path.

* memcg was in the stack and under suspicion at some point, but the problem also happens without it.

* I was wondering whether increased readahead sizes might cause issues (most of our VMs run with 128 KiB, but DB VMs run with 1 MiB). However, this might also be a red herring, as the single- vs. multi-core situation correlates strongly in our case.

Maybe offtopic, but maybe it spurs ideas - a situation that felt similar to the stalls here: I remember debugging a memory issue in Python's small object allocator a number of years ago that resulted in segfaults, and I wonder whether the stalls we're seeing are only a delayed symptom of an earlier corruption somewhere else. The Python issue was a third-party module that caused an out-of-bounds write into an adjacent byte that was used as a pointer for arena management. That one was also extremely hard to track down due to this indirection / "magic at a distance" behaviour.

That's all the input I have. My offer stands: I can take time and run a number of machines that exhibited the behaviour on custom kernels to gather data.

Cheers,
Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick