On Mon, Jul 29, 2024 at 11:18 AM Max Kellermann <max.kellermann@xxxxxxxxx> wrote:
> I posted two candidate patches which both fix this bug;
>
> Minimal fix: https://lore.kernel.org/lkml/20240729090639.852732-1-max.kellermann@xxxxxxxxx/
> Fix which removes a bunch of obsolete code:
> https://lore.kernel.org/lkml/20240729091532.855688-1-max.kellermann@xxxxxxxxx/

These patches do fix the RCU stall bug (and should be merged), but
after running one cluster with my patch for a while, I found more Ceph
crashes:

------------[ cut here ]------------
WARNING: CPU: 3 PID: 1925 at fs/ceph/caps.c:3386 ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
Modules linked in:
CPU: 3 PID: 1925 Comm: kworker/3:2 Not tainted 6.10.2-cm4all1-vm+ #168
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: ceph-cap ceph_cap_reclaim_work
RIP: 0010:ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
Code: 30 45 89 f5 bd 01 00 00 00 41 83 c6 01 31 d2 e9 fa fe ff ff 45 8d 6e ff 31 ed 31 d2 48 83 bb 18 04 00 00 00 0f 84 e4 fe ff ff <0f> 0b e9 dd fe ff ff 45 8d 6e ff bd 01 00 00 00 ba 01 00 00 00 48
RSP: 0018:ffffb9a7406cba78 EFLAGS: 00010282
RAX: ffff9a2d42b7eb20 RBX: ffff9a2d42b7e688 RCX: ffffdbdadc446d80
RDX: 0000000000000000 RSI: ffff9a2d42b7eb18 RDI: ffff9a2d42b7e940
RBP: 0000000000000000 R08: ffffffffffffffc0 R09: ffff9a2d4254fc40
R10: 0000000000000020 R11: fefefefefefefeff R12: ffff9a2d42b7e940
R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9a384eec0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b248c82657 CR3: 000000010d31c002 CR4: 00000000001706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn+0x7c/0x110
 ? ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
 ? report_bug+0x14c/0x170
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? ceph_put_wrbuffer_cap_refs+0x1bb/0x1f0
 ? ceph_put_wrbuffer_cap_refs+0x27/0x1f0
 ceph_invalidate_folio+0x9a/0xc0
 truncate_cleanup_folio+0x52/0x90
 truncate_inode_pages_range+0xfe/0x400
 ceph_evict_inode+0x40/0x200
 evict+0xc5/0x170
 __dentry_kill+0x6e/0x160
 dput+0xcb/0x180
 __dentry_leases_walk+0x28d/0x430
 ceph_trim_dentries+0xac/0x100
 ceph_cap_reclaim_work+0x15/0x50
 process_one_work+0x138/0x2e0
 worker_thread+0x2b9/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xba/0xe0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
BUG: kernel NULL pointer dereference, address: 0000000000000356
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP PTI
CPU: 3 PID: 1925 Comm: kworker/3:2 Tainted: G W 6.10.2-cm4all1-vm+ #168
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: ceph-cap ceph_cap_reclaim_work
RIP: 0010:ceph_put_snap_context+0xf/0x30
Code: 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 12 b8 ff ff ff ff <f0> 0f c1 07 83 f8 01 74 09 85 c0 7e 0a c3 cc cc cc cc e9 3a 62 70
RSP: 0018:ffffb9a7406cbaa8 EFLAGS: 00010206
RAX: 00000000ffffffff RBX: ffffdbdadc4465c0 RCX: ffffdbdadc446d80
RDX: 0000000000000000 RSI: ffff9a2d42b7eb18 RDI: 0000000000000356
RBP: 0000000000001000 R08: ffffffffffffffc0 R09: ffff9a2d4254fc40
R10: 0000000000000020 R11: fefefefefefefeff R12: 0000000000000356
R13: ffff9a2d42b7e688 R14: ffffffffffffffff R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff9a384eec0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000356 CR3: 000000010d31c002 CR4: 00000000001706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __die+0x1f/0x60
 ? page_fault_oops+0x158/0x450
 ? search_extable+0x22/0x30
 ? ceph_put_snap_context+0xf/0x30
 ? search_module_extables+0xe/0x40
 ? exc_page_fault+0x62/0x120
 ? asm_exc_page_fault+0x22/0x30
 ? ceph_put_snap_context+0xf/0x30
 ceph_invalidate_folio+0xa2/0xc0
 truncate_cleanup_folio+0x52/0x90
 truncate_inode_pages_range+0xfe/0x400
 ceph_evict_inode+0x40/0x200
 evict+0xc5/0x170
 __dentry_kill+0x6e/0x160
 dput+0xcb/0x180
 __dentry_leases_walk+0x28d/0x430
 ceph_trim_dentries+0xac/0x100
 ceph_cap_reclaim_work+0x15/0x50
 process_one_work+0x138/0x2e0
 worker_thread+0x2b9/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xba/0xe0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
CR2: 0000000000000356
---[ end trace 0000000000000000 ]---
RIP: 0010:ceph_put_snap_context+0xf/0x30
Code: 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 12 b8 ff ff ff ff <f0> 0f c1 07 83 f8 01 74 09 85 c0 7e 0a c3 cc cc cc cc e9 3a 62 70
RSP: 0018:ffffb9a7406cbaa8 EFLAGS: 00010206
RAX: 00000000ffffffff RBX: ffffdbdadc4465c0 RCX: ffffdbdadc446d80
RDX: 0000000000000000 RSI: ffff9a2d42b7eb18 RDI: 0000000000000356
RBP: 0000000000001000 R08: ffffffffffffffc0 R09: ffff9a2d4254fc40
R10: 0000000000000020 R11: fefefefefefefeff R12: 0000000000000356
R13: ffff9a2d42b7e688 R14: ffffffffffffffff R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff9a384eec0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000356 CR3: 000000010d31c002 CR4: 00000000001706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
note: kworker/3:2[1925] exited with irqs disabled

The bug hunt continues.

Max