Hi,

I'm using Linux 5.4.69 with the following two patches applied for bcache:

commit 125d98edd114 ("bcache: remove member accessed from struct btree")
commit d5c9c470b011 ("bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()")

I'm using bcache in write-back mode... the cache device is a RAID1 mirror set using NVMe drives, and several backing devices are associated with that cache device. While driving I/O, I experienced the following kernel panic:

  SYSTEM MAP: /home/marc.smith/Downloads/System.map-esos.prod
DEBUG KERNEL: /home/marc.smith/Downloads/vmlinux-esos.prod (5.4.69-esos.prod)
    DUMPFILE: /home/marc.smith/Downloads/dumpfile-1604062993
        CPUS: 8
        DATE: Fri Oct 30 09:02:56 2020
      UPTIME: 2 days, 12:38:15
LOAD AVERAGE: 9.48, 8.89, 7.69
       TASKS: 980
    NODENAME: node-10cccd-2
     RELEASE: 5.4.69-esos.prod
     VERSION: #1 SMP Thu Oct 22 19:45:11 UTC 2020
     MACHINE: x86_64  (2799 Mhz)
      MEMORY: 24 GB
       PANIC: "Oops: 0002 [#1] SMP NOPTI" (check log for details)
         PID: 18272
     COMMAND: "kworker/2:13"
        TASK: ffff88841d9e8000  [THREAD_INFO: ffff88841d9e8000]
         CPU: 2
       STATE: TASK_UNINTERRUPTIBLE (PANIC)

crash> bt
PID: 18272  TASK: ffff88841d9e8000  CPU: 2  COMMAND: "kworker/2:13"
 #0 [ffffc90000100938] machine_kexec at ffffffff8103d6b5
 #1 [ffffc90000100980] __crash_kexec at ffffffff8110d37b
 #2 [ffffc90000100a48] crash_kexec at ffffffff8110e07d
 #3 [ffffc90000100a58] oops_end at ffffffff8101a9de
 #4 [ffffc90000100a78] no_context at ffffffff81045e99
 #5 [ffffc90000100ae0] async_page_fault at ffffffff81e010cf
    [exception RIP: atomic_try_cmpxchg+2]
    RIP: ffffffff810d3e3b  RSP: ffffc90000100b98  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000003  RCX: 0000000000080006
    RDX: 0000000000000001  RSI: ffffc90000100ba4  RDI: 0000000000000a6c
    RBP: 0000000000000010   R8: 0000000000000001   R9: ffffffffa0418d4e
    R10: ffff88841c8b3000  R11: ffff88841c8b3000  R12: 0000000000000046
    R13: 0000000000000000  R14: ffff8885a3a0a000  R15: 0000000000000a6c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffc90000100b98] _raw_spin_lock_irqsave at ffffffff81cf7d7d
 #7 [ffffc90000100bb8] try_to_wake_up at ffffffff810c1624
 #8 [ffffc90000100c08] closure_sync_fn at ffffffffa040fb07 [bcache]
 #9 [ffffc90000100c10] clone_endio at ffffffff81aac48c
#10 [ffffc90000100c40] call_bio_endio at ffffffff81a78e20
#11 [ffffc90000100c58] raid_end_bio_io at ffffffff81a78e69
#12 [ffffc90000100c88] raid1_end_write_request at ffffffff81a79ad9
#13 [ffffc90000100cf8] blk_update_request at ffffffff814c3ab1
#14 [ffffc90000100d38] blk_mq_end_request at ffffffff814caaf2
#15 [ffffc90000100d50] blk_mq_complete_request at ffffffff814c91c1
#16 [ffffc90000100d78] nvme_complete_cqes at ffffffffa002fb03 [nvme]
#17 [ffffc90000100db8] nvme_irq at ffffffffa002fb7f [nvme]
#18 [ffffc90000100de0] __handle_irq_event_percpu at ffffffff810e0d60
#19 [ffffc90000100e20] handle_irq_event_percpu at ffffffff810e0e65
#20 [ffffc90000100e48] handle_irq_event at ffffffff810e0ecb
#21 [ffffc90000100e60] handle_edge_irq at ffffffff810e494d
#22 [ffffc90000100e78] do_IRQ at ffffffff81e01900
#23 [ffffc90000100eb0] common_interrupt at ffffffff81e00a0a
#24 [ffffc90000100f38] __softirqentry_text_start at ffffffff8200006a
#25 [ffffc90000100fc8] irq_exit at ffffffff810a3f6a
#26 [ffffc90000100fd0] smp_apic_timer_interrupt at ffffffff81e020b2
bt: invalid kernel virtual address: ffffc90000101000  type: "pt_regs"
crash>

I noticed in the call trace that closure_sync_fn() sits just before the wake-up (try_to_wake_up()); I saw one patch from a year ago for closure_sync_fn(), but of course that is already applied in 5.4.69:

https://lore.kernel.org/patchwork/patch/1086698/
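For anyone not looking at the bcache source, the path in question is roughly as follows; this is my abbreviated paraphrase of the closure sync handshake in drivers/md/bcache/closure.c as of 5.4 (including the fix from the patch above), so treat it as a sketch rather than the verbatim code:

struct closure_syncer {
	struct task_struct	*task;	/* the waiter parked in __closure_sync() */
	int			done;
};

/* Runs in completion context -- in the trace above: nvme IRQ ->
 * raid1_end_write_request() -> clone_endio() -> closure_sync_fn(). */
static void closure_sync_fn(struct closure *cl)
{
	struct closure_syncer *s = cl->s;
	struct task_struct *p;

	/* The fix linked above: load the task pointer before setting
	 * s->done, since the waiter's stack frame (which holds *s) can
	 * go away as soon as done is observed. */
	rcu_read_lock();
	p = READ_ONCE(s->task);
	s->done = 1;
	wake_up_process(p);
	rcu_read_unlock();
}

void __sched __closure_sync(struct closure *cl)
{
	struct closure_syncer s = { .task = current };

	cl->s = &s;
	continue_at(cl, closure_sync_fn, NULL);	/* arm the wake-up */

	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (s.done)
			break;
		schedule();
	}

	__set_current_state(TASK_RUNNING);
}

So if cl->s or s->task is stale by the time the completion fires, wake_up_process() takes p->pi_lock on a garbage task pointer, which would line up with the fault in _raw_spin_lock_irqsave()/atomic_try_cmpxchg() above (note RDI = 0xa6c, i.e., a small offset into a near-NULL pointer). That's speculation on my part, though.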
I haven't encountered this panic in any prior testing, so it appears to be rare so far. I'm not sure what else could be done to debug this; I'll continue testing with heavy I/O to see if it can be reproduced.

--Marc