Hello, Note: dmesg from the crash is included at the bottom of this email. Arch: x86_64 Kernel: 6.9.7 Latest BIOS from Manufacturer:3603 (as of 7/2/2024) Latest NVME Firmware available from Manufacturer: 4B2QJXD7 (as of 7/2/2024) When not using the following kernel options: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off With 6.1.x, the kernel still panics even with the above options. I am testing 6.9.7 to see if the same happens with this version. Does anyone know why this issue continues to occur? I did find a summary of this issue here from Claudio Luck: https://bugzilla.proxmox.com/show_bug.cgi?id=5306 -------------- My summary: Points not debunked: - Some People have success changing PSU - Some People see the problem disappear after NVMe firmware update - Different NVMe brands affected, both with/without DRAM Uncertain points: - A proposed patch for Linux to change amount/size of buffers towards NVMe drive - Some People get stability switching to FreeBSD (TrueNAS etc.) [a] Notable observations: - Different NVMe vendors / controller brands (though, many reports about WD products) - People get RMA-returns with newer firmware - Reproducible across batches of the same product - Most offen large reads/writes (e.g. ZFS resilver) just precede the crash - Some People have thermal imaging footage showing heat buildup before crash I remain with my gut feeling that internal housekeeping in the NVMe firmware controller produces either thermal overload or power regulator overload, locking up the controller. The shutdown sometimes doesn't seem to work as engineered, as in some instances we've had the NVMe not return online after a soft-reboot but then after a cold-boot. -------------- dmesg from the latest crash with 6.9.7 when the following options are NOT used: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off [ 3718.137313] #PF: supervisor read access in kernel mode [ 3718.137321] #PF: error_code(0x0000) - not-present page [ 3718.137328] PGD 0 P4D 0 [ 3718.137333] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 3718.137364] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 6.9.7 #2 [ 3718.137372] Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI [ 3718.137293] BUG: kernel NULL pointer dereference [ 3718.137382] RIP: 0010:btrfs_clone_write_end_io+0x1e/0x60 [btrfs] [ 3718.137474] RBP: ffff93a9766f8598 R08: ffff93a925547000 R09: 0000000080080007 [ 3718.137497] FS: 0000000000000000(0000) GS:ffff93c37f400000(0000) knlGS:0000000000000000 [ 3718.137531] <IRQ> [ 3718.137549] ? __slab_free+0xdf/0x2f0 [ 3718.137434] Code: 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 40 48 89 fb 80 7f 19 00 48 8b 45 20 75 27 80 7f 10 07 74 13 <48> 8b 78 18 e8 a9 0d 39 d4 48 89 df 5b 5d e9 4f 09 39 d4 48 8b 57 [ 3718.137466] RDX: ffff93a480df4d40 RSI: ffff93a999c31900 RDI: ffff93a999c31900 [ 3718.137482] R10: 0000000080080007 R11: ffffa7ce00360ff8 R12: 0000000000000000 [ 3718.137506] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3718.137513] CR2: 0000000000000018 CR3: 00000001bddfc005 CR4: 0000000000770ef0 [ 3718.137521] PKRU: 55555554 [ 3718.137525] Call Trace: [ 3718.137536] ? __die+0x1f/0x60 [ 3718.137451] RSP: 0018:ffffa7ce00360e80 EFLAGS: 00010097 [ 3718.137458] RAX: 0000000000000000 RBX: ffff93a999c31900 RCX: ffff93a4a50c03b0 [ 3718.137489] R13: ffff93a999c31900 R14: 0000000000004000 R15: 0000000000004000 [ 3718.137543] ? page_fault_oops+0x179/0x560 [ 3718.137557] ? exc_page_fault+0x72/0x170 [ 3718.137564] ? asm_exc_page_fault+0x22/0x30 [ 3718.137620] ? kfree+0x24f/0x290 [ 3718.137642] nvme_irq+0x3e/0x80 [nvme] [ 3718.137572] ? btrfs_clone_write_end_io+0x1e/0x60 [btrfs] [ 3718.137626] blk_mq_end_request+0x18/0x30 [ 3718.137649] __handle_irq_event_percpu+0x43/0x1a0 [ 3718.137613] blk_update_request+0x110/0x470 [ 3718.137633] nvme_poll_cq+0x18f/0x360 [nvme] [ 3718.137657] handle_irq_event+0x34/0x70 [ 3718.137663] handle_edge_irq+0x87/0x220 [ 3718.137673] common_interrupt+0x7c/0xa0 [ 3718.137683] <TASK> [ 3718.137693] RIP: 0010:cpuidle_enter_state+0xc8/0x430 [ 3718.137668] __common_interrupt+0x38/0xa0 [ 3718.137680] </IRQ> [ 3718.137687] asm_common_interrupt+0x22/0x40 [ 3718.137700] Code: 4e 44 54 ff e8 a9 f1 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 27 53 53 ff 45 84 ff 0f 85 50 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 81 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d [ 3718.137724] RAX: ffff93c37f400000 RBX: ffff93c37f43f4a0 RCX: 000000000000001f [ 3718.137748] R10: 0000000000000018 R11: ffff93c37f433ce4 R12: ffffffff965a2c80 [ 3718.137771] do_idle+0x1e7/0x240 [ 3718.137783] start_secondary+0x118/0x140 [ 3718.137732] RDX: 0000000000000008 RSI: 0000000028291fdf RDI: 0000000000000000 [ 3718.137764] cpuidle_enter+0x29/0x40 [ 3718.137790] common_startup_64+0x13e/0x141 [ 3718.137718] RSP: 0018:ffffa7ce001d3e90 EFLAGS: 00000246 [ 3718.137740] RBP: 0000000000000002 R08: 0000000000000000 R09: 000000000000004e [ 3718.137755] R13: 00000361b240a0fa R14: 0000000000000002 R15: 0000000000000000 [ 3718.137777] cpu_startup_entry+0x25/0x30 [ 3718.137797] </TASK> [ 3718.137800] Modules linked in: tls bluetooth sha3_generic jitterentropy_rng drbg ansi_cprng ecdh_generic ecc crc16 xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_compat br_netfilter bridge nfsv3 nfs netfs tcp_bbr sch_fq tun netconsole nvme_fabrics overlay pps_ldisc cfg80211 8021q garp stp mrp llc lz4 lz4_compress zram zsmalloc binfmt_misc xfs nls_ascii nls_cp437 vfat fat nft_masq nft_redir nft_chain_nat nf_nat intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 kvm_intel kvm ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof aesni_intel snd_sof_utils crypto_simd snd_soc_hdac_hda cryptd snd_hda_ext_ [ 3718.137839] snd_soc_acpi snd_soc_core snd_hda_codec_realtek snd_hda_codec_generic nfnetlink_log snd_compress rapl snd_hda_scodec_component nft_log soundwire_bus snd_hda_intel mei_wdt snd_intel_dspcfg mei_hdcp intel_cstate snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss eeepc_wmi snd_mixer_oss asus_wmi snd_pcm iTCO_wdt intel_pmc_bxt snd_timer battery sd_mod sparse_keymap iTCO_vendor_support snd mei_me platform_profile ipmi_ssif rfkill pcspkr soundcore intel_uncore wmi_bmof mei watchdog acpi_ipmi cdc_acm ipmi_si ipmi_devintf joydev ipmi_msghandler intel_pmc_core intel_vsec pmt_telemetry pmt_class acpi_pad acpi_tad sg evdev nfsd parport_pc nf_tables auth_rpcgss nfs_acl ppdev lockd grace lp nfnetlink parport fuse loop efi_pstore dm_mod sunrpc configfs ip_tables x_tables autofs4 cdc_ether usbnet mii btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor uas usb_storage raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod hid_generic usbhid hid [ 3718.137957] sr_mod cdrom nvme nvme_core t10_pi ast ahci ixgbe i2c_algo_bit libahci drm_shmem_helper xhci_pci crc64_rocksoft xfrm_algo libata drm_kms_helper crc64 xhci_hcd dca mdio_devres crc_t10dif scsi_mod intel_lpss_pci crct10dif_generic i2c_i801 crc32_pclmul video usbcore libphy drm crct10dif_pclmul intel_lpss igc crc32c_intel i2c_smbus scsi_common mdio usb_common vmd wmi fan crct10dif_common idma64 pinctrl_alderlake button [ 3718.138083] CR2: 0000000000000018 [ 3718.138089] ---[ end trace 0000000000000000 ]--- [ 3741.360694] rcu: #011(detected by 25 [ 3741.360672] rcu: #0118-...!: (0 ticks this GP) idle=0d94/1/0x4000000000000004 softirq=106413/106413 fqs=11 [ 3741.360712] Sending NMI from CPU 25 to CPUs 8: [ 3741.360631] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 3741.360743] Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI [ 3741.360737] NMI backtrace for cpu 8 [ 3741.360744] RIP: 0010:memchr+0x5/0x30 [ 3741.360756] RSP: 0018:ffffa7ce003609d0 EFLAGS: 00000097 [ 3741.360739] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G D 6.9.7 #2 [ 3741.360754] Code: cc cc cc cc 48 89 fb 48 89 d8 5b 5d 41 5c 41 5d c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 01 fa eb 0c <48> 8d 47 01 40 38 37 74 0f 48 89 c7 48 39 d7 75 ef 31 c0 c3 cc cc [ 3741.360764] FS: 0000000000000000(0000) GS:ffff93c37f400000(0000) knlGS:0000000000000000 [ 3741.360760] RDX: ffff93c3fff6eb0c RSI: 000000000000000a RDI: ffff93c3fff6eae9 [ 3741.360762] R10: 00003fffffffffff R11: ffff93c3fff3c8a0 R12: 0000000000000024 [ 3741.360765] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3741.360767] PKRU: 55555554 [ 3741.360759] RAX: ffff93c3fff6eae9 RBX: 0000000000000001 RCX: 0000000000000000 [ 3741.360761] RBP: 00000000ffffe0b4 R08: fffffffffffc3320 R09: ffffa7ce00360a08 [ 3741.360763] R13: 00000000000000b4 R14: ffffffff96f0bd20 R15: ffff93c3fff6eae8 [ 3741.360766] CR2: 0000000000000018 CR3: 00000001bddfc005 CR4: 0000000000770ef0 [ 3741.360768] Call Trace: [ 3741.360770] <NMI> [ 3741.360774] ? nmi_cpu_backtrace+0x95/0x110 [ 3741.360778] ? nmi_cpu_backtrace_handler+0xd/0x20 [ 3741.360783] ? nmi_handle+0x5a/0x150 [ 3741.360787] ? default_do_nmi+0x40/0x100 [ 3741.360792] ? exc_nmi+0x11e/0x1a0 [ 3741.360794] ? end_repeat_nmi+0xf/0x53 [ 3741.360799] ? memchr+0x5/0x30 [ 3741.360801] ? memchr+0x5/0x30 [ 3741.360803] ? memchr+0x5/0x30 [ 3741.360804] </NMI> [ 3741.360805] <IRQ> [ 3741.360805] _prb_read_valid+0x1d8/0x310 [ 3741.360810] prb_read_valid_info+0x41/0x60 [ 3741.360811] find_first_fitting_seq+0xd5/0x1b0 [ 3741.360815] kmsg_dump_get_buffer+0xe8/0x1d0 [ 3741.360818] pstore_dump+0x171/0x370 [ 3741.360825] kmsg_dump+0x43/0x60 [ 3741.360827] oops_end+0x68/0xe0 [ 3741.360829] page_fault_oops+0x19d/0x560 [ 3741.360832] ? __slab_free+0xdf/0x2f0 [ 3741.360840] exc_page_fault+0x72/0x170 [ 3741.360845] asm_exc_page_fault+0x22/0x30 [ 3741.360848] RIP: 0010:btrfs_clone_write_end_io+0x1e/0x60 [btrfs] [ 3741.360952] Code: 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 40 48 89 fb 80 7f 19 00 48 8b 45 20 75 27 80 7f 10 07 74 13 <48> 8b 78 18 e8 a9 0d 39 d4 48 89 df 5b 5d e9 4f 09 39 d4 48 8b 57 [ 3741.360953] RSP: 0018:ffffa7ce00360e80 EFLAGS: 00010097 [ 3741.360954] RAX: 0000000000000000 RBX: ffff93a999c31900 RCX: ffff93a4a50c03b0 [ 3741.360954] RDX: ffff93a480df4d40 RSI: ffff93a999c31900 RDI: ffff93a999c31900 [ 3741.360955] RBP: ffff93a9766f8598 R08: ffff93a925547000 R09: 0000000080080007 [ 3741.360956] R10: 0000000080080007 R11: ffffa7ce00360ff8 R12: 0000000000000000 [ 3741.360957] R13: ffff93a999c31900 R14: 0000000000004000 R15: 0000000000004000 [ 3741.360960] blk_update_request+0x110/0x470 [ 3741.360965] ? kfree+0x24f/0x290 [ 3741.360967] blk_mq_end_request+0x18/0x30 [ 3741.360969] nvme_poll_cq+0x18f/0x360 [nvme] [ 3741.360976] nvme_irq+0x3e/0x80 [nvme] [ 3741.360980] __handle_irq_event_percpu+0x43/0x1a0 [ 3741.360983] handle_irq_event+0x34/0x70 [ 3741.360985] handle_edge_irq+0x87/0x220 [ 3741.360990] __common_interrupt+0x38/0xa0 [ 3741.360992] common_interrupt+0x7c/0xa0 [ 3741.360996] </IRQ> [ 3741.360996] <TASK> [ 3741.360997] asm_common_interrupt+0x22/0x40 [ 3741.360999] RIP: 0010:cpuidle_enter_state+0xc8/0x430 [ 3741.361000] Code: 4e 44 54 ff e8 a9 f1 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 27 53 53 ff 45 84 ff 0f 85 50 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 81 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d [ 3741.361002] RSP: 0018:ffffa7ce001d3e90 EFLAGS: 00000246 [ 3741.361003] RAX: ffff93c37f400000 RBX: ffff93c37f43f4a0 RCX: 000000000000001f [ 3741.361003] RDX: 0000000000000008 RSI: 0000000028291fdf RDI: 0000000000000000 [ 3741.361004] RBP: 0000000000000002 R08: 0000000000000000 R09: 000000000000004e [ 3741.361005] R10: 0000000000000018 R11: ffff93c37f433ce4 R12: ffffffff965a2c80 [ 3741.361005] R13: 00000361b240a0fa R14: 0000000000000002 R15: 0000000000000000 [ 3741.361007] cpuidle_enter+0x29/0x40 [ 3741.361012] do_idle+0x1e7/0x240 [ 3741.361016] cpu_startup_entry+0x25/0x30 [ 3741.361018] start_secondary+0x118/0x140 [ 3741.361021] common_startup_64+0x13e/0x141 [ 3741.361025] </TASK> [ 3742.257557] RIP: 0010:btrfs_clone_write_end_io+0x1e/0x60 [btrfs] [ 3742.279785] Code: 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 40 48 89 fb 80 7f 19 00 48 8b 45 20 75 27 80 7f 10 07 74 13 <48> 8b 78 18 e8 a9 0d 39 d4 48 89 df 5b 5d e9 4f 09 39 d4 48 8b 57 [ 3742.280179] RSP: 0018:ffffa7ce00360e80 EFLAGS: 00010097 [ 3742.280565] RAX: 0000000000000000 RBX: ffff93a999c31900 RCX: ffff93a4a50c03b0 [ 3742.280954] RDX: ffff93a480df4d40 RSI: ffff93a999c31900 RDI: ffff93a999c31900 [ 3742.281344] RBP: ffff93a9766f8598 R08: ffff93a925547000 R09: 0000000080080007 [ 3742.281744] R10: 0000000080080007 R11: ffffa7ce00360ff8 R12: 0000000000000000 [ 3742.282135] R13: ffff93a999c31900 R14: 0000000000004000 R15: 0000000000004000 [ 3742.282517] FS: 0000000000000000(0000) GS:ffff93c37f400000(0000) knlGS:0000000000000000 [ 3742.282898] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3742.283268] CR2: 0000000000000018 CR3: 00000001bddfc005 CR4: 0000000000770ef0 [ 3742.283635] PKRU: 55555554 [ 3742.283995] Kernel panic - not syncing: Fatal exception in interrupt [ 3742.284361] Kernel Offset: 0x13800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 3745.782698] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- [ 15.181985] netconsole-setup: Test log message to verify netconsole configuration. [ 15.467680] NFSD: Using nfsdcld client tracking operations. [ 15.468142] NFSD: no clients to reclaim