Good day everyone,

This is a bit of a recurring theme for us on a new deployment, installed at 16.2.6 on Ubuntu 20.04.3 with the HWE stack. We have had good stability over the past 3 weeks or so while copying data, and we now have about 230M objects (470TB of 1PB used). However, one OSD has now dropped out on each of the two OSD hosts currently in this cluster. The third node is a monitor but not an OSD host at this time. Size is set to 2 for now, as we're migrating from an older Nautilus cluster.

I'm seeing the following on one host:

[905905.006041] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[905905.008536] #PF: supervisor read access in kernel mode
[905905.011112] #PF: error_code(0x0000) - not-present page
[905905.013859] PGD 10f1b6067 P4D 10f1b6067 PUD 11bd2e067 PMD 0
[905905.016678] Oops: 0000 [#1] SMP NOPTI
[905905.018982] CPU: 86 PID: 69590 Comm: node_exporter Tainted: G OE 5.11.0-38-generic #42~20.04.1-Ubuntu
[905905.020582] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS 3.3 02/21/2020
[905905.022192] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[905905.023724] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[905905.026794] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
[905905.029492] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX: 0000000000000002
[905905.031813] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9ea4f46ce400
[905905.034098] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09: 0000000000000005
[905905.036360] R10: 0000000000000825 R11: 000000000000000b R12: ffff9ea4f46ce400
[905905.038564] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15: 0000000000000001
[905905.040345] FS: 000000c000507910(0000) GS:ffff9f033fb80000(0000) knlGS:0000000000000000
[905905.042034] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[905905.043660] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4: 00000000007706e0
[905905.045298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[905905.046883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[905905.048424] PKRU: 55555554
[905905.049979] Call Trace:
[905905.051907] ? bt_iter+0x54/0x90
[905905.053815] blk_mq_queue_tag_busy_iter+0x18b/0x2d0
[905905.055730] ? blk_mq_hctx_mark_pending+0x70/0x70
[905905.057711] ? blk_mq_hctx_mark_pending+0x70/0x70
[905905.059548] blk_mq_in_flight+0x38/0x60
[905905.061409] diskstats_show+0x75/0x2b0
[905905.063166] seq_read_iter+0x2a3/0x450
[905905.064871] proc_reg_read_iter+0x5e/0x80
[905905.066648] new_sync_read+0x110/0x1a0
[905905.068276] vfs_read+0x154/0x1b0
[905905.069879] ksys_read+0x67/0xe0
[905905.071436] __x64_sys_read+0x1a/0x20
[905905.072934] do_syscall_64+0x38/0x90
[905905.074409] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[905905.075841] RIP: 0033:0x4a5c20
[905905.077230] Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[905905.080065] RSP: 002b:000000c00016e8b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
[905905.081441] RAX: ffffffffffffffda RBX: 000000c000030a00 RCX: 00000000004a5c20
[905905.082878] RDX: 0000000000001000 RSI: 000000c00062c000 RDI: 0000000000000006
[905905.084299] RBP: 000000c00016e908 R08: 0000000000000000 R09: 0000000000000000
[905905.085673] R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000040
[905905.087018] R13: 0000000000000040 R14: 0000000000b6420c R15: 0000000000000000
[905905.088305] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid binfmt_misc overlay cuse bonding rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit x86_pkg_temp_thermal coretemp kvm_intel kvm rapl ipmi_ssif intel_cstate mlx5_ib(OE)
ib_uverbs(OE) ib_core(OE) efi_pstore input_leds joydev intel_pch_thermal mei_me mei ioatdma acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel msr ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid ses enclosure scsi_transport_sas raid1 mlx5_core(OE) crct10dif_pclmul ast drm_vram_helper i2c_algo_bit crc32_pclmul drm_ttm_helper ttm ghash_clmulni_intel aesni_intel drm_kms_helper syscopyarea pci_hyperv_intf sysfillrect
[905905.088448] crypto_simd mlxdevm(OE) sysimgblt cryptd fb_sys_fops psample glue_helper cec mlxfw(OE) rc_core ixgbe tls drm mlx_compat(OE) xfrm_algo megaraid_sas dca mdio vmd i2c_i801 xhci_pci ahci i2c_smbus lpc_ich xhci_pci_renesas libahci wmi
[905905.102788] CR2: 00000000000000c0
[905905.104158] ---[ end trace 2e934962aff06160 ]---
[905905.155465] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[905905.157004] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[905905.160064] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
[905905.161691] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX: 0000000000000002
[905905.163258] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9ea4f46ce400
[905905.164868] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09: 0000000000000005
[905905.166291] R10: 0000000000000825 R11: 000000000000000b R12: ffff9ea4f46ce400
[905905.167694] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15: 0000000000000001
[905905.169103] FS: 000000c000507910(0000) GS:ffff9f033fb80000(0000) knlGS:0000000000000000
[905905.170554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[905905.172001] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4: 00000000007706e0
[905905.173576] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[905905.174942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[905905.176424] PKRU: 55555554
root@prdrepoceph01:~#

I'm going to look through the OSD logs for any smoking guns, but thought I would reach out in case someone has seen this before.

Thanks,
Marco
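
P.S. For anyone comparing this against their own traces: below is a minimal sketch, assuming the dmesg output has been saved as plain text with one message per line, that pulls the function names out of the "Call Trace:" section. The helper name call_trace_frames and the SAMPLE excerpt (a few lines copied from the trace above) are purely illustrative; this is not part of any Ceph or kernel tooling. Frames the kernel prefixes with "?" are flagged as unverified, since the unwinder could not confirm them.

```python
import re

# Leading dmesg timestamp such as "[905905.051907] ".
STAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# A stack frame such as "? bt_iter+0x54/0x90" or "vfs_read+0x154/0x1b0".
FRAME_RE = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_trace_frames(dmesg_text):
    """Return (symbol, verified) pairs from the first 'Call Trace:' section.

    'verified' is False for frames the kernel printed with a leading '?'.
    """
    frames = []
    in_trace = False
    for raw in dmesg_text.splitlines():
        line = STAMP_RE.sub("", raw).strip()
        if not in_trace:
            # Skip everything until the trace header is seen.
            in_trace = line.startswith("Call Trace:")
            continue
        match = FRAME_RE.fullmatch(line)
        if match is None:
            break  # the first non-frame line ends the trace
        frames.append((match.group(2), match.group(1) is None))
    return frames


# Hypothetical excerpt, copied from the trace in this post.
SAMPLE = """\
[905905.048424] PKRU: 55555554
[905905.049979] Call Trace:
[905905.051907] ? bt_iter+0x54/0x90
[905905.053815] blk_mq_queue_tag_busy_iter+0x18b/0x2d0
[905905.059548] blk_mq_in_flight+0x38/0x60
[905905.061409] diskstats_show+0x75/0x2b0
[905905.075841] RIP: 0033:0x4a5c20
"""

if __name__ == "__main__":
    for symbol, verified in call_trace_frames(SAMPLE):
        print(symbol if verified else symbol + " (unverified)")
```

Run against the full trace above, this lists the path from the read syscall through diskstats_show and blk_mq_in_flight down to blk_mq_queue_tag_busy_iter, which makes it easier to search for matching reports.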