Hello,

Not sure whether this is perhaps related:
https://bugs.launchpad.net/ubuntu/+source/linux-meta-gcp-5.11/+bug/1948471

Any insight would be appreciated

Thanks,
Marco

On Wed, Nov 17, 2021 at 9:18 AM Marco Pizzolo <marcopizzolo@xxxxxxxxx> wrote:

> Good day everyone,
>
> This is a bit of a recurring theme for us on a new deployment performed at
> 16.2.6 on Ubuntu 20.04.3 with HWE stack.
>
> We have had good stability over the past 3 weeks or so copying data, and
> we now have about 230M objects (470TB of 1PB used) and we have had 1 OSD
> drop from each of the two OSD hosts currently in this cluster. The third
> node is a monitor but not an OSD host at this time. Size is set at 2 for
> now as we're migrating from an older Nautilus cluster.
>
> I'm seeing the following on one host:
>
> [905905.006041] BUG: kernel NULL pointer dereference, address:
> 00000000000000c0
> [905905.008536] #PF: supervisor read access in kernel mode
> [905905.011112] #PF: error_code(0x0000) - not-present page
> [905905.013859] PGD 10f1b6067 P4D 10f1b6067 PUD 11bd2e067 PMD 0
> [905905.016678] Oops: 0000 [#1] SMP NOPTI
> [905905.018982] CPU: 86 PID: 69590 Comm: node_exporter Tainted: G
> OE 5.11.0-38-generic #42~20.04.1-Ubuntu
> [905905.020582] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS
> 3.3 02/21/2020
> [905905.022192] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
> [905905.023724] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff
> 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
> [905905.026794] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
> [905905.029492] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX:
> 0000000000000002
> [905905.031813] RDX: 0000000000000001 RSI: 0000000000000206 RDI:
> ffff9ea4f46ce400
> [905905.034098] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09:
> 0000000000000005
> [905905.036360] R10: 0000000000000825 R11: 000000000000000b R12:
> ffff9ea4f46ce400
> [905905.038564] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15:
> 0000000000000001
> [905905.040345] FS: 000000c000507910(0000) GS:ffff9f033fb80000(0000)
> knlGS:0000000000000000
> [905905.042034] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [905905.043660] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4:
> 00000000007706e0
> [905905.045298] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [905905.046883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [905905.048424] PKRU: 55555554
> [905905.049979] Call Trace:
> [905905.051907] ? bt_iter+0x54/0x90
> [905905.053815] blk_mq_queue_tag_busy_iter+0x18b/0x2d0
> [905905.055730] ? blk_mq_hctx_mark_pending+0x70/0x70
> [905905.057711] ? blk_mq_hctx_mark_pending+0x70/0x70
> [905905.059548] blk_mq_in_flight+0x38/0x60
> [905905.061409] diskstats_show+0x75/0x2b0
> [905905.063166] seq_read_iter+0x2a3/0x450
> [905905.064871] proc_reg_read_iter+0x5e/0x80
> [905905.066648] new_sync_read+0x110/0x1a0
> [905905.068276] vfs_read+0x154/0x1b0
> [905905.069879] ksys_read+0x67/0xe0
> [905905.071436] __x64_sys_read+0x1a/0x20
> [905905.072934] do_syscall_64+0x38/0x90
> [905905.074409] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [905905.075841] RIP: 0033:0x4a5c20
> [905905.077230] Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2
> 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05
> <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> [905905.080065] RSP: 002b:000000c00016e8b8 EFLAGS: 00000212 ORIG_RAX:
> 0000000000000000
> [905905.081441] RAX: ffffffffffffffda RBX: 000000c000030a00 RCX:
> 00000000004a5c20
> [905905.082878] RDX: 0000000000001000 RSI: 000000c00062c000 RDI:
> 0000000000000006
> [905905.084299] RBP: 000000c00016e908 R08: 0000000000000000 R09:
> 0000000000000000
> [905905.085673] R10: 0000000000000000 R11: 0000000000000212 R12:
> 0000000000000040
> [905905.087018] R13: 0000000000000040 R14: 0000000000b6420c R15:
> 0000000000000000
> [905905.088305] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos
> jfs xfs cpuid binfmt_misc overlay cuse bonding rdma_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) nls_iso8859_1 dm_multipath
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
> isst_if_common skx_edac nfit x86_pkg_temp_thermal coretemp kvm_intel kvm
> rapl ipmi_ssif intel_cstate mlx5_ib(OE) ib_uverbs(OE) ib_core(OE)
> efi_pstore input_leds joydev intel_pch_thermal mei_me mei ioatdma acpi_ipmi
> ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel
> msr ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
> libcrc32c raid0 multipath linear hid_generic usbhid hid ses enclosure
> scsi_transport_sas raid1 mlx5_core(OE) crct10dif_pclmul ast drm_vram_helper
> i2c_algo_bit crc32_pclmul drm_ttm_helper ttm ghash_clmulni_intel
> aesni_intel drm_kms_helper syscopyarea pci_hyperv_intf sysfillrect
> [905905.088448] crypto_simd mlxdevm(OE) sysimgblt cryptd fb_sys_fops
> psample glue_helper cec mlxfw(OE) rc_core ixgbe tls drm mlx_compat(OE)
> xfrm_algo megaraid_sas dca mdio vmd i2c_i801 xhci_pci ahci i2c_smbus
> lpc_ich xhci_pci_renesas libahci wmi
> [905905.102788] CR2: 00000000000000c0
> [905905.104158] ---[ end trace 2e934962aff06160 ]---
> [905905.155465] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
> [905905.157004] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff
> 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
> [905905.160064] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
> [905905.161691] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX:
> 0000000000000002
> [905905.163258] RDX: 0000000000000001 RSI: 0000000000000206 RDI:
> ffff9ea4f46ce400
> [905905.164868] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09:
> 0000000000000005
> [905905.166291] R10: 0000000000000825 R11: 000000000000000b R12:
> ffff9ea4f46ce400
> [905905.167694] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15:
> 0000000000000001
> [905905.169103] FS: 000000c000507910(0000) GS:ffff9f033fb80000(0000)
> knlGS:0000000000000000
> [905905.170554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [905905.172001] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4:
> 00000000007706e0
> [905905.173576] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [905905.174942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [905905.176424] PKRU: 55555554
> root@prdrepoceph01:~#
>
>
> I'm going to look at OSD logs to see if I can find any smoking guns, but
> thought I would reach out in case someone had seen this before.
>
> Thanks,
> Marco
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx