Re: 16.2.6 SMP NOPTI - OSD down - Node Exporter Tainted

Hello,

I'm not sure whether this is related:
https://bugs.launchpad.net/ubuntu/+source/linux-meta-gcp-5.11/+bug/1948471

Any insight would be appreciated.

Thanks,
Marco



On Wed, Nov 17, 2021 at 9:18 AM Marco Pizzolo <marcopizzolo@xxxxxxxxx>
wrote:

> Good day everyone,
>
> This is a recurring problem for us on a new deployment running 16.2.6 on
> Ubuntu 20.04.3 with the HWE stack.
>
> We have had good stability over the past 3 weeks or so while copying data,
> and we now have about 230M objects (470TB of 1PB used). However, 1 OSD has
> dropped from each of the two OSD hosts currently in this cluster.  The
> third node is a monitor but not an OSD host at this time.  Size is set at
> 2 for now as we're migrating from an older Nautilus cluster.
>
> I'm seeing the following on one host:
>
> [905905.006041] BUG: kernel NULL pointer dereference, address: 00000000000000c0
> [905905.008536] #PF: supervisor read access in kernel mode
> [905905.011112] #PF: error_code(0x0000) - not-present page
> [905905.013859] PGD 10f1b6067 P4D 10f1b6067 PUD 11bd2e067 PMD 0
> [905905.016678] Oops: 0000 [#1] SMP NOPTI
> [905905.018982] CPU: 86 PID: 69590 Comm: node_exporter Tainted: G
>   OE     5.11.0-38-generic #42~20.04.1-Ubuntu
> [905905.020582] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS 3.3 02/21/2020
> [905905.022192] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
> [905905.023724] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff
> 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
> [905905.026794] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
> [905905.029492] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX:
> 0000000000000002
> [905905.031813] RDX: 0000000000000001 RSI: 0000000000000206 RDI:
> ffff9ea4f46ce400
> [905905.034098] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09:
> 0000000000000005
> [905905.036360] R10: 0000000000000825 R11: 000000000000000b R12:
> ffff9ea4f46ce400
> [905905.038564] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15:
> 0000000000000001
> [905905.040345] FS:  000000c000507910(0000) GS:ffff9f033fb80000(0000)
> knlGS:0000000000000000
> [905905.042034] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [905905.043660] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4:
> 00000000007706e0
> [905905.045298] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [905905.046883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [905905.048424] PKRU: 55555554
> [905905.049979] Call Trace:
> [905905.051907]  ? bt_iter+0x54/0x90
> [905905.053815]  blk_mq_queue_tag_busy_iter+0x18b/0x2d0
> [905905.055730]  ? blk_mq_hctx_mark_pending+0x70/0x70
> [905905.057711]  ? blk_mq_hctx_mark_pending+0x70/0x70
> [905905.059548]  blk_mq_in_flight+0x38/0x60
> [905905.061409]  diskstats_show+0x75/0x2b0
> [905905.063166]  seq_read_iter+0x2a3/0x450
> [905905.064871]  proc_reg_read_iter+0x5e/0x80
> [905905.066648]  new_sync_read+0x110/0x1a0
> [905905.068276]  vfs_read+0x154/0x1b0
> [905905.069879]  ksys_read+0x67/0xe0
> [905905.071436]  __x64_sys_read+0x1a/0x20
> [905905.072934]  do_syscall_64+0x38/0x90
> [905905.074409]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [905905.075841] RIP: 0033:0x4a5c20
> [905905.077230] Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2
> 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05
> <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> [905905.080065] RSP: 002b:000000c00016e8b8 EFLAGS: 00000212 ORIG_RAX:
> 0000000000000000
> [905905.081441] RAX: ffffffffffffffda RBX: 000000c000030a00 RCX:
> 00000000004a5c20
> [905905.082878] RDX: 0000000000001000 RSI: 000000c00062c000 RDI:
> 0000000000000006
> [905905.084299] RBP: 000000c00016e908 R08: 0000000000000000 R09:
> 0000000000000000
> [905905.085673] R10: 0000000000000000 R11: 0000000000000212 R12:
> 0000000000000040
> [905905.087018] R13: 0000000000000040 R14: 0000000000b6420c R15:
> 0000000000000000
> [905905.088305] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos
> jfs xfs cpuid binfmt_misc overlay cuse bonding rdma_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) nls_iso8859_1 dm_multipath
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
> isst_if_common skx_edac nfit x86_pkg_temp_thermal coretemp kvm_intel kvm
> rapl ipmi_ssif intel_cstate mlx5_ib(OE) ib_uverbs(OE) ib_core(OE)
> efi_pstore input_leds joydev intel_pch_thermal mei_me mei ioatdma acpi_ipmi
> ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel
> msr ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
> libcrc32c raid0 multipath linear hid_generic usbhid hid ses enclosure
> scsi_transport_sas raid1 mlx5_core(OE) crct10dif_pclmul ast drm_vram_helper
> i2c_algo_bit crc32_pclmul drm_ttm_helper ttm ghash_clmulni_intel
> aesni_intel drm_kms_helper syscopyarea pci_hyperv_intf sysfillrect
> [905905.088448]  crypto_simd mlxdevm(OE) sysimgblt cryptd fb_sys_fops
> psample glue_helper cec mlxfw(OE) rc_core ixgbe tls drm mlx_compat(OE)
> xfrm_algo megaraid_sas dca mdio vmd i2c_i801 xhci_pci ahci i2c_smbus
> lpc_ich xhci_pci_renesas libahci wmi
> [905905.102788] CR2: 00000000000000c0
> [905905.104158] ---[ end trace 2e934962aff06160 ]---
> [905905.155465] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
> [905905.157004] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff
> 5b 41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
> <48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
> [905905.160064] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
> [905905.161691] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX:
> 0000000000000002
> [905905.163258] RDX: 0000000000000001 RSI: 0000000000000206 RDI:
> ffff9ea4f46ce400
> [905905.164868] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09:
> 0000000000000005
> [905905.166291] R10: 0000000000000825 R11: 000000000000000b R12:
> ffff9ea4f46ce400
> [905905.167694] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15:
> 0000000000000001
> [905905.169103] FS:  000000c000507910(0000) GS:ffff9f033fb80000(0000)
> knlGS:0000000000000000
> [905905.170554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [905905.172001] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4:
> 00000000007706e0
> [905905.173576] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [905905.174942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [905905.176424] PKRU: 55555554
> root@prdrepoceph01:~#
>
>
> I'm going to look at OSD logs to see if I can find any smoking guns, but
> thought I would reach out in case someone had seen this before.
>
> Thanks,
> Marco
>
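
For anyone comparing a trace like the one above against a bug report, the
few fields that matter (faulting process, kernel function at RIP, and the
reliable call-trace frames) can be pulled out with a short script. This is
only a rough sketch: the function name summarize_oops is made up for this
example, it assumes the log is pasted as plain text (optionally with "> "
e-mail quoting), and it simply ignores continuation lines produced by mail
wrapping.

```python
import re

# A dmesg line starts with a timestamp such as "[905905.022192] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s+(.*)$")
# A call-trace frame, e.g. "? bt_iter+0x54/0x90" or "vfs_read+0x154/0x1b0".
FRAME = re.compile(r"^(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$")
PROCESS = re.compile(r"CPU: \d+ PID: (\d+) Comm: (\S+)")
# Only symbolised RIP values match; a raw user-space address does not.
RIP = re.compile(r"^RIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x")


def summarize_oops(text):
    """Return comm, pid, faulting symbol and reliable frames of an oops.

    Hypothetical helper for illustration only, not part of any Ceph or
    kernel tooling.
    """
    summary = {"comm": None, "pid": None, "rip": None, "trace": []}
    in_trace = False
    for raw in text.splitlines():
        line = re.sub(r"^>\s?", "", raw)  # drop e-mail quote marker
        m = TIMESTAMP.match(line)
        if not m:
            continue  # wrapped continuation line; skipped in this sketch
        body = m.group(1).strip()
        if body == "Call Trace:":
            in_trace = True
            continue
        if in_trace:
            frame = FRAME.match(body)
            if frame:
                # Frames prefixed with "?" are unreliable stack leftovers.
                if frame.group(1) is None:
                    summary["trace"].append(frame.group(2))
                continue
            in_trace = False  # first non-frame line ends the trace
        proc = PROCESS.search(body)
        if proc and summary["comm"] is None:
            summary["pid"] = int(proc.group(1))
            summary["comm"] = proc.group(2)
        rip = RIP.match(body)
        if rip and summary["rip"] is None:
            summary["rip"] = rip.group(1)
    return summary
```

Run against the quoted log, this would report node_exporter as the process
and blk_mq_put_rq_ref as the faulting function, reached through
diskstats_show, i.e. a read of /proc/diskstats.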
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


