16.2.6 SMP NOPTI - OSD down - Node Exporter Tainted

Good day everyone,

This is becoming a bit of a recurring theme for us on a new cluster deployed
with 16.2.6 on Ubuntu 20.04.3 with the HWE stack.

We have had good stability over the past three weeks or so while copying
data, and we now have about 230M objects (470TB of 1PB used). However, one
OSD has dropped on each of the two OSD hosts currently in this cluster. The
third node is a monitor but not an OSD host at this time. Size is set to 2
for now as we're migrating from an older Nautilus cluster.

I'm seeing the following on one host:

[905905.006041] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[905905.008536] #PF: supervisor read access in kernel mode
[905905.011112] #PF: error_code(0x0000) - not-present page
[905905.013859] PGD 10f1b6067 P4D 10f1b6067 PUD 11bd2e067 PMD 0
[905905.016678] Oops: 0000 [#1] SMP NOPTI
[905905.018982] CPU: 86 PID: 69590 Comm: node_exporter Tainted: G OE     5.11.0-38-generic #42~20.04.1-Ubuntu
[905905.020582] Hardware name: Supermicro SSG-6049P-E1CR60L+/X11DSC+, BIOS 3.3 02/21/2020
[905905.022192] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[905905.023724] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[905905.026794] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
[905905.029492] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX: 0000000000000002
[905905.031813] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9ea4f46ce400
[905905.034098] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09: 0000000000000005
[905905.036360] R10: 0000000000000825 R11: 000000000000000b R12: ffff9ea4f46ce400
[905905.038564] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15: 0000000000000001
[905905.040345] FS:  000000c000507910(0000) GS:ffff9f033fb80000(0000) knlGS:0000000000000000
[905905.042034] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[905905.043660] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4: 00000000007706e0
[905905.045298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[905905.046883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[905905.048424] PKRU: 55555554
[905905.049979] Call Trace:
[905905.051907]  ? bt_iter+0x54/0x90
[905905.053815]  blk_mq_queue_tag_busy_iter+0x18b/0x2d0
[905905.055730]  ? blk_mq_hctx_mark_pending+0x70/0x70
[905905.057711]  ? blk_mq_hctx_mark_pending+0x70/0x70
[905905.059548]  blk_mq_in_flight+0x38/0x60
[905905.061409]  diskstats_show+0x75/0x2b0
[905905.063166]  seq_read_iter+0x2a3/0x450
[905905.064871]  proc_reg_read_iter+0x5e/0x80
[905905.066648]  new_sync_read+0x110/0x1a0
[905905.068276]  vfs_read+0x154/0x1b0
[905905.069879]  ksys_read+0x67/0xe0
[905905.071436]  __x64_sys_read+0x1a/0x20
[905905.072934]  do_syscall_64+0x38/0x90
[905905.074409]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[905905.075841] RIP: 0033:0x4a5c20
[905905.077230] Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2 00
00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05
<48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[905905.080065] RSP: 002b:000000c00016e8b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
[905905.081441] RAX: ffffffffffffffda RBX: 000000c000030a00 RCX: 00000000004a5c20
[905905.082878] RDX: 0000000000001000 RSI: 000000c00062c000 RDI: 0000000000000006
[905905.084299] RBP: 000000c00016e908 R08: 0000000000000000 R09: 0000000000000000
[905905.085673] R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000040
[905905.087018] R13: 0000000000000040 R14: 0000000000b6420c R15: 0000000000000000
[905905.088305] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos
jfs xfs cpuid binfmt_misc overlay cuse bonding rdma_ucm(OE) rdma_cm(OE)
iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) nls_iso8859_1 dm_multipath
scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common
isst_if_common skx_edac nfit x86_pkg_temp_thermal coretemp kvm_intel kvm
rapl ipmi_ssif intel_cstate mlx5_ib(OE) ib_uverbs(OE) ib_core(OE)
efi_pstore input_leds joydev intel_pch_thermal mei_me mei ioatdma acpi_ipmi
ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel
msr ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
libcrc32c raid0 multipath linear hid_generic usbhid hid ses enclosure
scsi_transport_sas raid1 mlx5_core(OE) crct10dif_pclmul ast drm_vram_helper
i2c_algo_bit crc32_pclmul drm_ttm_helper ttm ghash_clmulni_intel
aesni_intel drm_kms_helper syscopyarea pci_hyperv_intf sysfillrect
[905905.088448]  crypto_simd mlxdevm(OE) sysimgblt cryptd fb_sys_fops
psample glue_helper cec mlxfw(OE) rc_core ixgbe tls drm mlx_compat(OE)
xfrm_algo megaraid_sas dca mdio vmd i2c_i801 xhci_pci ahci i2c_smbus
lpc_ich xhci_pci_renesas libahci wmi
[905905.102788] CR2: 00000000000000c0
[905905.104158] ---[ end trace 2e934962aff06160 ]---
[905905.155465] RIP: 0010:blk_mq_put_rq_ref+0xa/0x60
[905905.157004] Code: 15 0f b6 d3 4c 89 e7 be 01 00 00 00 e8 cf fe ff ff 5b
41 5c 5d c3 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10
<48> 8b 80 c0 00 00 00 48 89 e5 48 3b 78 40 74 1f 4c 8d 87 e8 00 00
[905905.160064] RSP: 0018:ffffb7713ca3fb08 EFLAGS: 00010202
[905905.161691] RAX: 0000000000000000 RBX: ffffb7713ca3fb90 RCX: 0000000000000002
[905905.163258] RDX: 0000000000000001 RSI: 0000000000000206 RDI: ffff9ea4f46ce400
[905905.164868] RBP: ffffb7713ca3fb40 R08: 0000000000000000 R09: 0000000000000005
[905905.166291] R10: 0000000000000825 R11: 000000000000000b R12: ffff9ea4f46ce400
[905905.167694] R13: ffff9ea4f426ec00 R14: 0000000000000000 R15: 0000000000000001
[905905.169103] FS:  000000c000507910(0000) GS:ffff9f033fb80000(0000) knlGS:0000000000000000
[905905.170554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[905905.172001] CR2: 00000000000000c0 CR3: 000000010d2bc001 CR4: 00000000007706e0
[905905.173576] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[905905.174942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[905905.176424] PKRU: 55555554
root@prdrepoceph01:~#
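
The call trace above passes through diskstats_show and blk_mq_in_flight on a
procfs read issued by node_exporter, which points at a read of
/proc/diskstats. For reference, a minimal sketch of that kind of read,
assuming the standard /proc/diskstats column layout (major, minor, device
name, then the per-device counters, with the in-flight I/O count as the
ninth counter). The function names are hypothetical, and this is not how
node_exporter itself does it; it only exercises the same file. On an
affected kernel the read presumably goes through the same code path, so
treat it with care.

```python
def in_flight_by_device(diskstats_text):
    """Map device name -> I/Os currently in flight, from /proc/diskstats text."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Assumed layout: major, minor, name, then at least 11 counters.
        if len(fields) < 14:
            continue
        # fields[11] is the ninth counter: I/Os currently in progress.
        result[fields[2]] = int(fields[11])
    return result


def read_in_flight(path="/proc/diskstats"):
    """Read the live file; this is the read that reaches diskstats_show."""
    with open(path) as f:
        return in_flight_by_device(f.read())
```

Calling read_in_flight() on a Linux host returns a dict keyed by device name.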


I'm going to look at the OSD logs to see if I can find any smoking guns, but
I thought I would reach out in case someone has seen this before.
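
For anyone comparing against their own dmesg output, here is a minimal
sketch that pulls the faulting command, the kernel RIP symbol, and the
reliable call-trace frames out of an oops like the one above. The function
name and the regular expressions are assumptions based on the log layout
shown in this mail, not part of any existing tool, and it expects unwrapped
dmesg lines.

```python
import re

# Patterns are assumptions derived from the oops layout shown in this mail.
COMM_RE = re.compile(r"Comm:\s+(\S+)")
# Kernel-mode RIP lines use code segment 0010 and carry a symbol name.
RIP_RE = re.compile(r"RIP:\s+0010:([A-Za-z0-9_.]+)\+0x")
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def summarize_oops(text):
    """Return the command, faulting symbol and call-trace frames of an oops."""
    comm = None
    rip = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if comm is None:
            m = COMM_RE.search(line)
            if m:
                comm = m.group(1)
        if rip is None:
            m = RIP_RE.search(line)
            if m:
                rip = m.group(1)
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.match(line)
            if m is None:
                # First non-frame line ends the trace section.
                in_trace = False
                continue
            # Frames prefixed with "?" are unreliable guesses; skip them.
            if m.group(1) is None:
                frames.append(m.group(2))
    return {"comm": comm, "rip": rip, "frames": frames}
```

Feeding it the output of `dmesg` as a string gives a small dict that is
easier to compare across hosts than the raw trace.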

Thanks,
Marco
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx