Re: PVE CEPH OSD heartbeat show

Frank Schilder <frans@xxxxxx> · Wed, 26 Apr 2023 08:21:25 +0000

Hi Peter,

2% packet loss is a lot, especially on such expensive hardware. We observed the problems you describe with defective networking hardware, with NIC/switch ports in active-active LACP bonding mode. We had periodically failing transceivers, and these failures were not immediately detected by the host or the switch. Only when such a failure persisted over a longer period would the kernel finally report a port as down. A failing transceiver on the switch side often went entirely undetected. Ceph will report slow ping times but nothing (host, OSD) down, because packets still get through on one of the ports. It's only part of the traffic that disappears, and often small packets get through while large ones are lost.
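
To test whether small packets pass while large ones are lost, one option is to ping with several payload sizes and fragmentation prohibited, then compare the loss percentages. The following is only a sketch, assuming Linux iputils ping (flags -M do, -s, -c, -q) and IPv4 without header options; the function names and the choice of probe sizes are made up for illustration.

```python
import re
import subprocess

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes

# Matches the loss figure in the iputils summary line ("..., 2% packet loss, ...").
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def max_icmp_payload(mtu):
    """Largest ICMP payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def probe_sizes(mtu):
    """A small, a medium and the largest unfragmented payload (arbitrary choice)."""
    largest = max_icmp_payload(mtu)
    return [56, largest // 2, largest]


def ping_command(host, payload, count=100):
    """Build an iputils ping call: no fragmentation, fixed payload, summary only."""
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), "-q", host]


def parse_loss(ping_output):
    """Return the packet loss in percent, or None if no summary line is found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def loss_by_size(host, mtu, count=100):
    """Run one ping per payload size and map payload size -> loss percentage."""
    results = {}
    for payload in probe_sizes(mtu):
        proc = subprocess.run(ping_command(host, payload, count),
                              capture_output=True, text=True)
        results[payload] = parse_loss(proc.stdout)
    return results
```

Calling loss_by_size("HOST", 9000) from each node against each peer, on both the front and the back network, should show whether loss grows with packet size. With LACP a single flow usually hashes onto one port, so a bad port may only show up for some source/destination pairs.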

This proved very difficult to pin down. With later kernel versions, such behavior started to show up in ifconfig as send/receive errors. After replacing a number of transceivers we don't see this happening any more.

Things to check:

- MTU is the same on all components
- monitor link utilization to rule out bottlenecks (see Fabian's reply)
- check NIC port error counters on all hosts; they should be close to 0
- during a window in which you see long ping times, look at which hosts show up most often in the network report (ceph daemon mgr.HOST dump_osd_network) and bring interfaces down/up to see if the situation changes
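
The MTU and error-counter checks can be scripted per host. A minimal sketch, assuming Linux, where the kernel exposes the MTU under /sys/class/net/<iface>/mtu and the counters under /sys/class/net/<iface>/statistics/; the function names, the selection of counters and the threshold are my own choices for illustration.

```python
from pathlib import Path

# A subset of the per-interface counters the Linux kernel exposes in sysfs.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "rx_crc_errors")


def read_nic_health(sysfs_root="/sys/class/net"):
    """Return {iface: {"mtu": int, <counter>: int, ...}} for every interface."""
    report = {}
    for iface_dir in sorted(Path(sysfs_root).iterdir()):
        stats_dir = iface_dir / "statistics"
        if not stats_dir.is_dir():
            continue  # skips non-interface entries such as bonding_masters
        entry = {"mtu": int((iface_dir / "mtu").read_text())}
        for name in COUNTERS:
            counter_file = stats_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text())
        report[iface_dir.name] = entry
    return report


def find_problems(report, expected_mtu, max_errors=0, ignore=("lo",)):
    """List interfaces whose MTU differs or whose counters exceed max_errors."""
    problems = []
    for iface, entry in report.items():
        if iface in ignore:
            continue
        if entry["mtu"] != expected_mtu:
            problems.append(f"{iface}: mtu {entry['mtu']} != {expected_mtu}")
        for name, value in entry.items():
            if name != "mtu" and value > max_errors:
                problems.append(f"{iface}: {name}={value}")
    return problems
```

Running find_problems(read_nic_health(), expected_mtu=9000) on every host (with whatever MTU you actually use) lists the interfaces to look at first. The counters are cumulative since the interface came up, so comparing two readings taken some minutes apart says more than a single snapshot. This only covers the host side; switch ports need to be checked on the switch.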

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Fabian Grünbichler <f.gruenbichler@xxxxxxxxxxx>
Sent: Wednesday, April 26, 2023 9:42 AM
To: ceph-users@xxxxxxx; Peter
Subject: Re: PVE CEPH OSD heartbeat show

On April 25, 2023 9:03 pm, Peter wrote:
> Dear all,
>
> We are experiencing issues with Ceph after deploying it via PVE, with the network backed by a 10G Cisco switch with the vPC feature enabled. We are encountering slow OSD heartbeats and have not been able to identify any network traffic issues.
>
> Upon checking, we found that the ping is around 0.1ms, and there is occasional 2% packet loss when using flood ping, but not consistently. We also noticed a large number of UDP port 5405 packets and the 'corosync' process utilizing a significant amount of CPU.
>
> When running the 'ceph -s' command, we observed a slow OSD heartbeat on the back and front, with the longest latency being 2250.54ms. We suspect that this may be a network issue, but we are unsure of how Ceph detects such long latency. Additionally, we are wondering if a 2% packet loss can significantly affect Ceph's performance and even cause the OSD process to fail sometimes.
>
> We have heard about potential issues with RocksDB 6 causing OSD process failures, and we are curious about how to check the RocksDB version. Furthermore, we are wondering how severe packet loss and latency must be to cause OSD process crashes, and how the monitoring system determines that an OSD is offline.
>
> We would greatly appreciate any assistance or insights you could provide on these matters.
> Thanks,

Are you using separate (physical) links for Corosync and Ceph traffic?
If not, they will step on each other's toes and cause problems. Corosync
is very latency sensitive.

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_network_requirements
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx