Re: PVE CEPH OSD heartbeat slow

Fabian Grünbichler <f.gruenbichler@xxxxxxxxxxx> · Tue, 02 May 2023 10:16:52 +0200

On May 1, 2023 9:30 pm, Peter wrote:
> Hi Fabian,
> 
> Thank you for your prompt response. It's crucial to understand how things work, and I appreciate your assistance.
> 
> After replacing the switch for our Ceph environment, we experienced three days of normalcy before the issue recurred this morning. I noticed that the TCP in/out became unstable, and TCP errors occurred simultaneously. The UDP in/out values were 70K and 150K, respectively, while the errors peaked at around 50K per second.
> 
> I reviewed the Proxmox documentation and found that it is recommended to separate the cluster network and storage network. Currently, we have more than 20 Ceph nodes across five different locations, and only one location has experienced this issue. We are fortunate that it has not happened in other areas. While we plan to separate the network soon, I was wondering if there are any temporary solutions or configurations that could limit the UDP triggering and resolve the "corosync" issue.

the only real solution is separating the links. as a stopgap, you can
try to prioritize Corosync traffic (UDP on ports 540X) on your switches,
so that the links don't go over the threshold where Corosync marks them
as down.

if the links are not really down, but the Corosync heartbeat just times
out occasionally, they can start flapping. that triggers additional
traffic because of retransmits and resync operations trying to
reestablish the cluster membership, which in turn can also affect other
traffic going over the same links.
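
to illustrate the flapping described above, here is a toy model in
Python. it is only a sketch and not how Corosync implements link
monitoring: the function names, the parameters (`timeout_ms`,
`pong_count`) and the latency numbers are all made up for illustration.
the assumption is that a link is marked down as soon as one heartbeat
reply is late, and marked up again only after a few timely replies in a
row.

```python
def simulate_link(latencies_ms, timeout_ms, pong_count):
    """Return the link state ("up"/"down") after each heartbeat reply.

    Toy assumptions (hypothetical, not Corosync's real logic):
    - a reply slower than timeout_ms marks the link down immediately
    - pong_count consecutive timely replies mark it up again
    """
    state = "up"
    good_streak = 0
    states = []
    for latency in latencies_ms:
        if latency > timeout_ms:
            # Late reply: link goes (or stays) down, recovery starts over.
            state = "down"
            good_streak = 0
        elif state == "down":
            # Timely reply while down: count towards recovery.
            good_streak += 1
            if good_streak >= pong_count:
                state = "up"
                good_streak = 0
        states.append(state)
    return states


def count_flaps(states):
    """Count state changes, assuming the link started out "up"."""
    flaps = 0
    previous = "up"
    for current in states:
        if current != previous:
            flaps += 1
        previous = current
    return flaps


# Made-up latencies: a link shared with storage traffic sees occasional
# spikes, a dedicated link does not.
shared = [2, 3, 40, 2, 2, 35, 3, 2, 2, 2]
dedicated = [2, 3, 4, 2, 2, 3, 3, 2, 2, 2]

print(count_flaps(simulate_link(shared, timeout_ms=30, pong_count=2)))
print(count_flaps(simulate_link(dedicated, timeout_ms=30, pong_count=2)))
```

in this model two latency spikes on the shared link are enough to make
it change state four times, although it was never really down. each of
those changes would mean membership has to be renegotiated, which is the
extra traffic mentioned above.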

> I appreciate your help in this matter and look forward to your response.
> 
> Peter