On May 1, 2023 9:30 pm, Peter wrote:
> Hi Fabian,
>
> Thank you for your prompt response. It's crucial to understand how things work, and I appreciate your assistance.
>
> After replacing the switch for our Ceph environment, we experienced three days of normalcy before the issue recurred this morning. I noticed that the TCP in/out became unstable, and TCP errors occurred simultaneously. The UDP in/out values were 70K and 150K, respectively, while the errors peaked at around 50K per second.
>
> I reviewed the Proxmox documentation and found that it is recommended to separate the cluster network and storage network. Currently, we have more than 20 Ceph nodes across five different locations, and only one location has experienced this issue. We are fortunate that it has not happened in other areas. While we plan to separate the network soon, I was wondering if there are any temporary solutions or configurations that could limit the UDP triggering and resolve the "corosync" issue.

The only real solution is separating the links. You can try to prioritize Corosync traffic (UDP on ports 540X) on your switches to keep the links from going over the threshold where Corosync marks them as down.

Links going down could cause them to start flapping (if they are not really down, but the Corosync heartbeat is just timing out occasionally). That triggers an increased amount of traffic because of retransmits and resync operations trying to reestablish the cluster membership, which could then in turn also affect other traffic going over the same links.

> I appreciate your help in this matter and look forward to your response.
>
> Peter

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
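
As a rough illustration of what separating the links can look like on the Corosync side, here is a minimal corosync.conf sketch with a dedicated Corosync link and the shared network kept only as a fallback. It assumes Corosync 3 with kronosnet; the cluster name, node name, and all addresses are made up for illustration, and only one node entry is shown. Treat it as a starting point to compare against your own configuration, not as something to copy verbatim.

```
# Sketch only: names and addresses below are hypothetical.
totem {
  version: 2
  cluster_name: example-cluster
  ip_version: ipv4
  # passive: use the available link with the highest priority
  link_mode: passive

  interface {
    # dedicated Corosync network, preferred
    linknumber: 0
    knet_link_priority: 20
  }
  interface {
    # shared (storage) network, fallback only
    linknumber: 1
    knet_link_priority: 10
  }
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    # address on the dedicated Corosync network (link 0)
    ring0_addr: 10.10.10.1
    # address on the shared network (link 1)
    ring1_addr: 10.20.20.1
  }
  # ... one node block per cluster member
}
```

With a layout like this, Corosync traffic only moves onto the shared network when the dedicated link is marked as down, so heavy storage traffic is much less likely to push the heartbeat over its timeout.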