Hello Mahnoosh,

Just to double check, can you confirm that you are NOT using a
physically separate cluster network and public network? A configuration
with such physically separate networks is inherently vulnerable to OSD
flapping and therefore cannot be recommended. VLANs on the same physical
interface are probably acceptable, but I have never seen a cluster
configured like this.

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds

On Sat, Jan 6, 2024 at 9:28 PM mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx> wrote:
>
> Hi all,
>
> I hope this message finds you well. We recently encountered an issue on one
> of our OSD servers, leading to network flapping and subsequently causing
> significant performance degradation across our entire cluster. Although the
> OSDs were correctly marked as down in the monitor, slow ops persisted until
> we resolved the network issue. This incident resulted in a major
> disruption, especially affecting VMs with mapped RBD images, leading to
> their freezing.
>
> In light of this, I have two key questions for the community:
>
> 1. Why did slow ops persist even after marking the affected server as down
> in the monitor?
>
> 2. Are there any recommended configurations for OSD suicide or OSD down
> reports that could help us better handle similar network-related issues in
> the future?
>
> Best Regards,
> Mahnoosh
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Alexander E. Patrakov