Hi Eugen,

Yes, the OSDs were marked as down by the mons and there were "wrongly
marked as down" messages in the logs, but the OSDs were actually down
the whole time. What I'm really looking for is a fast-fail procedure
for this kind of situation, because any manual action takes time and
can cause major incidents.

Best Regards,
Mahnoosh

On Mon, 8 Jan 2024, 11:47 Eugen Block, <eblock@xxxxxx> wrote:

> Hi,
>
> just to get a better understanding, when you write
>
> > Although the OSDs were correctly marked as down in the monitor, slow
> > ops persisted until we resolved the network issue.
>
> do you mean that the MONs marked the OSDs as down (temporarily) or did
> you do that? Because if the OSDs "flap" they would also mark
> themselves "up" all the time; this should be reflected in the OSD
> logs, something like "wrongly marked me down". Can you confirm that
> the daemons were still up and logged the "wrongly marked me down"
> messages?
> In some cases the "nodown" flag can prevent flapping OSDs, but since
> you actually had a network issue it wouldn't really help here. I would
> probably have set the noout flag and stopped the OSD daemons on the
> affected node until the issue was resolved.
>
> Regards,
> Eugen
>
> Zitat von mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx>:
>
> > Hi all,
> >
> > I hope this message finds you well. We recently encountered an issue
> > on one of our OSD servers, leading to network flapping and
> > subsequently causing significant performance degradation across our
> > entire cluster. Although the OSDs were correctly marked as down in
> > the monitor, slow ops persisted until we resolved the network issue.
> > This incident resulted in a major disruption, especially affecting
> > VMs with mapped RBD images, leading to their freezing.
> >
> > In light of this, I have two key questions for the community:
> >
> > 1. Why did slow ops persist even after marking the affected server
> > as down in the monitor?
> >
> > 2. Are there any recommended configurations for OSD suicide or OSD
> > down reports that could help us better handle similar
> > network-related issues in the future?
> >
> > Best Regards,
> > Mahnoosh
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
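
For completeness, the manual mitigation Eugen describes, as a minimal
sketch. This assumes a systemd-managed (non-cephadm) deployment; osd.12
is only a placeholder daemon name:

    # keep the mons from marking the stopped OSDs "out" and triggering recovery
    ceph osd set noout

    # on the affected node, stop all local OSD daemons
    systemctl stop ceph-osd.target

    # on a cephadm-managed cluster, stop daemons via the orchestrator instead
    ceph orch daemon stop osd.12

    # once the network problem is fixed, bring the OSDs back and clear the flag
    systemctl start ceph-osd.target
    ceph osd unset noout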
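
Regarding the question about OSD suicide and down reports: a rough
sketch of the options involved, assuming a release with the
"ceph config" interface. The values shown are the upstream defaults as
far as I remember; please verify with "ceph config help <option>"
before changing anything:

    # how often an OSD pings its peers, and how long a peer may miss
    # heartbeats before it is reported down to the mons
    ceph config set osd osd_heartbeat_interval 6
    ceph config set osd osd_heartbeat_grace 20

    # how many reporters (from distinct hosts by default) the mons need
    # before they actually mark an OSD down
    ceph config set mon mon_osd_min_down_reporters 2
    ceph config set mon mon_osd_reporter_subtree_level host

    # timeout after which a stuck OSD op thread makes the daemon abort
    ceph config set osd osd_op_thread_suicide_timeout 150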