Hi, just to get a better understanding, when you write
Although the OSDs were correctly marked as down in the monitor, slow ops persisted until we resolved the network issue.
do you mean that the MONs marked the OSDs as down (temporarily), or did you do that yourself? If the OSDs "flap", they would also keep marking themselves "up" again, and this should be reflected in the OSD logs with something like "wrongly marked me down". Can you confirm that the daemons were still up and logged the "wrongly marked me down" messages? In some cases the "nodown" flag can prevent flapping OSDs, but since you actually had a network issue it wouldn't really have helped here. I would probably have set the noout flag and stopped the OSD daemons on the affected node until the issue was resolved.
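To make the log check above concrete, here is a minimal sketch that counts how many lines in an OSD log contain the "wrongly marked me down" text. It only assumes that this substring appears somewhere in the line. The function name `count_flap_messages` and the sample lines are hypothetical and are not real Ceph log output.

```python
from typing import Iterable

# Substring quoted in the message above; nothing else about the
# log line format is assumed.
NEEDLE = "wrongly marked me down"


def count_flap_messages(lines: Iterable[str], needle: str = NEEDLE) -> int:
    """Return how many lines contain the flap marker substring."""
    return sum(1 for line in lines if needle in line)


# Hypothetical sample lines, for illustration only.
sample = [
    "osd.12 example line: wrongly marked me down",
    "osd.12 example line: unrelated message",
    "osd.12 example line: wrongly marked me down again",
]

if __name__ == "__main__":
    print(count_flap_messages(sample))
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on the deployment and is not assumed here. A count that keeps growing while the network problem lasts would suggest the daemons stayed up and kept rejoining.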
Regards,
Eugen

Quoting mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx>:
Hi all,

I hope this message finds you well. We recently encountered an issue on one of our OSD servers, leading to network flapping and subsequently causing significant performance degradation across our entire cluster. Although the OSDs were correctly marked as down in the monitor, slow ops persisted until we resolved the network issue. This incident resulted in a major disruption, especially affecting VMs with mapped RBD images, leading to their freezing.

In light of this, I have two key questions for the community:

1. Why did slow ops persist even after marking the affected server as down in the monitor?
2. Are there any recommended configurations for OSD suicide or OSD down reports that could help us better handle similar network-related issues in the future?

Best Regards,
Mahnoosh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx