Network Flapping Causing Slow Ops and Freezing VMs

mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx> · Sat, 6 Jan 2024 16:57:12 +0330

Hi all,

I hope this message finds you well. We recently encountered an issue on one
of our OSD servers, leading to network flapping and subsequently causing
significant performance degradation across our entire cluster. Although the
OSDs were correctly marked as down in the monitor, slow ops persisted until
we resolved the network issue. The incident caused a major disruption,
especially for VMs with mapped RBD images, which froze.

In light of this, I have two key questions for the community:

1. Why did slow ops persist even after the affected server's OSDs were
marked down in the monitor?

2. Are there any recommended configurations for OSD suicide or OSD down
reports that could help us better handle similar network-related issues in
the future?

Best Regards,
Mahnoosh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx