Hi Eugen,

Yes, the OSDs were marked as down by the mons and there were "wrongly
marked as down" messages in the logs, but the OSDs were actually down
the whole time. What I'm really looking for is a fast-fail procedure
for this kind of situation, because any manual action takes time and
can cause major incidents.

Best Regards,
Mahnoosh

On Mon, 8 Jan 2024, 11:47 Eugen Block, <eblock@xxxxxx> wrote:

> Hi,
>
> just to get a better understanding, when you write
>
> > Although the OSDs were correctly marked as down in the monitor, slow
> > ops persisted until we resolved the network issue.
>
> do you mean that the MONs marked the OSDs as down (temporarily) or did
> you do that? Because if the OSDs "flap" they would also mark
> themselves "up" all the time; this should be reflected in the OSD
> logs, something like "wrongly marked me down". Can you confirm that
> the daemons were still up and logged the "wrongly marked me down"
> messages?
> In some cases the "nodown" flag can prevent flapping OSDs, but since
> you actually had a network issue it wouldn't really help here. I would
> probably have set the noout flag and stopped the OSD daemons on the
> affected node until the issue was resolved.
>
> Regards,
> Eugen
>
> Zitat von mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx>:
>
> > Hi all,
> >
> > I hope this message finds you well. We recently encountered an issue
> > on one of our OSD servers, leading to network flapping and
> > subsequently causing significant performance degradation across our
> > entire cluster. Although the OSDs were correctly marked as down in
> > the monitor, slow ops persisted until we resolved the network issue.
> > This incident resulted in a major disruption, especially affecting
> > VMs with mapped RBD images, leading to their freezing.
> >
> > In light of this, I have two key questions for the community:
> >
> > 1. Why did slow ops persist even after marking the affected server
> > as down in the monitor?
> >
> > 2. Are there any recommended configurations for OSD suicide or OSD
> > down reports that could help us better handle similar
> > network-related issues in the future?
> >
> > Best Regards,
> > Mahnoosh
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
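
For completeness, the manual mitigation Eugen describes, as a minimal
sketch. This assumes a systemd-managed (non-cephadm) deployment; osd.12
is only a placeholder daemon name:

    # keep the mons from marking the stopped OSDs "out" and triggering recovery
    ceph osd set noout

    # on the affected node, stop all local OSD daemons
    systemctl stop ceph-osd.target

    # on a cephadm-managed cluster, stop daemons via the orchestrator instead
    ceph orch daemon stop osd.12

    # once the network problem is fixed, bring the OSDs back and clear the flag
    systemctl start ceph-osd.target
    ceph osd unset noout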
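
Regarding the question about OSD suicide and down reports: a rough
sketch of the options involved, assuming a release with the
"ceph config" interface. The values shown are the upstream defaults as
far as I remember; please verify with "ceph config help <option>"
before changing anything:

    # how often an OSD pings its peers, and how long a peer may miss
    # heartbeats before it is reported down to the mons
    ceph config set osd osd_heartbeat_interval 6
    ceph config set osd osd_heartbeat_grace 20

    # how many reporters (from distinct hosts by default) the mons need
    # before they actually mark an OSD down
    ceph config set mon mon_osd_min_down_reporters 2
    ceph config set mon mon_osd_reporter_subtree_level host

    # timeout after which a stuck OSD op thread makes the daemon abort
    ceph config set osd osd_op_thread_suicide_timeout 150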