You didn't mention which Ceph version you're running. Assuming it's
managed by cephadm, you could put the host into maintenance mode [1],
which stops all services on that host and sets the noout flag for it
to prevent unnecessary recovery.
Once the maintenance is done, exit maintenance mode and the services
should start again. Note that all Ceph services on that host would be
stopped, MONs included.
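
For example, with "osd-node1" as a placeholder hostname:

   # stop all ceph daemons on the host and set noout for its OSDs
   ceph orch host maintenance enter osd-node1

   # ... perform the maintenance work ...

   # start the daemons again and clear the flag
   ceph orch host maintenance exit osd-node1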
[1] https://docs.ceph.com/en/latest/cephadm/host-management/
Quoting mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx>:
Hi Eugen
Yes, the OSDs were marked as down by the MONs and there were "wrongly
marked me down" messages in the logs, but the OSDs were down the whole
time. Actually, I was looking for a fast-fail procedure for this kind
of situation, because any manual action takes time and can cause major
incidents.
Best Regards,
Mahnoosh
On Mon, 8 Jan 2024, 11:47 Eugen Block, <eblock@xxxxxx> wrote:
Hi,
just to get a better understanding, when you write
> Although the OSDs were correctly marked as down in the monitor, slow
> ops persisted until we resolved the network issue.
do you mean that the MONs marked the OSDs as down (temporarily) or did
you do that? Because if the OSDs "flap" they would also mark
themselves "up" all the time, this should be reflected in the OSD
logs, something like "wrongly marked me down". Can you confirm that
the daemons were still up and logged the "wrongly marked me down"
messages?
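
To verify, you could grep the OSD logs on the affected host, e.g. (the
log path and the OSD ID are just examples, they depend on your
deployment):

   grep "wrongly marked me down" /var/log/ceph/ceph-osd.*.log
   # or on a cephadm-managed host:
   cephadm logs --name osd.12 | grep "wrongly marked me down"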
In some cases the "nodown" flag can prevent flapping OSDs, but since
you actually had a network issue, it wouldn't really help here. I
would probably have set the noout flag and stopped the OSD daemons on
the affected node until the issue was resolved.
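
That would look something like this (the OSD IDs are placeholders; on
a cephadm-managed cluster you can stop daemons via the orchestrator,
otherwise via systemd on the host):

   # prevent the cluster from marking the stopped OSDs "out"
   # and starting recovery
   ceph osd set noout

   # stop the affected OSD daemons, e.g.
   ceph orch daemon stop osd.12
   # or, on a non-cephadm host:
   systemctl stop ceph-osd@12

   # after the network issue is fixed, start the OSDs again
   # and clear the flag
   ceph osd unset noout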
Regards,
Eugen
Quoting mahnoosh shahidi <mahnooosh.shd@xxxxxxxxx>:
> Hi all,
>
> I hope this message finds you well. We recently encountered an issue
> on one of our OSD servers, leading to network flapping and
> subsequently causing significant performance degradation across our
> entire cluster. Although the OSDs were correctly marked as down in
> the monitor, slow ops persisted until we resolved the network issue.
> This incident resulted in a major disruption, especially affecting
> VMs with mapped RBD images, leading to their freezing.
>
> In light of this, I have two key questions for the community:
>
> 1. Why did slow ops persist even after marking the affected server
> as down in the monitor?
>
> 2. Are there any recommended configurations for OSD suicide or OSD
> down reports that could help us better handle similar network-related
> issues in the future?
>
> Best Regards,
> Mahnoosh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx