Re: Maintenance mode?

It sounds like all of your nodes are on a single switch, which is risky in production, for this reason and others.

If that’s the case, I suggest shutting down the cluster completely in advance, as described in the docs.
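
Roughly, the documented procedure boils down to quiescing I/O and setting the recovery-related cluster flags before powering anything off, then reversing it once everything is back. A minimal sketch, assuming the standard ceph CLI (double-check the exact sequence in the docs for your release):

    # before the switch work / powering nodes off
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause

    # ... do the maintenance and bring everything back up ...

    # once the cluster is reachable and healthy again
    ceph osd unset pause
    ceph osd unset nodown
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset norecover
    ceph osd unset noout

Note that the 'pause' flag stops client I/O entirely, so only set it once your clients can tolerate that.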

> On May 29, 2022, at 9:10 PM, Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
> 
> So, in my experience so far: if I take a switch out of service after a firmware update and reboot it, meaning all Ceph nodes lose network connectivity and can no longer talk to each other, Ceph becomes unresponsive, and my only fix up to this point has been to reboot the compute nodes one by one. Are you saying I just need to wait? I don’t know how long I’ve waited in the past, but if it’s at least 10 minutes, I probably haven’t waited that long.
> 
> Thanks
> -jeremy
> 
>> On Sunday, May 29, 2022 at 3:40 PM, Tyler Stachecki <stachecki.tyler@xxxxxxxxx> wrote:
>> Ceph always aims for high availability: unless you set cluster flags that tell it not to, it will try to self-heal.
>> 
>> Based on your description, it sounds like you want to consider the 'noout' flag. By default, after an OSD has been down for 10 minutes, Ceph begins marking the affected OSD out so that data can re-replicate elsewhere and availability is maintained.
>> 
>> Be careful, though: for latency reasons you likely still want to pre-emptively mark OSDs down ahead of the planned maintenance, and be aware of whether your replication policy leaves you in a position where an unrelated failure during the maintenance could result in inactive PGs.
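>> 
>> For reference, a minimal sketch of wrapping the maintenance window with the flag (the 10-minute default comes from mon_osd_down_out_interval, 600 seconds, if you ever want to tune it):
>> 
>>     ceph osd set noout
>>     # ... do the switch work ...
>>     ceph osd unset noout
>> 
>> While the flag is set, 'ceph status' will report it as a health warning.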
>> 
>> Cheers,
>> Tyler
>> 
>> 
>>> On Sun, May 29, 2022, 5:30 PM Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
>>> Is there a maintenance mode for Ceph that would allow me to do work on the underlying network equipment without causing Ceph to panic? In our test lab we don’t have redundant networking, and when doing switch upgrades and such, Ceph has a panic attack and we end up having to reboot the Ceph nodes anyway. Something like an HDFS-style read-only mode?
>>> 
>>> Thanks!
>>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



