Hi Jeremy,

I think there is a maintenance mode for Ceph. Maybe check this
<https://docs.ceph.com/en/latest/dev/cephadm/host-maintenance/> out, or
this <https://docs.mirantis.com/mcp/q4-18/mcp-operations-guide/openstack-operations/ceph-operations/shut-down-ceph-cluster.html>
could help too.

Thanks,
Dhairya

On Mon, May 30, 2022 at 9:41 AM Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
> So in my experience so far, if I take out a switch after a firmware update
> and a reboot of the switch, meaning all Ceph nodes lose network
> connectivity and communication with each other, Ceph becomes unresponsive,
> and my only fix up to this point has been to reboot the compute nodes one
> by one. Are you saying I just need to wait? I don’t know how long I’ve
> waited in the past, but if you’re saying at least 10 minutes, I probably
> haven’t waited that long.
>
> Thanks
> -jeremy
>
>
> On Sunday, May 29, 2022 at 3:40 PM, Tyler Stachecki
> <stachecki.tyler@xxxxxxxxx> wrote:
> > Ceph always aims to provide high availability, so if you do not set
> > cluster flags that prevent Ceph from trying to self-heal, it will
> > self-heal.
> >
> > Based on your description, it sounds like you want to consider the
> > 'noout' flag. By default, after 10(?) minutes of an OSD being down,
> > Ceph will begin the process of marking the affected OSD out to ensure
> > high availability.
> >
> > But be careful as far as latency goes -- you likely still want to
> > pre-emptively mark OSDs down ahead of the planned maintenance for
> > latency purposes, and you must be cognisant of whether or not your
> > replication policy puts you in a position where an unrelated failure
> > during the maintenance can result in inactive PGs.
> >
> > Cheers,
> > Tyler
> >
> >
> > On Sun, May 29, 2022, 5:30 PM Jeremy Hansen <jeremy@xxxxxxxxxx> wrote:
> > > Is there a maintenance mode for Ceph that would allow me to do work
> > > on underlying network equipment without causing Ceph to panic? In
> > > our test lab, we don’t have redundant networking, and when doing
> > > switch upgrades and such, Ceph has a panic attack and we end up
> > > having to reboot Ceph nodes anyway. Like an HDFS-style read-only
> > > mode or something?
> > >
> > > Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
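[Editor's note: for reference, the two approaches discussed in this thread can be sketched roughly as follows. The hostname `host1` is a placeholder, and the exact `ceph orch` behavior depends on your release; the "10(?) minutes" Tyler mentions corresponds to the `mon_osd_down_out_interval` option, which defaults to 600 seconds.]

```shell
# Option 1: cephadm host maintenance mode (per-host), as in the
# docs.ceph.com link above. "host1" is a placeholder hostname.
ceph orch host maintenance enter host1
# ... perform the switch/firmware work ...
ceph orch host maintenance exit host1

# Option 2: cluster-wide flags, as Tyler suggests. These stop Ceph
# from marking down OSDs "out" and from starting recovery/rebalance
# while connectivity is lost.
ceph osd set noout
ceph osd set norebalance
# ... perform the switch/firmware work, then clear the flags ...
ceph osd unset norebalance
ceph osd unset noout
```

Note that with `noout` set, down OSDs are never marked out, so degraded PGs will not re-replicate until the flag is cleared; clear it promptly after the maintenance window.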