Thanks for the replies. It feels to me that cephadm should handle this case, since it offers the maintenance function. Right now I have a simple version of a playbook that just sets noout, patches the OS, reboots and unsets noout (similar to https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/untested-by-ci/cluster-maintenance.yml), and a different version that attempts the host maintenance but fails on the instance that is running the mgr. If I get anywhere with detecting that the instance is the active manager and handling that in Ansible, I will reply back here.
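For reference, the simple version looks roughly like this. It's a trimmed-down sketch rather than exactly what I run: the "ceph_hosts" group name, the apt module and the timeout/retry values are assumptions about my inventory and distro, so adjust to taste.

---
- hosts: ceph_hosts
  serial: 1                   # one instance at a time
  become: true
  tasks:
    - name: Set noout so OSDs are not marked out while the host is down
      ansible.builtin.command: cephadm shell -- ceph osd set noout

    - name: Patch the OS
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true

    - name: Reboot and wait for the host to come back
      ansible.builtin.reboot:
        reboot_timeout: 1800

    - name: Wait until every OSD is back up before moving on
      ansible.builtin.command: cephadm shell -- ceph osd stat --format json
      register: osd_stat
      changed_when: false
      until: (osd_stat.stdout | from_json).num_up_osds == (osd_stat.stdout | from_json).num_osds
      retries: 30
      delay: 10

    - name: Unset noout again
      ansible.builtin.command: cephadm shell -- ceph osd unset noout

The until loop is what keeps the play from moving on to the next host before the OSDs have rejoined.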
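For the maintenance-mode version, the missing piece is failing the mgr off the host first, as Adam describes below. Something like the following is what I have in mind for the detection part (untested; it assumes the active_name reported by "ceph mgr stat" starts with the short hostname, which is what cephadm appears to use for daemon names):

    - name: Ask the cluster which mgr is currently active
      ansible.builtin.command: cephadm shell -- ceph mgr stat --format json
      register: mgr_stat
      changed_when: false

    - name: Fail the mgr away from this host before entering maintenance
      ansible.builtin.command: >-
        cephadm shell -- ceph mgr fail {{ (mgr_stat.stdout | from_json).active_name }}
      when: (mgr_stat.stdout | from_json).active_name.split('.') | first == ansible_hostname

The idea is to run these right before the "ceph orch host maintenance enter" task; a short pause afterwards, to give a standby time to take over, would probably also be sensible.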
Cheers,
Steven Goodliff

________________________________
From: Robert Gallop <robert.gallop@xxxxxxxxx>
Sent: 13 July 2022 16:55
To: Adam King
Cc: Steven Goodliff; ceph-users@xxxxxxx
Subject: Re: Re: cephadm host maintenance

This brings up a good follow-on: rebooting in general for OS patching.

I have not been leveraging the maintenance mode function, as I found it was really no different from just setting noout and doing the reboot. I find that if the box is the active manager, the failover happens quickly, painlessly and automatically. All the OSDs just show as missing and come back once the box is back from the reboot. Am I causing issues I may not be aware of? How is everyone handling patching reboots?

The only place I'm careful is the active MDS nodes, since that failover does cause a period of no I/O for the mounted clients. I generally fail that one manually, so I can ensure I don't have to wait for the MDS to figure out an instance is gone and spin up a standby.

Any tips or techniques until there is a more holistic approach?

Thanks!

On Wed, Jul 13, 2022 at 9:49 AM Adam King <adking@xxxxxxxxxx> wrote:

Hello Steven,

Arguably, it should, but right now nothing is implemented to do so, and you'd have to manually run "ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw" before it would allow you to put the host in maintenance. It's non-trivial from a technical point of view to have it do the switch automatically, since the cephadm instance is running on that active mgr: it would have to store somewhere that we wanted this host in maintenance, fail over the mgr itself, and then have the new cephadm instance pick up that we wanted the host in maintenance and do so. Possible, but not something anyone has had a chance to implement.

FWIW, I do believe there are also plans to eventually have a playbook for a rolling reboot or something of the sort added to https://github.com/ceph/cephadm-ansible. But for now, I think some sort of intervention to cause the failover to happen before running the maintenance enter command is necessary.

Regards,
- Adam King

On Wed, Jul 13, 2022 at 11:02 AM Steven Goodliff <Steven.Goodliff@xxxxxxxxxxxxxxx> wrote:

> Hi,
>
> I'm trying to reboot a ceph cluster one instance at a time by running an
> Ansible playbook which basically runs
>
>     cephadm shell ceph orch host maintenance enter <hostname>
>
> and then reboots the instance and exits the maintenance, but I get
>
>     ALERT: Cannot stop active Mgr daemon, Please switch active Mgrs with
>     'ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw'
>
> on one instance. Should cephadm handle the switch?
>
> Thanks,
> Steven Goodliff
> Global Relay

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx