Re: Patching Ceph cluster

Yeah, we fully automated this with Ansible. In short, we do the following (a rough Ansible sketch of some of these steps follows the list).

1. Check if the cluster is healthy before continuing (via the REST API); only HEALTH_OK is good
2. Disable scrub and deep-scrub
3. Update all applications on all the hosts in the cluster
4. For every host, one by one, do the following:
4a. Check if applications got updated
4b. Check via the reboot hint if a reboot is necessary
4c. If applications got updated or a reboot is necessary, do the following:
4c1. Put the host in maintenance mode
4c2. Reboot the host if necessary
4c3. Check and wait via 'ceph orch host ls' until the status of the host is maintenance and nothing else
4c4. Take the host out of maintenance mode
4d. Check if the cluster is healthy before continuing (via the REST API); only warnings about scrub and deep-scrub are allowed, and no PGs should be degraded
5. Enable scrub and deep-scrub when all hosts are done
6. Check if the cluster is healthy (via the REST API); only HEALTH_OK is good
7. Done
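
As a rough sketch, here is roughly what steps 1 and 4c1-4c4 look like as Ansible tasks. This assumes the dashboard REST API is reachable on the active mgr; mgr_host, ceph_api_token, admin_host and reboot_required are illustrative variables, not our exact playbook:

- name: Check cluster health via REST API (step 1)
  ansible.builtin.uri:
    url: 'https://{{ mgr_host }}:8443/api/health/minimal'
    headers:
      Authorization: 'Bearer {{ ceph_api_token }}'          # token from POST /api/auth
      Accept: 'application/vnd.ceph.api.v1.0+json'
    validate_certs: false
  register: health
  failed_when: health.json.health.status != 'HEALTH_OK'
  run_once: true

- name: Put the host in maintenance mode (step 4c1)
  ansible.builtin.command:
    cmd: cephadm shell ceph orch host maintenance enter {{ inventory_hostname }}
  delegate_to: '{{ admin_host }}'
  become: true

- name: Reboot the host if necessary (step 4c2)
  ansible.builtin.reboot:
    reboot_timeout: 1800
  become: true
  when: reboot_required                                     # e.g. from the reboot hint in 4b

- name: Wait until 'ceph orch host ls' shows only maintenance (step 4c3)
  ansible.builtin.shell:
    cmd: cephadm shell ceph orch host ls | grep {{ inventory_hostname }}
  register: host_ls
  delegate_to: '{{ admin_host }}'
  become: true
  until: "'Maintenance' in host_ls.stdout"                  # status column text may vary by release
  retries: 30
  delay: 10

- name: Take the host out of maintenance mode (step 4c4)
  ansible.builtin.command:
    cmd: cephadm shell ceph orch host maintenance exit {{ inventory_hostname }}
  delegate_to: '{{ admin_host }}'
  become: true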

For upgrading the OS we have something similar, but exiting maintenance mode is broken (with 17.2.7) :(
I need to check the tracker for similar issues, and if I can't find anything, I'll create a ticket.

Kind regards, 
Sake 

> On 12-06-2024 19:02 CEST, Daniel Brown <daniel.h.brown@thermify.cloud> wrote:
> 
>  
> I have two Ansible roles, one for enter, one for exit. There are likely better ways to do this — and I’ll not be surprised if someone here lets me know. They’re using orch commands via the cephadm shell. I’m using Ansible for other configuration management in my environment as well, including setting up clients of the ceph cluster. 
> 
> 
> Below are excerpts from main.yml in the “tasks” for the enter/exit roles. The host I’m running Ansible from is one of my Ceph servers - I’ve limited which processes run there though, so it’s in the cluster but not equal to the others. 
> 
> 
> —————
> Enter
> —————
> 
> - name: Ceph Maintenance Mode Enter
>   shell:
>     cmd: 'cephadm shell ceph orch host maintenance enter {{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }} --force --yes-i-really-mean-it'
>   become: True
> 
> 
> 
> —————
> Exit
> ————— 
> 
> 
> - name: Ceph Maintenance Mode Exit
>   shell:
>     cmd: 'cephadm shell ceph orch host maintenance exit {{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}'
>   become: True
>   connection: local
> 
> 
> - name: Wait for Ceph to be available
>   ansible.builtin.wait_for:
>     delay: 60
>     host: '{{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}'
>     port: 9100
>   connection: local
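> 
> For what it’s worth, a minimal sketch of how roles like these could be wired into a play that walks hosts one at a time (the play and role names here are illustrative, not my actual playbook):
> 
> - name: Patch Ceph cluster hosts
>   hosts: ceph_servers
>   serial: 1                      # only one host in maintenance at a time
>   roles:
>     - ceph_maintenance_enter     # hypothetical role containing the Enter task above
>     - patch_and_reboot           # hypothetical role doing the updates and reboot
>     - ceph_maintenance_exit      # hypothetical role containing the Exit task above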
> 
> 
> 
> 
> 
> 
> > On Jun 12, 2024, at 11:28 AM, Michael Worsham <mworsham@xxxxxxxxxxxxxxxxxx> wrote:
> > 
> > Interesting. How do you set this "maintenance mode"? If you have a series of documented steps that you could provide as an example, that would be beneficial for my efforts.
> > 
> > We are in the process of standing up both a dev-test environment consisting of 3 Ceph servers (strictly for testing purposes) and a new production environment consisting of 20+ Ceph servers.
> > 
> > We are using Ubuntu 22.04.
> > 
> > -- Michael
> > From: Daniel Brown <daniel.h.brown@thermify.cloud>
> > Sent: Wednesday, June 12, 2024 9:18 AM
> > To: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > Cc: Michael Worsham <mworsham@xxxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> > Subject: Re:  Patching Ceph cluster
> > 
> > 
> > There’s also a maintenance mode that you can set for each server as you’re doing updates, so that the cluster doesn’t try to move data from affected OSDs while the server being updated is offline or down. I’ve worked some on automating this with Ansible, but have found my process (and/or my cluster) still requires some manual intervention while it’s running to get things done cleanly.
> > 
> > 
> > 
> > > On Jun 12, 2024, at 8:49 AM, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> > >
> > > Do you mean patching the OS?
> > >
> > > If so, easy -- one node at a time, then after it comes back up, wait until all PGs are active+clean and the mon quorum is complete before proceeding.
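> > >
> > > As a minimal sketch, that wait could be automated by polling 'ceph health' until it reports HEALTH_OK, a blunt but simple proxy for "all PGs active+clean and quorum complete" (illustrative task, not a complete playbook):
> > >
> > > - name: Wait for the cluster to settle
> > >   ansible.builtin.command:
> > >     cmd: ceph health
> > >   register: ceph_health
> > >   until: "'HEALTH_OK' in ceph_health.stdout"
> > >   retries: 90
> > >   delay: 10
> > >   changed_when: false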
> > >
> > >
> > >
> > >> On Jun 12, 2024, at 07:56, Michael Worsham <mworsham@xxxxxxxxxxxxxxxxxx> wrote:
> > >>
> > >> What is the proper way to patch a Ceph cluster and reboot the servers in said cluster if a reboot is necessary for said updates? And is it possible to automate it via Ansible?
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



