Re: 'ceph orch upgrade...' causes an rbd outage on a proxmox cluster

Hi @all,

I have good news!
Indeed, by switching the datastore to kernel RBD (krbd) in the Proxmox storage configuration and by separating the monitor IPs with commas (they were separated by spaces before), the VMs no longer shut down.
I will do further testing to confirm this.
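
For reference, the relevant entry in /etc/pve/storage.cfg now looks roughly like this (the storage name, pool and monitor IPs are only placeholders for my setup; the important parts are "krbd 1" and the comma-separated monhost list):

    rbd: ceph-rbd
            content images,rootdir
            krbd 1
            monhost 192.168.1.11,192.168.1.12,192.168.1.13
            pool vm-pool
            username admin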

Pierre.

----- Original Message -----
> From: "Pierre Bellemain" <pierre.bellemain@xxxxxxxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Thursday, 2 February 2023, 18:37:19
> Subject: 'ceph orch upgrade...' causes an rbd outage on a proxmox cluster

> Hi everyone,
> (sorry for the spam, apparently I was not subscribed to the ml)
> 
> I have a Ceph test cluster and a Proxmox test cluster (to try upgrades in test
> before applying them in production).
> My Ceph cluster is made up of three servers running Debian 11, with two separate
> networks (cluster_network and public_network, in VLANs).
> It runs Ceph 16.2.10 (cephadm with Docker).
> Each server has one MGR, one MON and 8 OSDs:
>   cluster:
>     id:     xxx
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h)
>     mgr: ceph03(active, since 77m), standbys: ceph01, ceph02
>     osd: 24 osds: 24 up (since 7w), 24 in (since 6M)
> 
>   data:
>     pools:   3 pools, 65 pgs
>     objects: 29.13k objects, 113 GiB
>     usage:   344 GiB used, 52 TiB / 52 TiB avail
>     pgs:     65 active+clean
> 
>   io:
>     client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
> 
> The Proxmox cluster is also made up of 3 servers running Proxmox 7.2-7 (with the
> Proxmox Ceph Pacific packages at version 16.2.9). The Ceph storage used is RBD
> (on the Ceph public_network). I added the RBD datastores simply via the GUI.
> 
> So far so good. I have several VMs on each of the Proxmox nodes.
> 
> When I upgrade Ceph to 16.2.11, that's where things go wrong.
> I don't like it when the upgrade does everything for me without control, so I did a
> "staggered upgrade", following the official procedure
> (https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the
> version I'm starting from doesn't support staggered upgrades, I followed the
> procedure at
> https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-that-supports-staggered-upgrade-from-one-that-doesn-t.
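> In practice, the sequence I ran looks roughly like this (the MGR daemon names
> are placeholders for mine; the redeploy syntax is the one from the linked
> documentation):
> 
> sudo ceph orch daemon redeploy mgr.ceph01.xxxxxx --image quay.io/ceph/ceph:v16.2.11
> sudo ceph orch daemon redeploy mgr.ceph02.xxxxxx --image quay.io/ceph/ceph:v16.2.11
> sudo ceph mgr fail
> sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr
> 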
> When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine.
> I do the "sudo ceph mgr fail", everything is fine (it switches well to an mgr
> which was standby, so I get an MGR 16.2.11).
> However, when I do the "sudo ceph orch upgrade start --image
> quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it updates me the last MGR
> which was not updated (so far everything is still fine), but it does a last
> restart of all the MGRs to finish, and there, the proxmox visibly loses the RBD
> and turns off all my VMs.
> Here is the message in the Proxmox syslog:
> Feb 2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error'
> Feb 2 16:20:52 pmox01 QEMU[436706]: what(): Resource deadlock avoided
> Feb 2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state
> Feb 2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state
> Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded.
> Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time.
> Feb 2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102
> Feb 2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102
> 
> On the Ceph side, everything is fine: it completes the upgrade and reports that
> everything is OK at the end.
> Ceph is now on 16.2.11 and the health is OK.
> 
> When I downgrade the MGRs, I hit the problem again, and when I rerun the upgrade
> procedure, I hit it again as well; it is very reproducible.
> According to my tests, the "sudo ceph orch upgrade" command always causes the
> outage, even when trying a real staggered upgrade from and to version 16.2.11
> with the command:
> sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1
> 
> Does anyone have an idea?
> 
> Thank you, everyone!
> Pierre.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



