Hi everyone,

I have a Ceph test cluster and a Proxmox test cluster (to try upgrades in test before applying them in production).

My Ceph cluster is made up of three servers running Debian 11, with two separate networks (cluster_network and public_network, in VLANs). It runs Ceph 16.2.10 (cephadm with Docker). Each server has one MGR, one MON and 8 OSDs.

  cluster:
    id:     xxx
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 2h)
    mgr: ceph03(active, since 77m), standbys: ceph01, ceph02
    osd: 24 osds: 24 up (since 7w), 24 in (since 6M)

  data:
    pools:   3 pools, 65 pgs
    objects: 29.13k objects, 113 GiB
    usage:   344 GiB used, 52 TiB / 52 TiB avail
    pgs:     65 active+clean

  io:
    client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr

The Proxmox cluster is also made up of 3 servers, running Proxmox 7.2-7. The Ceph storage is used as RBD (over the Ceph public_network). I added the RBD datastores simply via the GUI.

So far so good. I have several VMs on each of the Proxmox nodes.

Things go wrong when I update Ceph to 16.2.11. I don't like it when the update does everything for me without any control, so I did a "staggered upgrade", following the official procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#staggered-upgrade). As the version I'm starting from doesn't support staggered upgrades, I follow this procedure (https://docs.ceph.com/en/pacific/cephadm/upgrade/#upgrading-to-a-version-that-supports-staggered-upgrade-from-one-that-doesn-t).

When I do the "ceph orch redeploy" of the two standby MGRs, everything is fine. When I do the "sudo ceph mgr fail", everything is fine (it fails over correctly to an MGR that was standby, so I get an active MGR on 16.2.11).

However, when I run "sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr", it updates the last MGR that had not been updated yet (so far everything is still fine), but then it does a final restart of all the MGRs to finish, and at that point Proxmox apparently loses the RBD storage and shuts down all my VMs.
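For anyone trying to reproduce this, the sequence described above can be sketched as a small shell script. This is only an illustration under a few assumptions, not a transcript of what was typed: the MGR daemon names are placeholders (the real ones come from "ceph orch ps --daemon-type mgr"), the redeploy step uses the "ceph orch daemon redeploy <name> --image <image>" form shown in the linked documentation, and the DRY_RUN variable and run helper are hypothetical additions so that the script only prints the commands by default. The final "ceph orch upgrade status" call is also an addition, to watch progress.

```shell
#!/bin/sh
# Sketch of the MGR-only upgrade sequence described in this post.
# ASSUMPTIONS: the daemon names below are placeholders, not real ones;
# DRY_RUN and run() are illustrative helpers, not part of Ceph.

IMAGE="quay.io/ceph/ceph:v16.2.11"
STANDBY_MGRS="${STANDBY_MGRS:-mgr.ceph01.aaaaaa mgr.ceph02.bbbbbb}"  # placeholder names
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = really run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_mgrs() {
    # 1. Redeploy each standby MGR on the new image.
    for daemon in $STANDBY_MGRS; do
        run sudo ceph orch daemon redeploy "$daemon" --image "$IMAGE"
    done

    # 2. Fail the active MGR so that an already-upgraded standby takes over.
    run sudo ceph mgr fail

    # 3. Upgrade the remaining MGR. This is the step after which all MGRs
    #    are restarted and the VMs go down.
    run sudo ceph orch upgrade start --image "$IMAGE" --daemon-types mgr

    # 4. Check how far the upgrade has progressed.
    run sudo ceph orch upgrade status
}

upgrade_mgrs
```

Run as is, the script only prints the five commands. Setting DRY_RUN=0 and STANDBY_MGRS to the real standby daemon names would execute them against the cluster.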
Here is the message in the Proxmox syslog:

  Feb 2 16:20:52 pmox01 QEMU[436706]: terminate called after throwing an instance of 'std::system_error'
  Feb 2 16:20:52 pmox01 QEMU[436706]: what(): Resource deadlock avoided
  Feb 2 16:20:52 pmox01 kernel: [17038607.686686] vmbr0: port 2(tap102i0) entered disabled state
  Feb 2 16:20:52 pmox01 kernel: [17038607.779049] vmbr0: port 2(tap102i0) entered disabled state
  Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Succeeded.
  Feb 2 16:20:52 pmox01 systemd[1]: 102.scope: Consumed 43.136s CPU time.
  Feb 2 16:20:53 pmox01 qmeventd[446872]: Starting cleanup for 102
  Feb 2 16:20:53 pmox01 qmeventd[446872]: Finished cleanup for 102

On the Ceph side, everything looks fine: it performs the update and reports that everything is OK at the end. Ceph is now on 16.2.11 and the health is OK.

When I downgrade the MGRs again, the problem comes back, and when I run the upgrade procedure again, it comes back as well. It is very reproducible.

According to my tests, the "sudo ceph orch upgrade" command always causes the problem, even when I try a real staggered upgrade from 16.2.11 to 16.2.11 with the command:

  sudo ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11 --daemon-types mgr --hosts ceph01 --limit 1

Does anyone have an idea?

Thank you, everyone!

Pierre.