Hi all,

I have a problem upgrading a Ceph cluster from Pacific to Quincy with cephadm. I successfully upgraded the cluster to the latest Pacific release (16.2.11), but when I run the following command to upgrade to 17.2.5, the upgrade stops with an "Unexpected error" after upgrading 3 of the 4 mgrs (everything is on a private network):

ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5

I also tried version 17.2.4. cephadm fails to check the hosts' status and marks them as offline:

cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG] host host4 (x.x.x.x) failed check
cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3

Here are the outputs of some commands:

[root@host8 ~]# ceph -s
  cluster:
    id:     xxx
    health: HEALTH_ERR
            9 hosts fail cephadm check
            Upgrade: failed due to an unexpected exception

  services:
    mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
    mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih, host1.warjsr, host2.qyavjj
    mds: 1/1 daemons up, 3 standby
    osd: 37 osds: 37 up (since 8h), 37 in (since 3w)

  data:

  io:
    client:

  progress:
    Upgrade to 17.2.5 (0s)
      [............................]
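In case it helps anyone reproduce this: the log excerpts above came from the cluster log, and I paused the run while digging through it. These are standard Ceph commands, roughly what I ran:

```shell
# Show recent cephadm debug messages (the "failed check" /
# "marked as offline" lines quoted above come from this channel)
ceph log last 100 debug cephadm

# Summarize the current health errors
# (CEPHADM_HOST_CHECK_FAILED, UPGRADE_EXCEPTION)
ceph health detail

# Pause the in-flight upgrade while investigating
ceph orch upgrade pause
```
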
[root@host8 ~]# ceph orch upgrade status
{
    "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "3/59 daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
    "is_paused": true
}

[root@host8 ~]# ceph cephadm check-host host7
check-host failed: Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.

[root@host8 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
    },
    "mds": {
        "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
    },
    "overall": {
        "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    }
}

The strange thing is that I can roll the cluster back by failing over to a mgr that has not been upgraded yet, like this:

ceph mgr fail
ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11

Would you happen to have any idea about this?

Best regards,
Reza
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx