Does "ceph health detail" give any insight into what the unexpected exception was? If not, I'm pretty confident some traceback would end up being logged. You could probably still grab it with "ceph log last 200 info cephadm" if not a lot else has happened since. Also, we need to find out whether check-host is failing because the check on the host itself fails, or because cephadm can't connect to the host. You could put a copy of the cephadm binary on one of the hosts and run "cephadm check-host --expect-hostname <hostname>", where <hostname> is the name cephadm knows the host by. If that isn't the issue, I'd expect it's a connection problem; try working through https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors. Cephadm changed its backend SSH library between Pacific and Quincy, because the library used in Pacific is no longer supported, so it's possible some general SSH error has popped up in your environment as a result.

On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:

> Hi all,
>
> I have a problem upgrading a Ceph cluster from Pacific to Quincy with
> cephadm. I successfully upgraded the cluster to the latest Pacific
> (16.2.11), but when I run the following command to upgrade to 17.2.5,
> the upgrade process stops with "Unexpected error" after upgrading 3/4
> mgrs. (Everything is on a private network.)
>
> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>
> I also tried version 17.2.4.
>
> cephadm fails to check the hosts' status and marks them as offline:
>
> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG] host host4 (x.x.x.x) failed check
> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3
>
> Here are some outputs of the commands:
>
> [root@host8 ~]# ceph -s
>   cluster:
>     id:     xxx
>     health: HEALTH_ERR
>             9 hosts fail cephadm check
>             Upgrade: failed due to an unexpected exception
>
>   services:
>     mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>     mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih, host1.warjsr, host2.qyavjj
>     mds: 1/1 daemons up, 3 standby
>     osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>
>   data:
>
>   io:
>     client:
>
>   progress:
>     Upgrade to 17.2.5 (0s)
>       [............................]
>
> [root@host8 ~]# ceph orch upgrade status
> {
>     "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>     "in_progress": true,
>     "which": "Upgrading all daemon types on all hosts",
>     "services_complete": [],
>     "progress": "3/59 daemons upgraded",
>     "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
>     "is_paused": true
> }
>
> [root@host8 ~]# ceph cephadm check-host host7
> check-host failed:
> Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
> [root@host8 ~]# ceph versions
> {
>     "mon": {
>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>     },
>     "osd": {
>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>     },
>     "mds": {
>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>     },
>     "overall": {
>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>     }
> }
>
> The strange thing is that I can roll the cluster back by failing over to a
> not-yet-upgraded mgr, like this:
>
> ceph mgr fail
> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>
> Would you happen to have any idea about this?
>
> Best regards,
> Reza
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
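The diagnostic steps suggested at the top of this thread can be sketched as a short shell session. This is only a sketch: the hostname (host4) is taken from the logs above as an example, the temporary key path is a placeholder, and the SSH-test commands in step 3 follow the quincy cephadm troubleshooting docs, so adapt them to your environment. These commands must run against a live cephadm-managed cluster.

```shell
# 1) Look for the traceback behind the "unexpected exception":
ceph health detail
ceph log last 200 info cephadm | grep -i -A 20 traceback

# 2) On a suspect host, run the host check directly with a local copy of
#    the cephadm binary, passing the name cephadm knows the host by
#    (host4 here is an example name from the logs above):
cephadm check-host --expect-hostname host4

# 3) If the local check passes, test the mgr's SSH path to the host instead,
#    using cephadm's own SSH config and identity key (per the troubleshooting
#    docs; /tmp/cephadm_key is a placeholder path):
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 0600 /tmp/cephadm_key
ssh -F <(ceph cephadm get-ssh-config) -i /tmp/cephadm_key root@host4
```

If step 2 succeeds but step 3 fails, that points at the SSH/connection side rather than the host check itself, which is where the Pacific-to-Quincy SSH library change would matter.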