Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

Thank you, Adam, for your response.

I tried all of your suggestions and the troubleshooting link you sent. From
inside the Quincy mgr containers, I can ssh into all the other Pacific nodes
successfully by running the exact command from the log output, and vice versa.

Here are some debug logs from cephadm during the upgrade:

2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 : cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 : cephadm [DBG] _run_cephadm : command = ls
2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 : cephadm [DBG] _run_cephadm : args = []
2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 : cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 : cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 : cephadm [DBG] Running command: sudo which python3
2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 : cephadm [DBG] _reset_con close host1
2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 : cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 : cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 : cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 : cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview refresh
2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 : cephadm [DBG] Host "host1" marked as offline. Skipping autotune
2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 : cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host1 at addr (x.x.x.x)

As far as I can see here, sudo was added to the command that cephadm runs on
the remote host:

https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
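Judging from that line, the behaviour would be roughly this (a minimal sketch
of the sudo-prefixing logic, not the actual ssh.py code; the function name is
mine):

```python
# Minimal sketch (hypothetical helper name) of what the linked ssh.py code
# appears to do: when the configured ssh user is not root, cephadm prefixes
# every remote command with sudo, so the probe becomes "sudo which python3".
def build_remote_cmd(ssh_user: str, cmd: list[str]) -> list[str]:
    sudo_prefix = ["sudo"] if ssh_user != "root" else []
    return sudo_prefix + cmd

print(build_remote_cmd("cephadmin", ["which", "python3"]))
# → ['sudo', 'which', 'python3']
```

If sudo is denied by policy on the remote host, that prefixed command fails,
which would match the non-zero exit status in the mgr log above.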

For policy reasons, I cannot grant the cephadmin user sudo privileges. Could
this be the root cause of the issue?
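For reference, a non-root cephadm ssh user is expected to have passwordless
sudo on every host. A typical sudoers entry (assuming the user is named
cephadmin, as in the logs above) would be:

```text
# /etc/sudoers.d/cephadmin -- passwordless sudo for the cephadm ssh user
cephadmin ALL=(ALL) NOPASSWD: ALL
```

If policy forbids this, the sudo-prefixed probe will keep failing regardless
of whether plain ssh works.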

Best regards,
Reza

On Thu, 6 Apr 2023 at 14:59, Adam King <adking@xxxxxxxxxx> wrote:

> Does "ceph health detail" give any insight into what the unexpected
> exception was? If not, I'm pretty confident some traceback would end up
> being logged. Could maybe still grab it with "ceph log last 200 info
> cephadm" if not a lot else has happened. Also, probably need to find out if
> the check-host is failing due to the check on the host actually failing or
> failing to connect to the host. Could try putting a copy of the cephadm
> binary on one and running "cephadm check-host --expect-hostname <hostname>"
> where the hostname is the name cephadm knows the host by. If that's not an
> issue I'd expect it's a connection thing. Could maybe try going through
>  https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors.
> Cephadm changed its backend ssh library between Pacific and Quincy because
> the one used in Pacific is no longer supported, so it's possible some
> general ssh error has popped up in your environment as a result.
>
> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx>
> wrote:
>
>> Hi all,
>>
>> I have a problem upgrading a Ceph cluster from Pacific to Quincy with
>> cephadm. I successfully upgraded the cluster to the latest Pacific
>> (16.2.11), but when I run the following command to upgrade to 17.2.5, the
>> process stops with "Unexpected error" after upgrading 3/4 mgrs.
>> (Everything is on a private network.)
>>
>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>
>> I also tried the 17.2.4 version.
>>
>> cephadm fails to check the hosts' status and marks them as offline:
>>
>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG]  host host4 (x.x.x.x) failed check
>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR]     host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR]     host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR]     host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR]     host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR]     host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3
>>
>> and here are some outputs of the commands:
>>
>> [root@host8 ~]# ceph -s
>>   cluster:
>>     id:     xxx
>>     health: HEALTH_ERR
>>             9 hosts fail cephadm check
>>             Upgrade: failed due to an unexpected exception
>>
>>   services:
>>     mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>     mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih, host1.warjsr, host2.qyavjj
>>     mds: 1/1 daemons up, 3 standby
>>     osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>
>>   data:
>>
>>
>>   io:
>>     client:
>>
>>   progress:
>>     Upgrade to 17.2.5 (0s)
>>       [............................]
>>
>> [root@host8 ~]# ceph orch upgrade status
>> {
>>     "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>     "in_progress": true,
>>     "which": "Upgrading all daemon types on all hosts",
>>     "services_complete": [],
>>     "progress": "3/59 daemons upgraded",
>>     "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
>>     "is_paused": true
>> }
>> [root@host8 ~]# ceph cephadm check-host host7
>> check-host failed:
>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
>> [root@host8 ~]# ceph versions
>> {
>>     "mon": {
>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>>     },
>>     "mgr": {
>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>>     },
>>     "mds": {
>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>>     },
>>     "overall": {
>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>     }
>> }
>>
>> The strange thing is that I can roll the cluster back by failing over to a
>> not-yet-upgraded mgr like this:
>>
>> ceph mgr fail
>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>
>> Would you happen to have any idea about this?
>>
>> Best regards,
>> Reza
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



