Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

Ah, okay. I believe someone else opened an issue about the same thing after
the 17.2.5 release. It was changed in 17.2.6 to only use sudo for non-root
users:
https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153.
But it looks like you're using a non-root user anyway. We've required
passwordless sudo access for custom ssh users for a long time (e.g. it's in
the Pacific docs,
https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap,
see the point on "--ssh-user"). Did this actually work for you in Pacific
with a non-root user that doesn't have sudo privileges? I had assumed that
never worked.
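Roughly, the v17.2.6 change linked above amounts to conditionally prefixing the remote command with sudo. A minimal sketch of that logic (the function name and structure here are illustrative, not the actual cephadm internals):

```python
# Hypothetical sketch of the v17.2.6 behavior: prepend "sudo" to the remote
# command only when the configured ssh user is not root. This mirrors the
# logic linked above but is NOT the actual cephadm code.
def build_remote_cmd(ssh_user, cmd):
    sudo = ["sudo"] if ssh_user != "root" else []
    return sudo + list(cmd)

print(build_remote_cmd("root", ["which", "python3"]))       # ['which', 'python3']
print(build_remote_cmd("cephadmin", ["which", "python3"]))  # ['sudo', 'which', 'python3']
```

On the host side, passwordless sudo for a custom ssh user is typically granted with a sudoers entry along the lines of `cephadmin ALL=(ALL) NOPASSWD: ALL` (adjust the user name and command list to your policy); without it, a command like `sudo which python3` over a non-interactive ssh session will fail.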

On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx>
wrote:

> Thank you Adam for your response,
>
> I tried all your comments and the troubleshooting link you sent. From the
> Quincy mgr containers, I can ssh into all the other Pacific nodes
> successfully by running the exact command from the log output, and vice
> versa.
>
> Here are some debug logs from the cephadm while updating:
>
> 2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 :
> cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options
> '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
> 2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 :
> cephadm [DBG] _run_cephadm : command = ls
> 2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 :
> cephadm [DBG] _run_cephadm : args = []
> 2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 :
> cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
> 2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 :
> cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
> 2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 :
> cephadm [DBG] Running command: sudo which python3
> 2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 :
> cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit
> status 3
> 2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 :
> cephadm [DBG] _reset_con close host1
> 2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 :
> cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
> 2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 :
> cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
> 2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 :
> cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
> 2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 :
> cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview
> refresh
> 2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 :
> cephadm [DBG] Host "host1" marked as offline. Skipping autotune
> 2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 :
> cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit
> status 3
> 2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 :
> cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed
> to connect to host host1 at addr (x.x.x.x)
>
> As I can see here, it turns out sudo was added to the code and is required
> in order to continue:
>
>
> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>
> I cannot grant the cephadmin user sudo privileges due to some policy
> restrictions. Could this be the root cause of the issue?
>
> Best regards,
> Reza
>
> On Thu, 6 Apr 2023 at 14:59, Adam King <adking@xxxxxxxxxx> wrote:
>
>> Does "ceph health detail" give any insight into what the unexpected
>> exception was? If not, I'm pretty confident some traceback would end up
>> being logged. Could maybe still grab it with "ceph log last 200 info
>> cephadm" if not a lot else has happened. Also, probably need to find out if
>> the check-host is failing due to the check on the host actually failing or
>> failing to connect to the host. Could try putting a copy of the cephadm
>> binary on one and running "cephadm check-host --expect-hostname <hostname>"
>> where the hostname is the name cephadm knows the host by. If that's not an
>> issue I'd expect it's a connection thing. Could maybe try going through
>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors.
>> Cephadm changed the backend ssh library from pacific to quincy due to the
>> one used in pacific no longer being supported so it's possible some general
>> ssh error has popped up in your env as a result.
>>
>> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a problem upgrading a Ceph cluster from Pacific to Quincy with
>>> cephadm. I successfully upgraded the cluster to the latest Pacific
>>> (16.2.11), but when I run the following command to upgrade to 17.2.5,
>>> the upgrade process stops with "Unexpected error" after upgrading 3/4
>>> mgrs. (Everything is on a private network.)
>>>
>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>>
>>> I also tried the 17.2.4 version.
>>>
>>> cephadm fails to check the hosts' status and marks them as offline:
>>>
>>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG]  host host4 (x.x.x.x) failed check
>>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
>>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR]     host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
>>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR]     host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
>>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR]     host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
>>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR]     host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
>>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR]     host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3
>>>
>>> and here are some outputs of the commands:
>>>
>>> [root@host8 ~]# ceph -s
>>>   cluster:
>>>     id:     xxx
>>>     health: HEALTH_ERR
>>>             9 hosts fail cephadm check
>>>             Upgrade: failed due to an unexpected exception
>>>
>>>   services:
>>>     mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>>     mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
>>> host1.warjsr, host2.qyavjj
>>>     mds: 1/1 daemons up, 3 standby
>>>     osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>>
>>>   data:
>>>
>>>
>>>   io:
>>>     client:
>>>
>>>   progress:
>>>     Upgrade to 17.2.5 (0s)
>>>       [............................]
>>>
>>> [root@host8 ~]# ceph orch upgrade status
>>> {
>>>     "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>>     "in_progress": true,
>>>     "which": "Upgrading all daemon types on all hosts",
>>>     "services_complete": [],
>>>     "progress": "3/59 daemons upgraded",
>>>     "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
>>> unexpected exception",
>>>     "is_paused": true
>>> }
>>> [root@host8 ~]# ceph cephadm check-host host7
>>> check-host failed:
>>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
>>> [root@host8 ~]# ceph versions
>>> {
>>>     "mon": {
>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
>>> pacific (stable)": 5
>>>     },
>>>     "mgr": {
>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
>>> pacific (stable)": 1,
>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
>>> quincy (stable)": 3
>>>     },
>>>     "osd": {
>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
>>> pacific (stable)": 37
>>>     },
>>>     "mds": {
>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
>>> pacific (stable)": 4
>>>     },
>>>     "overall": {
>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
>>> pacific (stable)": 47,
>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
>>> quincy (stable)": 3
>>>     }
>>> }
>>>
>>> The strange thing is that I can roll back the cluster state by failing
>>> over to a not-yet-upgraded mgr like this:
>>>
>>> ceph mgr fail
>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>>
>>> Would you happen to have any idea about this?
>>>
>>> Best regards,
>>> Reza
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>>



