The behaviour you're seeing with the number of mgr daemons (all but one being upgraded) is expected. The upgrade first updates all the standby mgrs (which will be all but one) and then fails over so that the previously active mgr can be upgraded as well. Only after that failover is the cluster actually running the newer cephadm code, which is when you're hitting this issue. Are the logs still saying something similar about "sudo which python3" failing? I'm thinking this might just be a general issue of the ssh user not having passwordless sudo access, something that accidentally worked in pacific but no longer works in quincy. If the log lines confirm that, we might have to work on something to handle this case (making the sudo optional somehow). As mentioned in the previous email, that setup wasn't intended to be supported even in pacific, although if it did work there, we could bring something in to make it usable from quincy onward as well.

On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:

> Hi Adam,
>
> I'm still struggling with this issue. I also checked it one more time with
> newer versions: upgrading the cluster from 16.2.11 to 16.2.12 was
> successful, but from 16.2.12 to 17.2.6 it failed again with the same ssh
> errors (I checked
> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a
> couple of times and all keys/access are fine).
>
> [root@host1 ~]# ceph health detail
> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>     SSH connection failed to host2 at addr (x.x.x.x): Host(s) were marked
>     offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5', 'host1',
>     'host7', 'host8'}
>
> The interesting thing is that it is always (total number of mgrs) - 1 that
> get upgraded: if I provision 5 MGRs then 4 of them are upgraded, and with
> 3 of them, 2!
>
> Since I'm in an internal environment, I also checked the process with the
> Quincy cephadm binary file. FYI, I'm using stretch mode on this cluster.
>
> I don't understand why the Quincy MGRs cannot ssh into the Pacific nodes;
> if you have any more hints I would be really glad to hear them.
>
> Best regards,
> Reza
>
> On Wed, 12 Apr 2023 at 17:18, Adam King <adking@xxxxxxxxxx> wrote:
>
>> Ah, okay. I believe someone else opened an issue about the same thing
>> after the 17.2.5 release. It has been changed in 17.2.6 to only use sudo
>> for non-root users:
>> https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153.
>> But it looks like you're using a non-root user anyway. We've required
>> passwordless sudo access for custom ssh users for a long time, I think
>> (e.g. it's in the pacific docs,
>> https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap,
>> see the point on "--ssh-user"). Did this actually work for you before in
>> pacific with a non-root user that doesn't have sudo privileges? I had
>> assumed that had never worked.
>>
>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:
>>
>>> Thank you, Adam, for your response.
>>>
>>> I tried all your suggestions and the troubleshooting link you sent. From
>>> inside the Quincy mgr containers, I can ssh into all the other Pacific
>>> nodes successfully by running the exact command from the log output, and
>>> vice versa.
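For reference, the step that fails in the debug output below is not the ssh login itself but the remote command cephadm runs right after connecting ("sudo which python3"). A minimal way to reproduce it by hand, assuming the cephadmin user and host1 from the logs, and using the key export described on the troubleshooting page linked above:

    ceph config-key get mgr/cephadm/ssh_identity_key > ./cephadm_key   # export the cluster's ssh identity key
    chmod 600 ./cephadm_key
    ssh -i ./cephadm_key cephadmin@host1 'sudo which python3'          # the same command cephadm runs after connecting
    echo $?                                                            # non-zero here reproduces the reported failure

If the plain ssh login succeeds but this command exits non-zero (for example because sudo prompts for a password or is denied), that would match the failure cephadm reports even though all the ssh-level checks pass.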
>>>
>>> Here are some debug logs from cephadm while upgrading:
>>>
>>> 2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 : cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
>>> 2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 : cephadm [DBG] _run_cephadm : command = ls
>>> 2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 : cephadm [DBG] _run_cephadm : args = []
>>> 2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 : cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>> 2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 : cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
>>> 2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 : cephadm [DBG] Running command: sudo which python3
>>> 2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
>>> 2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 : cephadm [DBG] _reset_con close host1
>>> 2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 : cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
>>> 2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 : cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
>>> 2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 : cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
>>> 2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 : cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview refresh
>>> 2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 : cephadm [DBG] Host "host1" marked as offline. Skipping autotune
>>> 2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
>>> 2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 : cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host1 at addr (x.x.x.x)
>>>
>>> As far as I can see here, sudo has been added to the code and has to
>>> succeed for the upgrade to continue:
>>>
>>> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>>>
>>> I cannot give the cephadmin user sudo privileges for policy reasons;
>>> could this be the root cause of the issue?
>>>
>>> Best regards,
>>> Reza
>>>
>>> On Thu, 6 Apr 2023 at 14:59, Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> Does "ceph health detail" give any insight into what the unexpected
>>>> exception was? If not, I'm pretty confident some traceback would end up
>>>> being logged. You could maybe still grab it with "ceph log last 200 info
>>>> cephadm" if not a lot else has happened since. Also, we probably need to
>>>> find out whether check-host is failing because the check on the host
>>>> actually fails or because cephadm fails to connect to the host. You
>>>> could try putting a copy of the cephadm binary on one of the hosts and
>>>> running "cephadm check-host --expect-hostname <hostname>", where the
>>>> hostname is the name cephadm knows the host by.
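To make that suggestion concrete, a minimal run might look like the following, where "host2" and the binary's location are placeholders; use whatever name "ceph orch host ls" reports for the host being checked:

    # run locally on the host being checked, with the cephadm binary copied there
    chmod +x ./cephadm
    ./cephadm check-host --expect-hostname host2
    echo $?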
>>>> If that's not an issue, I'd expect it's a connection thing. You could
>>>> maybe try going through
>>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors.
>>>> Cephadm changed its backend ssh library between pacific and quincy
>>>> because the one used in pacific is no longer supported, so it's possible
>>>> some general ssh error has popped up in your env as a result.
>>>>
>>>> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem upgrading a Ceph cluster from Pacific to Quincy with
>>>>> cephadm. I successfully upgraded the cluster to the latest Pacific
>>>>> (16.2.11), but when I run the following command to upgrade the cluster
>>>>> to 17.2.5, the upgrade process stops with "Unexpected error" after
>>>>> upgrading 3/4 mgrs (everything is on a private network):
>>>>>
>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>>>>
>>>>> I also tried the 17.2.4 version.
>>>>>
>>>>> cephadm fails to check the hosts' status and marks them as offline:
>>>>>
>>>>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG] host host4 (x.x.x.x) failed check
>>>>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>>>>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>>>>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>>>>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>>>>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>>>>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>>>>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
>>>>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>>>>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
>>>>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
>>>>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
>>>>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
>>>>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3
>>>>>
>>>>> and here are the outputs of some commands:
>>>>>
>>>>> [root@host8 ~]# ceph -s
>>>>>   cluster:
>>>>>     id:     xxx
>>>>>     health: HEALTH_ERR
>>>>>             9 hosts fail cephadm check
>>>>>             Upgrade: failed due to an unexpected exception
>>>>>
>>>>>   services:
>>>>>     mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>>>>     mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih, host1.warjsr, host2.qyavjj
>>>>>     mds: 1/1 daemons up, 3 standby
>>>>>     osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>>>>
>>>>>   data:
>>>>>
>>>>>   io:
>>>>>     client:
>>>>>
>>>>>   progress:
>>>>>     Upgrade to 17.2.5 (0s)
>>>>>       [............................]
>>>>>
>>>>> [root@host8 ~]# ceph orch upgrade status
>>>>> {
>>>>>     "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>>>>     "in_progress": true,
>>>>>     "which": "Upgrading all daemon types on all hosts",
>>>>>     "services_complete": [],
>>>>>     "progress": "3/59 daemons upgraded",
>>>>>     "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
>>>>>     "is_paused": true
>>>>> }
>>>>>
>>>>> [root@host8 ~]# ceph cephadm check-host host7
>>>>> check-host failed:
>>>>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
>>>>>
>>>>> [root@host8 ~]# ceph versions
>>>>> {
>>>>>     "mon": {
>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>>>>>     },
>>>>>     "mgr": {
>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>     },
>>>>>     "osd": {
>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>>>>>     },
>>>>>     "mds": {
>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>>>>>     },
>>>>>     "overall": {
>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>     }
>>>>> }
>>>>>
>>>>> The strange thing is that I can roll the cluster back by failing over
>>>>> to a not-yet-upgraded mgr like this:
>>>>>
>>>>> ceph mgr fail
>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>>>>
>>>>> Would you happen to have any idea about this?
>>>>>
>>>>> Best regards,
>>>>> Reza
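For completeness: the requirement referred to at the top of the thread, passwordless sudo for a custom --ssh-user, is usually satisfied with a sudoers drop-in along the following lines. This is only an illustration, assuming the ssh user is named cephadmin as in this thread; the exact rule (and whether it is acceptable at all) is a site policy decision. Without it, or without root as the ssh user, the "sudo which python3" check shown in the logs above is expected to keep failing on quincy.

    # /etc/sudoers.d/cephadmin -- example only, validate with: visudo -cf /etc/sudoers.d/cephadmin
    cephadmin ALL=(ALL) NOPASSWD: ALL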