For setting the user, the `ceph cephadm set-user` command should do it. I'm a bit surprised by the second part of that, though; with passwordless sudo access I would have expected that to start working.
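For anyone following along, a minimal sketch of making that change and then re-checking a host afterwards (host1 is an example hostname; both commands appear elsewhere in this thread):

    # point cephadm at the root user for its SSH connections
    ceph cephadm set-user root
    # confirm cephadm can reach a managed host again
    ceph cephadm check-host host1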
On Thu, May 4, 2023 at 11:27 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:

> Thank you.
> I don't see any errors other than:
>
> 2023-05-04T15:07:38.003+0000 7ff96cbe0700 0 log_channel(cephadm) log [DBG] : Running command: sudo which python3
> 2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log [DBG] : Connection to host1 failed. Process exited with non-zero exit status 3
> 2023-05-04T15:07:38.025+0000 7ff96cbe0700 0 log_channel(cephadm) log [DBG] : _reset_con close host1
>
> What is the best way to safely change the cephadm user to root for the existing cluster? It seems "ceph cephadm set-ssh-config" is not effective. (BTW, my cephadmin user can now run "sudo which python3" without being prompted for a password on the other hosts, but nothing has been solved.)
>
> Best regards,
> Reza
>
> On Tue, 2 May 2023 at 19:00, Adam King <adking@xxxxxxxxxx> wrote:
>
>> The number of mgr daemons thing is expected. The way it works is that it first upgrades all the standby mgrs (which will be all but one) and then fails over so the previously active mgr can be upgraded as well. It's only after that failover that it's actually running the newer cephadm code, which is when you're hitting this issue. Are the logs still saying something similar about how "sudo which python3" is failing? I'm thinking this might just be a general issue with the user being used not having passwordless sudo access, which sort of accidentally worked in pacific but no longer works in quincy. If the log lines confirm the same, we might have to work on something to handle this case (making the sudo optional somehow). As mentioned in the previous email, that setup wasn't intended to be supported even in pacific, although if it did work there, we could bring something in to make it usable from quincy onward as well.
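For reference, the passwordless sudo access being discussed is typically granted with a sudoers drop-in; a sketch, assuming the SSH user is cephadmin as in the logs and that site policy allows it (the file name is arbitrary):

    # create a drop-in granting the cephadm SSH user passwordless sudo
    echo 'cephadmin ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/cephadmin
    chmod 440 /etc/sudoers.d/cephadmin
    # validate the syntax before relying on it
    visudo -cf /etc/sudoers.d/cephadmin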
>> On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:
>>
>>> Hi Adam,
>>>
>>> I'm still struggling with this issue. I also checked it one more time with newer versions: upgrading the cluster from 16.2.11 to 16.2.12 was successful, but going from 16.2.12 to 17.2.6 failed again with the same ssh errors (I went through https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a couple of times and all keys/access are fine).
>>>
>>> [root@host1 ~]# ceph health detail
>>> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>>> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>>>     SSH connection failed to host2 at addr (x.x.x.x): Host(s) were marked offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5', 'host1', 'host7', 'host8'}
>>>
>>> The interesting thing is that it is always (total number of mgrs) - 1 that get upgraded: if I provision 5 mgrs then 4 of them, and with 3, 2 of them!
>>>
>>> Since I'm in an internal environment, I also checked the process with the Quincy cephadm binary. FYI, I'm using stretch mode on this cluster.
>>>
>>> I don't understand why the Quincy mgrs cannot ssh into the Pacific nodes; if you have any more hints I would be really glad to hear them.
>>>
>>> Best regards,
>>> Reza
>>>
>>> On Wed, 12 Apr 2023 at 17:18, Adam King <adking@xxxxxxxxxx> wrote:
>>>
>>>> Ah, okay. Someone else had opened an issue about the same thing after the 17.2.5 release, I believe. It's been changed in 17.2.6 to only use sudo for non-root users: https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153. But it looks like you're also using a non-root user anyway. We've required passwordless sudo access for custom ssh users for a long time, I think (e.g. it's in the pacific docs, https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap, see the point on "--ssh-user"). Did this actually work for you before in pacific with a non-root user that doesn't have sudo privileges? I had assumed that had never worked.
>>>>
>>>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:
>>>>
>>>>> Thank you Adam for your response,
>>>>>
>>>>> I tried all your suggestions and the troubleshooting link you sent. From the Quincy mgr containers I can ssh into all the other Pacific nodes successfully by running the exact command from the log output, and vice versa.
>>>>>
>>>>> Here are some debug logs from cephadm while upgrading:
>>>>>
>>>>> 2023-04-12T11:35:56.260958+0000 mgr.host8.jukgqm (mgr.4468627) 103 : cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
>>>>> 2023-04-12T11:35:56.525091+0000 mgr.host8.jukgqm (mgr.4468627) 144 : cephadm [DBG] _run_cephadm : command = ls
>>>>> 2023-04-12T11:35:56.525406+0000 mgr.host8.jukgqm (mgr.4468627) 145 : cephadm [DBG] _run_cephadm : args = []
>>>>> 2023-04-12T11:35:56.525571+0000 mgr.host8.jukgqm (mgr.4468627) 146 : cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>>>> 2023-04-12T11:35:56.525619+0000 mgr.host8.jukgqm (mgr.4468627) 147 : cephadm [DBG] args: --image my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add ls
>>>>> 2023-04-12T11:35:56.525738+0000 mgr.host8.jukgqm (mgr.4468627) 148 : cephadm [DBG] Running command: sudo which python3
>>>>> 2023-04-12T11:35:56.534227+0000 mgr.host8.jukgqm (mgr.4468627) 149 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
>>>>> 2023-04-12T11:35:56.534275+0000 mgr.host8.jukgqm (mgr.4468627) 150 : cephadm [DBG] _reset_con close host1
>>>>> 2023-04-12T11:35:56.540135+0000 mgr.host8.jukgqm (mgr.4468627) 158 : cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
>>>>> 2023-04-12T11:35:56.540178+0000 mgr.host8.jukgqm (mgr.4468627) 159 : cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
>>>>> 2023-04-12T11:35:56.540408+0000 mgr.host8.jukgqm (mgr.4468627) 160 : cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
>>>>> 2023-04-12T11:35:56.540490+0000 mgr.host8.jukgqm (mgr.4468627) 161 : cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview refresh
>>>>> 2023-04-12T11:35:56.540527+0000 mgr.host8.jukgqm (mgr.4468627) 162 : cephadm [DBG] Host "host1" marked as offline. Skipping autotune
>>>>> 2023-04-12T11:35:56.540978+0000 mgr.host8.jukgqm (mgr.4468627) 163 : cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3
>>>>> 2023-04-12T11:35:56.796966+0000 mgr.host8.jukgqm (mgr.4468627) 728 : cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host1 at addr (x.x.x.x)
>>>>>
>>>>> As far as I can see here, sudo is added to the command in the code:
>>>>>
>>>>> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>>>>>
>>>>> I cannot grant the cephadmin user sudo privileges for policy reasons; could this be the root cause of the issue?
>>>>>
>>>>> Best regards,
>>>>> Reza
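A way to reproduce by hand what the mgr is doing here, based on the connection options in the debug log above (the /tmp config and identity files are per-connection temp files created by the mgr, so the exact names will differ, and x.x.x.x stands in for the host's address):

    # run the same probe cephadm runs over its SSH connection
    ssh -F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr \
        cephadmin@x.x.x.x sudo which python3
    # cephadm treats a non-zero exit status as a failed host check
    echo $?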
>>>>> On Thu, 6 Apr 2023 at 14:59, Adam King <adking@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Does "ceph health detail" give any insight into what the unexpected exception was? If not, I'm pretty confident some traceback would end up being logged. You could maybe still grab it with "ceph log last 200 info cephadm" if not a lot else has happened. Also, we probably need to find out whether check-host is failing because the check on the host actually fails or because cephadm cannot connect to the host. You could try putting a copy of the cephadm binary on one and running "cephadm check-host --expect-hostname <hostname>", where the hostname is the name cephadm knows the host by. If that's not an issue, I'd expect it's a connection thing; maybe try going through https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors. Cephadm changed the backend ssh library from pacific to quincy because the one used in pacific is no longer supported, so it's possible some general ssh error has popped up in your env as a result.
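The checks suggested above, gathered in one place (host1 is an example hostname, and cephadm here is a standalone copy of the binary placed on the host being checked):

    # from a node with the admin keyring: look for the upgrade traceback
    ceph health detail
    ceph log last 200 info cephadm
    # on the affected host itself, using the name cephadm knows it by
    cephadm check-host --expect-hostname host1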
>>>>>> On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi <reza.b2008@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem upgrading a Ceph cluster from Pacific to Quincy with cephadm. I successfully upgraded the cluster to the latest Pacific (16.2.11), but when I run the following command to upgrade to 17.2.5, the upgrade process stops with "Unexpected error" after upgrading 3/4 mgrs (everything is on a private network):
>>>>>>>
>>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>>>>>>>
>>>>>>> I also tried the 17.2.4 version.
>>>>>>>
>>>>>>> cephadm fails to check the hosts' status and marks them as offline:
>>>>>>>
>>>>>>> cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782 : cephadm [DBG] host host4 (x.x.x.x) failed check
>>>>>>> cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783 : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
>>>>>>> cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784 : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts refresh
>>>>>>> cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785 : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
>>>>>>> cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786 : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
>>>>>>> cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787 : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview refresh
>>>>>>> cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788 : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
>>>>>>> cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed due to an unexpected exception
>>>>>>> cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
>>>>>>> cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host host7. Process exited with non-zero exit status 3
>>>>>>> cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host host2. Process exited with non-zero exit status 3
>>>>>>> cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host host8. Process exited with non-zero exit status 3
>>>>>>> cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host host4. Process exited with non-zero exit status 3
>>>>>>> cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host host3. Process exited with non-zero exit status 3
>>>>>>>
>>>>>>> and here are some outputs of the commands:
>>>>>>>
>>>>>>> [root@host8 ~]# ceph -s
>>>>>>>   cluster:
>>>>>>>     id:     xxx
>>>>>>>     health: HEALTH_ERR
>>>>>>>             9 hosts fail cephadm check
>>>>>>>             Upgrade: failed due to an unexpected exception
>>>>>>>
>>>>>>>   services:
>>>>>>>     mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
>>>>>>>     mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih, host1.warjsr, host2.qyavjj
>>>>>>>     mds: 1/1 daemons up, 3 standby
>>>>>>>     osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>>>>>>>
>>>>>>>   data:
>>>>>>>
>>>>>>>   io:
>>>>>>>     client:
>>>>>>>
>>>>>>>   progress:
>>>>>>>     Upgrade to 17.2.5 (0s)
>>>>>>>       [............................]
>>>>>>>
>>>>>>> [root@host8 ~]# ceph orch upgrade status
>>>>>>> {
>>>>>>>     "target_image": "my-private-repo/quay-io/ceph/ceph@sha256:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
>>>>>>>     "in_progress": true,
>>>>>>>     "which": "Upgrading all daemon types on all hosts",
>>>>>>>     "services_complete": [],
>>>>>>>     "progress": "3/59 daemons upgraded",
>>>>>>>     "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception",
>>>>>>>     "is_paused": true
>>>>>>> }
>>>>>>>
>>>>>>> [root@host8 ~]# ceph cephadm check-host host7
>>>>>>> check-host failed:
>>>>>>> Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
>>>>>>>
>>>>>>> [root@host8 ~]# ceph versions
>>>>>>> {
>>>>>>>     "mon": {
>>>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 5
>>>>>>>     },
>>>>>>>     "mgr": {
>>>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 1,
>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>>>     },
>>>>>>>     "osd": {
>>>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 37
>>>>>>>     },
>>>>>>>     "mds": {
>>>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 4
>>>>>>>     },
>>>>>>>     "overall": {
>>>>>>>         "ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable)": 47,
>>>>>>>         "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> The strange thing is that I can roll back the cluster by failing over to a not-yet-upgraded mgr like this:
>>>>>>>
>>>>>>> ceph mgr fail
>>>>>>> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
>>>>>>>
>>>>>>> Would you happen to have any idea about this?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Reza
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx