SOLVED - Re: Failure to bootstrap cluster with cephadm - unable to reach (localhost)

ceph-mail@xxxxxxxxxxxxxxxx · Tue, 09 Aug 2022 12:21:26 +0000

Hi all,

I didn't know whether it was only me not receiving the posts to the list or whether no one was receiving them, but I did receive a response to my most recent email, so I guess I'm writing these emails in a poor or uninteresting way.

In any case, here is the final update on this issue. I spent a few weeks troubleshooting the failed bootstrap: investigating the ssh server settings and the copying of ssh keys, wondering whether there was a bug in the container's version of asyncssh, checking that I had the correct versions of Python (3.9 on the host, 3.6 in the container), and going through the verbose logs outside and inside various containers on the proxmox hosts and on the clean debian hosts. I was finally able to find the issue!

It turns out cephadm requires sudo to be installed on the host, and has a fatal failure if it's not installed. Unfortunately, this is neither documented as a requirement nor checked for in check-host. I just needed to do apt install sudo on the proxmox host to fix the issue.

What happens is that cephadm, as part of adding a new host, now runs "sudo true" and checks the result, even if run as root. As there is no problem bootstrapping a cluster with pacific, it seems this test (sudo true) was added without the documentation or check-host and prepare-host being updated as well.
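For anyone who wants to check a host by hand before adding it, here is a minimal sketch of the kind of preflight check I would have liked check-host to perform. It is not cephadm code: the function name missing_commands and the REQUIRED_COMMANDS list are my own assumptions for illustration, and the real set of prerequisites may well be longer.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Hypothetical list for illustration only; this is NOT cephadm's actual
# list of host requirements.
REQUIRED_COMMANDS = ("sudo", "python3")


def missing_commands(
    required: Iterable[str] = REQUIRED_COMMANDS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required commands that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests;
    by default it is shutil.which, which returns None when the
    command is not found.
    """
    return [cmd for cmd in required if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found")
```

Run on the host itself (not inside a container), this would have reported sudo as missing on my clean proxmox install.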

Investigating further, I found that the command:

r = await conn.run('sudo true', check=True, timeout=5)

resulting in the error:

Unable to reach remote host xxxx. Process exited with non-zero exit status 127

comes from line 143 in ssh.py<https://github.com/ceph/ceph/blob/v17.2.3/src/pybind/mgr/cephadm/ssh.py#L143> and seems to have been added with this commit<https://github.com/ceph/ceph/commit/8ff2bcf6b0b53c3928c20a67f3da2003f858b3fb>.
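The exit status 127 is the useful clue here: by shell convention it means "command not found", so the remote shell could not find sudo at all, as opposed to sudo running and refusing. The sketch below illustrates that convention locally with subprocess instead of asyncssh. The helper names describe_exit_status and run_via_shell are my own, and the 5 second timeout only mirrors the one in the call above.

```python
import subprocess


def describe_exit_status(status: int) -> str:
    """Map a shell exit status to its conventional meaning."""
    if status == 0:
        return "success"
    if status == 126:
        return "command found but not executable"
    if status == 127:
        return "command not found"
    if status > 128:
        # Shells report death by signal N as 128 + N.
        return f"terminated by signal {status - 128}"
    return "command ran and reported failure"


def run_via_shell(command: str, timeout: float = 5.0) -> int:
    """Run `command` through sh and return its exit status.

    Roughly what happens on the remote side of an ssh exec request,
    where the command string is handed to the user's shell.
    """
    completed = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode
```

On a host without sudo, describe_exit_status(run_via_shell("sudo true")) should give "command not found", which matches the 127 in the error above.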

Nothing in the commit or linked ticket seems to require the sudo check, so maybe it's an artefact from another change that just got included and since sudo is almost always installed by default, no one else stumbled upon it. Perhaps it can be removed?

I hope the details above help someone else and perhaps lead to a fix for cephadm.

Best

________________________________

Sent: 05 August 2022 17:17
Subject: Re: Failure to bootstrap cluster with cephadm - unable to reach (localhost)

Hi again,

Further updates about the issue. The difference seems to be related to the new ssh.py that replaced the previous functionality in PR 42051<https://github.com/ceph/ceph/pull/42051>.

As mentioned initially, I'm getting these problems when trying to bootstrap quincy on a clean install of proxmox. It works on a clean debian installation in a vm, but I have not been able to understand the difference. Bootstrapping quincy 17.2.3 on a new debian vm and then trying to add the proxmox host gives the same error about not being able to reach the host, but it also gave me some additional information.

-----
mgr.quincy-mon1.xxx [DBG] Sleeping for 60 seconds
mgr.quincy-mon1.xxx [DBG] Opening connection to root@xxxxxxxxxxxxxxx with ssh options '-F /tmp/cephadm-conf-24 -i/tpm/ceph-identity-djkg'
mgr.quincy-mon1.xxx [DBG] _run_cephadm : command = check-host
mgr.quincy-mon1.xxx [DBG] _run_cephadm : args = ['--expect-hostname', 'pvexxxx']
mgr.quincy-mon1.xxx [DBG] args: check-host --expect-hostname pvexxxx
mgr.quincy-mon1.xxx [DBG] Opening connection to root@xxxxxxxxxxxxxxx with ssh options '-F /tmp/cephadm-conf-24 -i/tpm/ceph-identity-djkg'
mgr.quincy-mon1.xxx [DBG] Running command: which python3
mgr.quincy-mon1.xxx [DBG] Connection to pvexxxx failed. Process exited with non-zero exit status 127
mgr.quincy-mon1.xxx [DBG] _reset_con close pvexxxx
mgr.quincy-mon1.xxx [ERR] Unable to reach remote host pvexxxx. Process exited with non-zero exit status 127
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 143, in _execute_command
    r = await conn.run('sudo true', check=True, timeout=5)
  File "/lib/python3.6/site-packages/asyncssh/connection.py", line 3637, in run
    return await process.wait(check, timeout)
  File "/lib/python3.6/site-packages/asyncssh/process.py", line 1257, in wait
    self.returncode, stdout_data, stderr_data
asyncssh.process.ProcessError: Process exited with non-zero exit status 127

During handling of the above exception, another exception occurred
-----

Any ideas or suggestions on how to troubleshoot further?

Thanks

________________________________
Sent: 01 August 2022 16:26
Subject: Re: Failure to bootstrap cluster with cephadm - unable to reach (localhost)

Hi all,

Some updated information on my issue. I have now tried to bootstrap a cluster using images v17.2.2 (original attempt), v17.2.1, v17.2.3 and 16.2.10.

All Quincy container images failed but the Pacific image had no problem, worked like a charm.

Was there any change between Pacific and Quincy related to how hosts are added or with the container network that could point me towards an explanation?

I think I'll attempt to update the cluster from Pacific to Quincy, see if that works, and see if it's possible to add Quincy hosts to the cluster afterwards.

Thanks
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx