First, I would make sure that peon7 and peon12 can actually pass the host
check (you can run "cephadm check-host" on the host directly if you have a
copy of the cephadm binary there). Then I'd try a mgr failover ("ceph mgr
fail") to clear out any in-memory host values cephadm might have and
restart the module. If it still reproduces after that, you might have to
set mgr/cephadm/log_to_cluster_level to debug, do another mgr failover,
wait until the module crashes, and see if "ceph log last 100 debug cephadm"
gives more info on where the crash occurred (it might include an actual
traceback). I've sketched the full command sequence after the quoted
message below.

On Thu, Apr 4, 2024 at 4:51 AM <arnoud@fuga.cloud> wrote:
> Hi,
>
> I've added some new nodes to our Ceph cluster. I only did the host add
> and had not added the OSDs yet.
> Due to a configuration error I had to reinstall some of them, but I
> forgot to remove the nodes from Ceph first. I did a "ceph orch host rm
> peon7 --offline --force" before re-adding them to the cluster.
>
> All the nodes are showing up in the host list (all the peons are the
> new ones):
>
> # ceph orch host ls
> HOST         ADDR         LABELS  STATUS
> ceph1        10.103.0.71
> ceph2        10.103.0.72
> ceph3        10.103.0.73
> ceph4        10.103.0.74
> compute1     10.103.0.11
> compute2     10.103.0.12
> compute3     10.103.0.13
> compute4     10.103.0.14
> controller1  10.103.0.8
> controller2  10.103.0.9
> controller3  10.103.0.10
> peon1        10.103.0.41
> peon2        10.103.0.42
> peon3        10.103.0.43
> peon4        10.103.0.44
> peon5        10.103.0.45
> peon6        10.103.0.46
> peon7        10.103.0.47
> peon8        10.103.0.48
> peon9        10.103.0.49
> peon10       10.103.0.50
> peon12       10.103.0.52
> peon13       10.103.0.53
> peon14       10.103.0.54
> peon15       10.103.0.55
> peon16       10.103.0.56
>
> But Ceph status still shows an error, which I can't seem to get rid of:
>
> [WRN] CEPHADM_HOST_CHECK_FAILED: 2 hosts fail cephadm check
>     host peon7 (10.103.0.47) failed check: Can't communicate with remote
> host `10.103.0.47`, possibly because python3 is not installed there or
> you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.47', 22)
>     host peon12 (10.103.0.52) failed check: Can't communicate with remote
> host `10.103.0.52`, possibly because python3 is not installed there or
> you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.52', 22)
> [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'peon7'
>     Module 'cephadm' has failed: 'peon7'
>
> From the mgr log:
>
> Apr 04 08:33:46 controller2 bash[4031857]: debug
> 2024-04-04T08:33:46.876+0000 7f2bb5710700 -1 mgr.server reply reply (5)
> Input/output error Module 'cephadm' has experienced an error and cannot
> handle commands: 'peon7'
>
> Any idea how to clear this error?
>
> # ceph --version
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
> (stable)
>
> Regards,
> Arnoud de Jonge.
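
Roughly, the sequence I have in mind looks like the below. This is only a
sketch: the host names are taken from your health output, and I'm assuming
your Octopus build accepts "ceph mgr fail" without an argument (if it
insists on a name, pass the active mgr's name from "ceph mgr dump").

On each affected host, with a copy of the cephadm binary present, verify
the host passes cephadm's checks (python3, NOPASSWD sudoers, etc.):

    # cephadm check-host

Then, from a node with the admin keyring, fail over the active mgr to
clear cephadm's in-memory host state and restart the module:

    # ceph mgr fail

If the module still crashes after that, raise cephadm's cluster log level,
fail over once more, and read the log after the next crash:

    # ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    # ceph mgr fail
    # ceph log last 100 debug cephadm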