First, I would make sure that peon7 and peon12 can actually pass the host
check (you can run "cephadm check-host" on the host directly if you have a
copy of the cephadm binary there). Then I'd try a mgr failover ("ceph mgr
fail") to clear out any in-memory host values cephadm might have and
restart the module. If it still reproduces after that, you might have to
set mgr/cephadm/log_to_cluster_level to debug, do another mgr failover,
wait until the module crashes, and see if "ceph log last 100 debug cephadm"
gives more info on where the crash occurred (it might include an actual
traceback). I've sketched the full command sequence after the quoted
message below.

On Thu, Apr 4, 2024 at 4:51 AM <arnoud@fuga.cloud> wrote:
> Hi,
>
> I've added some new nodes to our Ceph cluster. I only did the host add
> and had not added the OSDs yet.
> Due to a configuration error I had to reinstall some of them, but I
> forgot to remove the nodes from Ceph first. I did a "ceph orch host rm
> peon7 --offline --force" before re-adding them to the cluster.
>
> All the nodes are showing up in the host list (all the peons are the
> new ones):
>
> # ceph orch host ls
> HOST         ADDR         LABELS  STATUS
> ceph1        10.103.0.71
> ceph2        10.103.0.72
> ceph3        10.103.0.73
> ceph4        10.103.0.74
> compute1     10.103.0.11
> compute2     10.103.0.12
> compute3     10.103.0.13
> compute4     10.103.0.14
> controller1  10.103.0.8
> controller2  10.103.0.9
> controller3  10.103.0.10
> peon1        10.103.0.41
> peon2        10.103.0.42
> peon3        10.103.0.43
> peon4        10.103.0.44
> peon5        10.103.0.45
> peon6        10.103.0.46
> peon7        10.103.0.47
> peon8        10.103.0.48
> peon9        10.103.0.49
> peon10       10.103.0.50
> peon12       10.103.0.52
> peon13       10.103.0.53
> peon14       10.103.0.54
> peon15       10.103.0.55
> peon16       10.103.0.56
>
> But Ceph status still shows an error, which I can't seem to get rid of:
>
> [WRN] CEPHADM_HOST_CHECK_FAILED: 2 hosts fail cephadm check
>     host peon7 (10.103.0.47) failed check: Can't communicate with remote
> host `10.103.0.47`, possibly because python3 is not installed there or
> you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.47', 22)
>     host peon12 (10.103.0.52) failed check: Can't communicate with remote
> host `10.103.0.52`, possibly because python3 is not installed there or
> you are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.52', 22)
> [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'peon7'
>     Module 'cephadm' has failed: 'peon7'
>
> From the mgr log:
>
> Apr 04 08:33:46 controller2 bash[4031857]: debug
> 2024-04-04T08:33:46.876+0000 7f2bb5710700 -1 mgr.server reply reply (5)
> Input/output error Module 'cephadm' has experienced an error and cannot
> handle commands: 'peon7'
>
> Any idea how to clear this error?
>
> # ceph --version
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
> (stable)
>
> Regards,
> Arnoud de Jonge.
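
Roughly, the sequence I have in mind looks like the below. This is only a
sketch: the host names are taken from your health output, and I'm assuming
your Octopus build accepts "ceph mgr fail" without an argument (if it
insists on a name, pass the active mgr's name from "ceph mgr dump").

On each affected host, with a copy of the cephadm binary present, verify
the host passes cephadm's checks (python3, NOPASSWD sudoers, etc.):

    # cephadm check-host

Then, from a node with the admin keyring, fail over the active mgr to
clear cephadm's in-memory host state and restart the module:

    # ceph mgr fail

If the module still crashes after that, raise cephadm's cluster log level,
fail over once more, and read the log after the next crash:

    # ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    # ceph mgr fail
    # ceph log last 100 debug cephadm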