PGs unknown (osd down) after conversion to cephadm

"Dr. Marco Savoca" <quaternionma@xxxxxxxxx> · Thu, 9 Apr 2020 13:23:32 +0200

Hi all,

Last week I successfully upgraded my cluster to Octopus and converted it to cephadm. The conversion process (following the docs) went well, and the cluster was running in active+clean status.

But after a reboot, all OSDs went down within a couple of minutes, and all (100%) of the PGs went into the unknown state. The orchestrator isn't reachable in this state (ceph orch status never returns).

A manual restart of the OSD daemons resolved the problem, and the cluster is now active+clean again.

This behavior is reproducible.

The “ceph log last cephadm” command spits out (redacted):

2020-03-30T23:07:06.881061+0000 mgr.ceph2 (mgr.1854484) 42 : cephadm [INF] Generating ssh key...
2020-03-30T23:22:00.250422+0000 mgr.ceph2 (mgr.1854484) 492 : cephadm [ERR] _Promise failed
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 444, in do_work
    res = self._on_complete_(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 512, in <lambda>
    return cls(_on_complete_=lambda x: f(*x), value=args, name=name, **c_kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1648, in add_host
    spec.hostname, spec.addr, err))
orchestrator._interface.OrchestratorError: New host ceph1 (ceph1) failed check: ['INFO:cephadm:podman|docker (/usr/bin/docker) is present', 'INFO:cephadm:systemctl is present', 'INFO:cephadm:lvcreate is present', 'INFO:cephadm:Unit systemd-timesyncd.service is enabled and running', 'ERROR: hostname "ceph1.domain.de" does not match expected hostname "ceph1"']
2020-03-30T23:22:27.267344+0000 mgr.ceph2 (mgr.1854484) 508 : cephadm [INF] Added host ceph1.domain.de
2020-03-30T23:22:36.078462+0000 mgr.ceph2 (mgr.1854484) 515 : cephadm [INF] Added host ceph2.domain.de
2020-03-30T23:22:55.200280+0000 mgr.ceph2 (mgr.1854484) 527 : cephadm [INF] Added host ceph3.domain.de
2020-03-30T23:23:17.491596+0000 mgr.ceph2 (mgr.1854484) 540 : cephadm [ERR] _Promise failed
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 444, in do_work
    res = self._on_complete_(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 512, in <lambda>
    return cls(_on_complete_=lambda x: f(*x), value=args, name=name, **c_kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1648, in add_host
    spec.hostname, spec.addr, err))
orchestrator._interface.OrchestratorError: New host ceph1 (10.10.0.10) failed check: ['INFO:cephadm:podman|docker (/usr/bin/docker) is present', 'INFO:cephadm:systemctl is present', 'INFO:cephadm:lvcreate is present', 'INFO:cephadm:Unit systemd-timesyncd.service is enabled and running', 'ERROR: hostname "ceph1.domain.de" does not match expected hostname "ceph1"']
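
Both tracebacks end in the same host check failure: the node reports itself as "ceph1.domain.de" while the orchestrator expected "ceph1". Below is a minimal Python sketch for pulling that pair out of a log line and for testing a candidate name against what the local node reports. The helper names are hypothetical, and comparing against socket.gethostname() is only my assumption about how the check works, not something confirmed from the cephadm source.

```python
import re
import socket

# Matches the ERROR text that appears in the cephadm log above.
MISMATCH = re.compile(
    r'hostname "([^"]+)" does not match expected hostname "([^"]+)"'
)


def find_hostname_mismatch(log_line):
    """Return (reported, expected) if the line holds a mismatch error, else None."""
    m = MISMATCH.search(log_line)
    return (m.group(1), m.group(2)) if m else None


def matches_local_hostname(expected):
    """Assumption: the host check compares against socket.gethostname(),
    case-insensitively. This is a guess, not verified cephadm behaviour."""
    return socket.gethostname().lower() == expected.lower()


line = ("New host ceph1 (ceph1) failed check: "
        "['ERROR: hostname \"ceph1.domain.de\" does not match "
        "expected hostname \"ceph1\"']")

print(find_hostname_mismatch(line))  # ('ceph1.domain.de', 'ceph1')
```

Running matches_local_hostname("ceph1") on the node itself would show whether the short name or the FQDN is the one to use when adding the host.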

Could this be a problem with the ssh key?

Thanks for the help and happy Easter.

Marco Savoca

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx