Hello Martin. Apparently cephadm is not able to resolve `admin.ceph.<removed>`. Check /etc/hosts or your DNS, ping the addresses shown in `ceph orch host ls`, and verify there is no packet loss. Then work through the documentation: https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
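For example, something like this should show where it breaks (a sketch only; the /tmp paths are just an illustration, while `check-host` and the SSH-config commands are documented cephadm operations, run from a node with an admin keyring):

  getent hosts admin.ceph.<removed>               # does the name resolve the way the mgr sees it?
  ping -c 3 10.0.0.<R>                            # any packet loss to the host's address?
  ceph cephadm check-host admin.ceph.<removed>    # cephadm's own host check

  # reproduce the mgr's SSH connection manually, per the cephadm troubleshooting docs
  ceph cephadm get-ssh-config > /tmp/ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
  chmod 0600 /tmp/cephadm_key
  ssh -F /tmp/ssh_config -i /tmp/cephadm_key root@admin.ceph.<removed>

If the manual SSH session works every time but the health check still flaps, that points at intermittent network or DNS trouble rather than a broken key.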
On Mon, Oct 24, 2022 at 09:23, Martin Johansen <martin@xxxxxxxxx> wrote:

> Hi, I deployed a Ceph cluster a week ago and have started experiencing
> warnings. Any pointers as to how to further debug or fix it? Here is info
> about the warnings:
>
> # ceph version
> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
>
> # ceph status
>   cluster:
>     id:     <removed>
>     health: HEALTH_WARN
>             1 hosts fail cephadm check
>
>   services:
>     mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
>     mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
>     osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
>     rbd-mirror: 2 daemons active (2 hosts)
>     rgw:        5 daemons active (5 hosts, 1 zones)
>
>   data:
>     pools:   9 pools, 226 pgs
>     objects: 736 objects, 1.4 GiB
>     usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
>     pgs:     226 active+clean
>
>   io:
>     client: 36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> ... Continues to this day
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
> ... Continues to this day
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
> Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
> ... Continues to this day
>
> # ceph orch host ls
> HOST                  ADDR        LABELS  STATUS
> admin.ceph.<removed>  10.0.0.<R>  _admin
> mon.ceph.<removed>    10.0.0.<R>  mon     Offline
> osd1.ceph.<removed>   10.0.0.<R>  osd1
> osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
> osd3.ceph.<removed>   10.0.0.<R>  osd3
> osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
> 6 hosts in cluster
>
> Logs:
>
> 10/24/22 2:19:41 PM  [INF]  Cluster is now healthy
> 10/24/22 2:19:41 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:18:33 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:15:24 PM  [INF]  Cluster is now healthy
> 10/24/22 2:15:24 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:15:24 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 2:13:10 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:11:55 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 2:10:00 PM  [INF]  overall HEALTH_OK
> 10/24/22 2:08:47 PM  [INF]  Cluster is now healthy
> 10/24/22 2:08:47 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 2:07:39 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 2:03:25 PM  [INF]  Cluster is now healthy
> 10/24/22 2:03:25 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:02:15 PM  [ERR]  executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
>     return f(*arg)
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
>     self._write_client_files(client_files, host)
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
>     self.mgr.ssh.check_execute_command(host, cmd)
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
>     return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
>     return self.event_loop.get_result(coro)
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
>     return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
>     return self.__get_result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
>     raise self._exception
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
>     raise OrchestratorError(msg)
> orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>
> 10/24/22 2:02:15 PM  [INF]  Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
> 10/24/22 2:01:06 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:00:00 PM  [INF]  overall HEALTH_OK
> 10/24/22 1:57:54 PM  [INF]  Cluster is now healthy
> 10/24/22 1:57:54 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 1:57:54 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 1:56:38 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 1:56:38 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 1:52:18 PM  [INF]  Cluster is now healthy
>
> -------------------------------
>
> The host statuses flip between online and offline sporadically. The block
> devices seem to be working fine all along, but the cluster alternates
> between HEALTH_OK and HEALTH_WARN.
>
> Best Regards,
>
> Martin Johansen

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx