Hi, thank you. We replaced the service's domain in the text before reporting the issue; sorry, I should have mentioned that. admin.ceph.example.com was turned into admin.ceph.<removed> for privacy's sake.

Best Regards,

Martin Johansen

On Mon, Oct 24, 2022 at 2:53 PM Murilo Morais <murilo@xxxxxxxxxxxxxx> wrote:

> Hello Martin.
>
> Apparently cephadm is not able to resolve `admin.ceph.<removed>`. Check
> /etc/hosts or your DNS, and try pinging the addresses shown in `ceph orch
> host ls` to make sure they answer without packet loss.
>
> Also try the steps in the documentation:
>
> https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
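>
> For example, something along these lines (just a sketch; the example.com
> hostnames stand in for your real host names) run from the admin node will
> show whether every host in `ceph orch host ls` resolves and answers
> without packet loss:
>
>     # check name resolution and basic reachability for each cephadm host
>     for h in admin.ceph.example.com mon.ceph.example.com \
>              osd1.ceph.example.com osd2.ceph.example.com \
>              osd3.ceph.example.com osd4.ceph.example.com; do
>         # getent uses the same resolver order (files, then DNS) as the mgr host
>         getent hosts "$h" || echo "no /etc/hosts or DNS entry for $h"
>         # short ping; the last two lines show packet loss and rtt statistics
>         ping -c 3 -W 2 "$h" | tail -n 2
>     done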
>
> On Mon, Oct 24, 2022 at 09:23, Martin Johansen <martin@xxxxxxxxx> wrote:
>
>> Hi, I deployed a Ceph cluster a week ago and have started experiencing
>> warnings. Any pointers as to how to further debug or fix this? Here is
>> some info about the warnings:
>>
>> # ceph version
>> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
>>
>> # ceph status
>>   cluster:
>>     id:     <removed>
>>     health: HEALTH_WARN
>>             1 hosts fail cephadm check
>>
>>   services:
>>     mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
>>     mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
>>     osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
>>     rbd-mirror: 2 daemons active (2 hosts)
>>     rgw:        5 daemons active (5 hosts, 1 zones)
>>
>>   data:
>>     pools:   9 pools, 226 pgs
>>     objects: 736 objects, 1.4 GiB
>>     usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
>>     pgs:     226 active+clean
>>
>>   io:
>>     client: 36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
>> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
>> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
>> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
>> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> ... Continues to this day
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
>> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> ... Continues to this day
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
>> Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>> ... Continues to this day
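>>
>> (For reference, a minimal check, using the path from the error above, of
>> what is actually sitting at that location on the admin host, e.g. whether
>> a container bind mount has turned the keyring path into a directory:)
>>
>>     # run on admin.ceph.<removed>
>>     ls -ld /etc/ceph/ceph.client.admin.keyring
>>     # if this shows a directory instead of a regular file, it would have to
>>     # be removed recursively before cephadm can re-write the keyring file:
>>     # rm -rf /etc/ceph/ceph.client.admin.keyring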
>>
>> # ceph orch host ls
>> HOST                  ADDR        LABELS  STATUS
>> admin.ceph.<removed>  10.0.0.<R>  _admin
>> mon.ceph.<removed>    10.0.0.<R>  mon     Offline
>> osd1.ceph.<removed>   10.0.0.<R>  osd1
>> osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
>> osd3.ceph.<removed>   10.0.0.<R>  osd3
>> osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
>> 6 hosts in cluster
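>>
>> (For the hosts shown as Offline, cephadm's per-host check can also be run
>> by hand; a sketch, with one of the redacted host names as a placeholder:)
>>
>>     # from a node that has the admin keyring:
>>     ceph cephadm check-host osd2.ceph.<removed>
>>     # or locally on the affected host, with the standalone cephadm binary:
>>     cephadm check-host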
>>
>> Logs:
>>
>> 10/24/22 2:19:41 PM [INF] Cluster is now healthy
>> 10/24/22 2:19:41 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:18:33 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:15:24 PM [INF] Cluster is now healthy
>> 10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 2:13:10 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:11:55 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 2:10:00 PM [INF] overall HEALTH_OK
>> 10/24/22 2:08:47 PM [INF] Cluster is now healthy
>> 10/24/22 2:08:47 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 2:07:39 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 2:03:25 PM [INF] Cluster is now healthy
>> 10/24/22 2:03:25 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:02:15 PM [ERR] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
>> Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
>>     return f(*arg)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
>>     self._write_client_files(client_files, host)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
>>     self.mgr.ssh.check_execute_command(host, cmd)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
>>     return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
>>     return self.event_loop.get_result(coro)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
>>     return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
>>     return self.__get_result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
>>     raise self._exception
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
>>     raise OrchestratorError(msg)
>> orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed.
>> rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>> 10/24/22 2:02:15 PM [INF] Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
>> 10/24/22 2:01:06 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:00:00 PM [INF] overall HEALTH_OK
>> 10/24/22 1:57:54 PM [INF] Cluster is now healthy
>> 10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 1:56:38 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 1:56:38 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 1:52:18 PM [INF] Cluster is now healthy
>>
>> -------------------------------
>>
>> The host statuses flip between Offline and online sporadically, while the
>> block devices seem to be working fine all along. The cluster alternates
>> between HEALTH_OK and HEALTH_WARN.
>>
>> Best Regards,
>>
>> Martin Johansen
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx