Hi,

I deployed a Ceph cluster a week ago and have started experiencing health warnings. Any pointers on how to further debug or fix this? Here is some information about the warnings:

# ceph version
ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)

# ceph status
  cluster:
    id:     <removed>
    health: HEALTH_WARN
            1 hosts fail cephadm check

  services:
    mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
    mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
    osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw:        5 daemons active (5 hosts, 1 zones)

  data:
    pools:   9 pools, 226 pgs
    objects: 736 objects, 1.4 GiB
    usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
    pgs:     226 active+clean

  io:
    client:  36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr

# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
...
Continues to this day.
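To rule out plain connectivity problems, my understanding from the cephadm troubleshooting docs is that the mgr's SSH access can be reproduced by hand from the active mgr node, roughly like this (<host> and the /tmp paths are placeholders, and root is only my guess at the configured ssh user):

# run cephadm's own host check against a single host
ceph cephadm check-host <host>

# extract the ssh config and identity key that the cephadm mgr module uses
ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_identity_key
chmod 0600 /tmp/cephadm_identity_key

# try the same connection path the mgr uses
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_identity_key root@<host>

If that manual ssh is also flaky, I would suspect DNS or the network between the hosts rather than Ceph itself. Here is more from the logs: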
# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
...
Continues to this day.

# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
...
Continues to this day.

# ceph orch host ls
HOST                  ADDR        LABELS  STATUS
admin.ceph.<removed>  10.0.0.<R>  _admin
mon.ceph.<removed>    10.0.0.<R>  mon     Offline
osd1.ceph.<removed>   10.0.0.<R>  osd1
osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
osd3.ceph.<removed>   10.0.0.<R>  osd3
osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
6 hosts in cluster
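Regarding the "Is a directory" errors: my guess is that /etc/ceph/ceph.client.admin.keyring somehow exists as a directory on the admin host (perhaps created by a container bind mount pointing at a path that was not a regular file), so cephadm's rm -f of the old keyring keeps failing. The check/workaround I have in mind for the admin host is sketched below; the .bak name is just my own choice:

# confirm whether the keyring path is a regular file or a directory
ls -ld /etc/ceph/ceph.client.admin.keyring

# if it really is a directory, move it aside so the next cephadm refresh
# can write the actual keyring file again
mv /etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring.bak

# list the client keyrings cephadm is supposed to maintain
ceph orch client-keyring ls

Does that sound sane, or is there a supported way to repair this? The log excerpt below shows the same rm failure happening during a cephadm refresh: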
Logs:

10/24/22 2:19:41 PM [INF] Cluster is now healthy
10/24/22 2:19:41 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:18:33 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:15:24 PM [INF] Cluster is now healthy
10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 2:13:10 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:11:55 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 2:10:00 PM [INF] overall HEALTH_OK
10/24/22 2:08:47 PM [INF] Cluster is now healthy
10/24/22 2:08:47 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 2:07:39 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 2:03:25 PM [INF] Cluster is now healthy
10/24/22 2:03:25 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:02:15 PM [ERR] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
    self.mgr.ssh.check_execute_command(host, cmd)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
    return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed.
rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
10/24/22 2:02:15 PM [INF] Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
10/24/22 2:01:06 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:00:00 PM [INF] overall HEALTH_OK
10/24/22 1:57:54 PM [INF] Cluster is now healthy
10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 1:56:38 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 1:56:38 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 1:52:18 PM [INF] Cluster is now healthy

-------------------------------

The host statuses flip between online and Offline sporadically, and the cluster alternates between HEALTH_OK and HEALTH_WARN. The block devices seem to be working fine all along.

Best Regards,

Martin Johansen