Hi,

I deployed a Ceph cluster a week ago and have started experiencing health warnings. Any pointers on how to further debug or fix this? Here is some information about the warnings:

# ceph version
ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)

# ceph status
  cluster:
    id:     <removed>
    health: HEALTH_WARN
            1 hosts fail cephadm check

  services:
    mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
    mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
    osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw:        5 daemons active (5 hosts, 1 zones)

  data:
    pools:   9 pools, 226 pgs
    objects: 736 objects, 1.4 GiB
    usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
    pgs:     226 active+clean

  io:
    client:  36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr

# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
...
Continues to this day.
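To rule out plain connectivity problems, my understanding from the cephadm troubleshooting docs is that the mgr's SSH access can be reproduced by hand from the active mgr node, roughly like this (<host> and the /tmp paths are placeholders, and root is only my guess at the configured ssh user):

# run cephadm's own host check against a single host
ceph cephadm check-host <host>

# extract the ssh config and identity key that the cephadm mgr module uses
ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_identity_key
chmod 0600 /tmp/cephadm_identity_key

# try the same connection path the mgr uses
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_identity_key root@<host>

If that manual ssh is also flaky, I would suspect DNS or the network between the hosts rather than Ceph itself. Here is more from the logs: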
# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
...
Continues to this day.

# journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
...
Continues to this day.

# ceph orch host ls
HOST                  ADDR        LABELS  STATUS
admin.ceph.<removed>  10.0.0.<R>  _admin
mon.ceph.<removed>    10.0.0.<R>  mon     Offline
osd1.ceph.<removed>   10.0.0.<R>  osd1
osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
osd3.ceph.<removed>   10.0.0.<R>  osd3
osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
6 hosts in cluster
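Regarding the "Is a directory" errors: my guess is that /etc/ceph/ceph.client.admin.keyring somehow exists as a directory on the admin host (perhaps created by a container bind mount pointing at a path that was not a regular file), so cephadm's rm -f of the old keyring keeps failing. The check/workaround I have in mind for the admin host is sketched below; the .bak name is just my own choice:

# confirm whether the keyring path is a regular file or a directory
ls -ld /etc/ceph/ceph.client.admin.keyring

# if it really is a directory, move it aside so the next cephadm refresh
# can write the actual keyring file again
mv /etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring.bak

# list the client keyrings cephadm is supposed to maintain
ceph orch client-keyring ls

Does that sound sane, or is there a supported way to repair this? The log excerpt below shows the same rm failure happening during a cephadm refresh: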
Logs:

10/24/22 2:19:41 PM [INF] Cluster is now healthy
10/24/22 2:19:41 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:18:33 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:15:24 PM [INF] Cluster is now healthy
10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 2:13:10 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:11:55 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 2:10:00 PM [INF] overall HEALTH_OK
10/24/22 2:08:47 PM [INF] Cluster is now healthy
10/24/22 2:08:47 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 2:07:39 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 2:03:25 PM [INF] Cluster is now healthy
10/24/22 2:03:25 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 2:02:15 PM [ERR] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
    self.mgr.ssh.check_execute_command(host, cmd)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
    return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed.
rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
10/24/22 2:02:15 PM [INF] Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
10/24/22 2:01:06 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 2:00:00 PM [INF] overall HEALTH_OK
10/24/22 1:57:54 PM [INF] Cluster is now healthy
10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
10/24/22 1:56:38 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
10/24/22 1:56:38 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
10/24/22 1:52:18 PM [INF] Cluster is now healthy

-------------------------------

The host statuses flip between online and Offline sporadically, and the cluster alternates between HEALTH_OK and HEALTH_WARN. The block devices seem to be working fine all along.

Best Regards,

Martin Johansen