Hello Martin. Apparently cephadm is not able to resolve `admin.ceph.<removed>`. Check /etc/hosts or your DNS, ping the addresses shown in `ceph orch host ls`, and verify there is no packet loss. Then work through the documentation: https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
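For example, something like this should show where it breaks (a sketch only; the /tmp paths are just an illustration, while `check-host` and the SSH-config commands are documented cephadm operations, run from a node with an admin keyring):

  getent hosts admin.ceph.<removed>               # does the name resolve the way the mgr sees it?
  ping -c 3 10.0.0.<R>                            # any packet loss to the host's address?
  ceph cephadm check-host admin.ceph.<removed>    # cephadm's own host check

  # reproduce the mgr's SSH connection manually, per the cephadm troubleshooting docs
  ceph cephadm get-ssh-config > /tmp/ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
  chmod 0600 /tmp/cephadm_key
  ssh -F /tmp/ssh_config -i /tmp/cephadm_key root@admin.ceph.<removed>

If the manual SSH session works every time but the health check still flaps, that points at intermittent network or DNS trouble rather than a broken key.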
On Mon, Oct 24, 2022 at 09:23, Martin Johansen <martin@xxxxxxxxx> wrote:

> Hi, I deployed a Ceph cluster a week ago and have started experiencing
> warnings. Any pointers as to how to further debug or fix it? Here is info
> about the warnings:
>
> # ceph version
> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
>
> # ceph status
>   cluster:
>     id:     <removed>
>     health: HEALTH_WARN
>             1 hosts fail cephadm check
>
>   services:
>     mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
>     mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
>     osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
>     rbd-mirror: 2 daemons active (2 hosts)
>     rgw:        5 daemons active (5 hosts, 1 zones)
>
>   data:
>     pools:   9 pools, 226 pgs
>     objects: 736 objects, 1.4 GiB
>     usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
>     pgs:     226 active+clean
>
>   io:
>     client: 36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
> ... Continues to this day
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
> Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
> ... Continues to this day
>
> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
> Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
> ... Continues to this day
>
> # ceph orch host ls
> HOST                  ADDR        LABELS  STATUS
> admin.ceph.<removed>  10.0.0.<R>  _admin
> mon.ceph.<removed>    10.0.0.<R>  mon     Offline
> osd1.ceph.<removed>   10.0.0.<R>  osd1
> osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
> osd3.ceph.<removed>   10.0.0.<R>  osd3
> osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
> 6 hosts in cluster
>
> Logs:
>
> 10/24/22 2:19:41 PM  [INF]  Cluster is now healthy
> 10/24/22 2:19:41 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:18:33 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:15:24 PM  [INF]  Cluster is now healthy
> 10/24/22 2:15:24 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:15:24 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 2:13:10 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:11:55 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 2:10:00 PM  [INF]  overall HEALTH_OK
> 10/24/22 2:08:47 PM  [INF]  Cluster is now healthy
> 10/24/22 2:08:47 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 2:07:39 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 2:03:25 PM  [INF]  Cluster is now healthy
> 10/24/22 2:03:25 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 2:02:15 PM  [ERR]  executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
>     return f(*arg)
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
>     self._write_client_files(client_files, host)
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
>     self.mgr.ssh.check_execute_command(host, cmd)
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
>     return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
>     return self.event_loop.get_result(coro)
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
>     return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
>     return self.__get_result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
>     raise self._exception
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
>     raise OrchestratorError(msg)
> orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>
> 10/24/22 2:02:15 PM  [INF]  Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
> 10/24/22 2:01:06 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 2:00:00 PM  [INF]  overall HEALTH_OK
> 10/24/22 1:57:54 PM  [INF]  Cluster is now healthy
> 10/24/22 1:57:54 PM  [INF]  Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
> 10/24/22 1:57:54 PM  [INF]  Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
> 10/24/22 1:56:38 PM  [WRN]  Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
> 10/24/22 1:56:38 PM  [WRN]  Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
> 10/24/22 1:52:18 PM  [INF]  Cluster is now healthy
>
> -------------------------------
>
> The host statuses flip between online and offline sporadically. The block
> devices seem to be working fine all along, but the cluster alternates
> between HEALTH_OK and HEALTH_WARN.
>
> Best Regards,
>
> Martin Johansen

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx