Hi, thank you. We replaced the service's domain in the text before reporting the issue; sorry, I should have mentioned that. admin.ceph.example.com was turned into admin.ceph.<removed> for privacy's sake.

Best Regards,

Martin Johansen

On Mon, Oct 24, 2022 at 2:53 PM Murilo Morais <murilo@xxxxxxxxxxxxxx> wrote:

> Hello Martin.
>
> Apparently cephadm is not able to resolve `admin.ceph.<removed>`. Check
> /etc/hosts or your DNS, and try pinging the addresses shown in `ceph orch
> host ls` to make sure they answer without packet loss.
>
> Also try the steps in the documentation:
>
> https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
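>
> For example, something along these lines (just a sketch; the example.com
> hostnames stand in for your real host names) run from the admin node will
> show whether every host in `ceph orch host ls` resolves and answers
> without packet loss:
>
>     # check name resolution and basic reachability for each cephadm host
>     for h in admin.ceph.example.com mon.ceph.example.com \
>              osd1.ceph.example.com osd2.ceph.example.com \
>              osd3.ceph.example.com osd4.ceph.example.com; do
>         # getent uses the same resolver order (files, then DNS) as the mgr host
>         getent hosts "$h" || echo "no /etc/hosts or DNS entry for $h"
>         # short ping; the last two lines show packet loss and rtt statistics
>         ping -c 3 -W 2 "$h" | tail -n 2
>     done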
>
> On Mon, Oct 24, 2022 at 09:23, Martin Johansen <martin@xxxxxxxxx> wrote:
>
>> Hi, I deployed a Ceph cluster a week ago and have started experiencing
>> warnings. Any pointers as to how to further debug or fix this? Here is
>> some info about the warnings:
>>
>> # ceph version
>> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
>>
>> # ceph status
>>   cluster:
>>     id:     <removed>
>>     health: HEALTH_WARN
>>             1 hosts fail cephadm check
>>
>>   services:
>>     mon:        5 daemons, quorum admin.ceph.<removed>,mon,osd1,osd2,osd3 (age 79m)
>>     mgr:        admin.ceph.<removed>.wvhmky(active, since 2h), standbys: mon.jzfopv
>>     osd:        4 osds: 4 up (since 3h), 4 in (since 3h)
>>     rbd-mirror: 2 daemons active (2 hosts)
>>     rgw:        5 daemons active (5 hosts, 1 zones)
>>
>>   data:
>>     pools:   9 pools, 226 pgs
>>     objects: 736 objects, 1.4 GiB
>>     usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
>>     pgs:     226 active+clean
>>
>>   io:
>>     client: 36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "cephadm ERROR"
>> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.163+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
>> Oct 19 13:45:08 admin.ceph.<removed> bash[4445]: debug 2022-10-19T11:45:08.167+0000 7fa7bb2d3700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 19 21:16:37 admin.ceph.<removed> bash[4445]: debug 2022-10-19T19:16:37.504+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.035+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf: Unable to reach remote host admin.ceph.<removed>.
>> Oct 21 14:00:52 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:00:52.047+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 14:25:04 admin.ceph.<removed> bash[4445]: debug 2022-10-21T12:25:03.994+0000 7fa7bc2d5700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 21 16:03:48 admin.ceph.<removed> bash[4445]: debug 2022-10-21T14:03:48.320+0000 7fa7ba2d1700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.051+0000 7fa7afabc700 0 [cephadm ERROR cephadm.ssh] Unable to write admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring: Unable to reach remote host admin.ceph.<removed>.
>> Oct 22 06:26:17 admin.ceph.<removed> bash[4445]: debug 2022-10-22T04:26:17.055+0000 7fa7b8ace700 0 [cephadm ERROR cephadm.utils] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>'],)) failed.
>> ... Continues to this day
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "auth: could not find secret_id"
>> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.789+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:48 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:48.989+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:49 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:49.393+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> Oct 19 16:52:50 admin.ceph.<removed> bash[4445]: debug 2022-10-19T14:52:50.197+0000 7fa7f3120700 0 auth: could not find secret_id=123
>> ... Continues to this day
>>
>> # journalctl -u ceph-<removed>@mgr.admin.ceph.<removed>.wvhmky | grep "Is a directory"
>> Oct 24 11:12:53 admin.ceph.<removed> bash[4445]: orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>> ... Continues to this day
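>>
>> (For reference, a minimal check, using the path from the error above, of
>> what is actually sitting at that location on the admin host, e.g. whether
>> a container bind mount has turned the keyring path into a directory:)
>>
>>     # run on admin.ceph.<removed>
>>     ls -ld /etc/ceph/ceph.client.admin.keyring
>>     # if this shows a directory instead of a regular file, it would have to
>>     # be removed recursively before cephadm can re-write the keyring file:
>>     # rm -rf /etc/ceph/ceph.client.admin.keyring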
>>
>> # ceph orch host ls
>> HOST                  ADDR        LABELS  STATUS
>> admin.ceph.<removed>  10.0.0.<R>  _admin
>> mon.ceph.<removed>    10.0.0.<R>  mon     Offline
>> osd1.ceph.<removed>   10.0.0.<R>  osd1
>> osd2.ceph.<removed>   10.0.0.<R>  osd2    Offline
>> osd3.ceph.<removed>   10.0.0.<R>  osd3
>> osd4.ceph.<removed>   10.0.0.<R>  osd4    Offline
>> 6 hosts in cluster
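>>
>> (For the hosts shown as Offline, cephadm's per-host check can also be run
>> by hand; a sketch, with one of the redacted host names as a placeholder:)
>>
>>     # from a node that has the admin keyring:
>>     ceph cephadm check-host osd2.ceph.<removed>
>>     # or locally on the affected host, with the standalone cephadm binary:
>>     cephadm check-host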
>>
>> Logs:
>>
>> 10/24/22 2:19:41 PM [INF] Cluster is now healthy
>> 10/24/22 2:19:41 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:18:33 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:15:24 PM [INF] Cluster is now healthy
>> 10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:15:24 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 2:13:10 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:11:55 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 2:10:00 PM [INF] overall HEALTH_OK
>> 10/24/22 2:08:47 PM [INF] Cluster is now healthy
>> 10/24/22 2:08:47 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 2:07:39 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 2:03:25 PM [INF] Cluster is now healthy
>> 10/24/22 2:03:25 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 2:02:15 PM [ERR] executing refresh((['admin.ceph.<removed>', 'mon.ceph.<removed>', 'osd1.ceph.<removed>', 'osd2.ceph.<removed>', 'osd3.ceph.<removed>', 'osd4.ceph.<removed>'],)) failed.
>> Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/cephadm/utils.py", line 78, in do_work
>>     return f(*arg)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 271, in refresh
>>     self._write_client_files(client_files, host)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_client_files
>>     self.mgr.ssh.check_execute_command(host, cmd)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 196, in check_execute_command
>>     return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr))
>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 597, in wait_async
>>     return self.event_loop.get_result(coro)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
>>     return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
>>     return self.__get_result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
>>     raise self._exception
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 187, in _check_execute_command
>>     raise OrchestratorError(msg)
>> orchestrator._interface.OrchestratorError: Command ['rm', '-f', '/etc/ceph/ceph.client.admin.keyring'] failed.
>> rm: cannot remove '/etc/ceph/ceph.client.admin.keyring': Is a directory
>> 10/24/22 2:02:15 PM [INF] Removing admin.ceph.<removed>:/etc/ceph/ceph.client.admin.keyring
>> 10/24/22 2:01:06 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 2:00:00 PM [INF] overall HEALTH_OK
>> 10/24/22 1:57:54 PM [INF] Cluster is now healthy
>> 10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
>> 10/24/22 1:57:54 PM [INF] Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm check)
>> 10/24/22 1:56:38 PM [WRN] Health check failed: failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
>> 10/24/22 1:56:38 PM [WRN] Health check failed: 1 hosts fail cephadm check (CEPHADM_HOST_CHECK_FAILED)
>> 10/24/22 1:52:18 PM [INF] Cluster is now healthy
>>
>> -------------------------------
>>
>> The host statuses flip between Offline and online sporadically, while the
>> block devices seem to be working fine all along. The cluster alternates
>> between HEALTH_OK and HEALTH_WARN.
>>
>> Best Regards,
>>
>> Martin Johansen
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx