Need advice on how to proceed with [WRN] CEPHADM_HOST_CHECK_FAILED

"Kalin Nikolov" <knikolov@xxxxxxxxxxx> · Fri, 13 May 2022 11:10:03 +0300

Hello,

For about a year and a half I have been supporting a Ceph cluster for my
company (v15.2.3 on CentOS 8, which is already out of support). It is used
only for S3, and until recently there were no serious problems that I could
not deal with myself. However, I cannot find a solution on my own for a
problem that appeared about two months ago.

After a firewall was added for a short time (about 15-20 minutes), each of
the hosts was isolated from the monitoring servers, which led to the
following error message:

ceph> health detail
HEALTH_ERR 8 hosts fail cephadm check; failed to probe daemons or devices; Module 'cephadm' has failed: cannot send (already closed?)
[WRN] CEPHADM_HOST_CHECK_FAILED: 8 hosts fail cephadm check
    host mon4 failed check: cannot send (already closed?)
    host mon5 failed check: cannot send (already closed?)
    host rgw1 failed check: cannot send (already closed?)
    host srv1 failed check: cannot send (already closed?)
    host srv2 failed check: cannot send (already closed?)
    host srv3 failed check: cannot send (already closed?)
    host srv4 failed check: cannot send (already closed?)
    host srv5 failed check: cannot send (already closed?)

[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host mon4 scrape failed: cannot send (already closed?)
    host mon4 ceph-volume inventory failed: cannot send (already closed?)
    host mon5 scrape failed: cannot send (already closed?)
    host mon5 ceph-volume inventory failed: cannot send (already closed?)
    host rgw1 scrape failed: cannot send (already closed?)
    host rgw1 ceph-volume inventory failed: cannot send (already closed?)
    host srv1 scrape failed: cannot send (already closed?)
    host srv1 ceph-volume inventory failed: cannot send (already closed?)
    host srv2 scrape failed: cannot send (already closed?)
    host srv2 ceph-volume inventory failed: cannot send (already closed?)
    host srv3 scrape failed: cannot send (already closed?)
    host srv3 ceph-volume inventory failed: cannot send (already closed?)
    host srv4 scrape failed: cannot send (already closed?)
    host srv4 ceph-volume inventory failed: cannot send (already closed?)
    host srv5 scrape failed: cannot send (already closed?)
    host srv5 ceph-volume inventory failed: cannot send (already closed?)
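
In case it helps with triage, the per-host lines above can be grouped by host with a short script. This is only a sketch: the function name failures_by_host and the regular expression are my own, the sample is an excerpt of the output above (two of the eight hosts), and the line format is assumed from this output rather than taken from the Ceph documentation.

```python
import re
from collections import defaultdict

# Excerpt of the `ceph health detail` output above (two of the eight hosts).
SAMPLE = """\
[WRN] CEPHADM_HOST_CHECK_FAILED: 8 hosts fail cephadm check
    host mon4 failed check: cannot send (already closed?)
    host srv5 failed check: cannot send (already closed?)
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host mon4 scrape failed: cannot send (already closed?)
    host mon4 ceph-volume inventory failed: cannot send (already closed?)
    host srv5 scrape failed: cannot send (already closed?)
    host srv5 ceph-volume inventory failed: cannot send (already closed?)
"""

# Assumed format of an indented detail line: "host <name> <what>: <message>".
HOST_LINE = re.compile(r"^\s+host\s+(\S+)\s+(.+?):\s*(.*)$")


def failures_by_host(text):
    """Return {host: [(what, message), ...]} for every per-host detail line."""
    result = defaultdict(list)
    for line in text.splitlines():
        match = HOST_LINE.match(line)
        if match:
            host, what, message = match.groups()
            result[host].append((what, message))
    return dict(result)


if __name__ == "__main__":
    # Print each host with the list of operations that failed on it.
    for host, items in sorted(failures_by_host(SAMPLE).items()):
        print(host, [what for what, _ in items])
```

Feeding it the full output instead of the excerpt should list every affected host once, together with the operations that failed on it.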

Despite these errors, the cluster is working and the data is currently
accessible as normal. I have not noticed any services going down.
In the meantime we had to add a new server, srv6. It was added to the
cluster without problems and works as expected, but immediately afterwards
another error appeared:

[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: cannot send (already closed?)
    Module 'cephadm' has failed: cannot send (already closed?)

This put the cluster into the HEALTH_ERR state. The hosts are alive and connected.

# ceph orch host ls
HOST       ADDR             LABELS  STATUS
adm        adm              mgr
mon1       mon1             mgr
mon2       mon2
mon3       mon3             mgr
mon4       mon4
mon5       mon5
rgw1       rgw1
rgw2-real  rgw2-real
srv1       srv1
srv2       srv2
srv3       srv3
srv4       srv4
srv5       192.168.236.215
srv6       192.168.236.216
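
To make it easier to see which hosts in this list are not named in the CEPHADM_HOST_CHECK_FAILED warning, the two lists can be compared with a small script. Again this is only a sketch: host_names and not_reported are hypothetical helper names, and the data is copied by hand from the output in this message.

```python
# Output of `ceph orch host ls` as shown above (columns are whitespace-separated).
ORCH_HOST_LS = """\
HOST ADDR LABELS STATUS
adm adm mgr
mon1 mon1 mgr
mon2 mon2
mon3 mon3 mgr
mon4 mon4
mon5 mon5
rgw1 rgw1
rgw2-real rgw2-real
srv1 srv1
srv2 srv2
srv3 srv3
srv4 srv4
srv5 192.168.236.215
srv6 192.168.236.216
"""

# Hosts named in the CEPHADM_HOST_CHECK_FAILED warning above.
FAILING = {"mon4", "mon5", "rgw1", "srv1", "srv2", "srv3", "srv4", "srv5"}


def host_names(table):
    """Return the first column of every non-empty row after the header."""
    rows = [line for line in table.splitlines() if line.strip()]
    return [row.split()[0] for row in rows[1:]]


def not_reported(table, failing):
    """Hosts known to the orchestrator but absent from the failing set."""
    return [name for name in host_names(table) if name not in failing]


if __name__ == "__main__":
    print(not_reported(ORCH_HOST_LS, FAILING))
```

For the output above this leaves adm, mon1, mon2, mon3, rgw2-real and srv6.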

Any advice is welcome. I have read everything I could find about these
errors in the various groups, but none of the proposed solutions helped.

Regards,
Kalin
