Help on diag needed : heartbeat_failed

Vincent Godin <vince.mlist@xxxxxxxxx> · Tue, 26 Nov 2019 15:10:47 +0100

We are seeing strange behavior on our Mimic 13.2.6 cluster. At any
time, and without any load, some OSDs become unreachable from only
some hosts. It lasts 10 minutes and then the problem vanishes.
It is not always the same OSDs or the same hosts. There is no network
failure on any of the hosts (since only some OSDs become unreachable)
and no disk freeze, as far as we can see in our Grafana dashboard. The
log messages are:
First message:
2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
2019-11-24 09:19:23.293436)
Last message:
2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
2019-11-24 09:30:13.736517)
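
To compare messages like these, here is a minimal Python sketch that
extracts the reporting OSD, the target OSD, and which of the two
heartbeat timestamps ("back" or "front") is older than the cutoff. The
pattern is derived only from the two samples above and assumes each
message has been joined onto a single line; the names HEARTBEAT_RE,
parse_ts and parse_heartbeat are hypothetical helpers, not part of Ceph.

```python
import re
from datetime import datetime

# Assumption: the layout matches the two sample messages above; other
# Ceph releases may format heartbeat_check lines differently.
HEARTBEAT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ +-?\d+ "
    r"osd\.(?P<reporter>\d+) \d+ "
    r"heartbeat_check: no reply from (?P<addr>\S+) osd\.(?P<target>\d+) "
    r"since back (?P<back>\S+ \S+) front (?P<front>\S+ \S+) "
    r"\(cutoff (?P<cutoff>\S+ \S+)\)"
)


def parse_ts(text):
    """Parse a timestamp such as '2019-11-24 09:19:43.292'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_heartbeat(line):
    """Return the fields of one heartbeat_check line, or None if no match."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    logged = parse_ts(m["ts"])
    cutoff = parse_ts(m["cutoff"])
    last = {side: parse_ts(m[side]) for side in ("back", "front")}
    return {
        "reporter": int(m["reporter"]),
        "target": int(m["target"]),
        "addr": m["addr"],
        # Sides whose last reply is older than the cutoff in the message.
        "stale": [side for side, t in last.items() if t < cutoff],
        # Seconds between the last reply on each side and the log time.
        "silence_s": {
            side: (logged - t).total_seconds() for side, t in last.items()
        },
    }


first = (
    "2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481 "
    "heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back "
    "2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff "
    "2019-11-24 09:19:23.293436)"
)
info = parse_heartbeat(first)
print(info["reporter"], info["target"], info["stale"])  # 596 394 ['back']
```

In both samples above, only the "back" timestamp is older than the
cutoff, while the "front" timestamp is recent. If "back" is the
cluster-network heartbeat address and "front" the public-network one
(my understanding of Ceph, not something confirmed in this thread),
that may be worth checking first.
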
During this time, three hosts were involved: host-18, host-20 and host-30:
host-30 is the only one that can't see OSDs 346, 356 and 352 on host-18
host-30 is the only one that can't see OSDs 387 and 394 on host-20
host-18 is the only one that can't see OSDs 583, 585, 591 and 597 on host-30
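
One way to summarize these observations is to tally how often each host
appears in a failing pair, on either side. The sketch below is only an
illustration using the three observations listed above; the names
unreachable, host_involvement and common_hosts are hypothetical.

```python
from collections import Counter

# (observing host, host of the unreachable OSDs, OSD ids), copied from
# the three observations above.
unreachable = [
    ("host-30", "host-18", [346, 356, 352]),
    ("host-30", "host-20", [387, 394]),
    ("host-18", "host-30", [583, 585, 591, 597]),
]


def host_involvement(reports):
    """Count how many failing pairs each host takes part in."""
    counts = Counter()
    for observer, target, _osds in reports:
        counts[observer] += 1
        counts[target] += 1
    return counts


def common_hosts(reports):
    """Return the hosts that appear in every failing pair."""
    pairs = [{observer, target} for observer, target, _osds in reports]
    return set.intersection(*pairs)


print(host_involvement(unreachable))
print(common_hosts(unreachable))  # {'host-30'}
```

With this data, host-30 takes part in all three failing pairs. That
does not prove host-30 is at fault, but it narrows down where to look.
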
We can't see any strange behavior on hosts 18, 20 and 30 in our node
exporter data during this time.
Any ideas or advice?