heartbeat_check failures

Hey guys,

Today, while adding new monitors to the cluster, I noticed that two OSD servers couldn't talk to each other. I'm not sure whether adding the monitors caused the issue or whether it was always there and adding the monitor just exposed it. After removing the new monitor the cluster went back to healthy, but the following errors are still being spewed.

On both servers, the OSD logs all show messages like:

2016-06-20 12:51:32.148682 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.89 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148699 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.90 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148708 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.91 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148717 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.92 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148724 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.93 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148763 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.95 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148770 7f6d24024700 -1 osd.102 17667 heartbeat_check: no reply from osd.96 ever on either front or back, first ping sent 2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)

On Server A these errors all mention Server B's OSDs, and on Server B they all mention Server A's OSDs. None of the other 10 servers have this issue.

I confirmed using telnet that the OSD ports are reachable. 
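
That said, telnet connects from whatever source address the routing table picks, which isn't necessarily the interface the OSD uses. Something like the Python sketch below could test the exact source-to-destination path; the addresses and port are placeholders, not values from our cluster:

import socket

SRC_IP = "192.168.1.11"   # placeholder: Server A's cluster-network address
DST_IP = "10.0.0.12"      # placeholder: Server B's public-network address
DST_PORT = 6806           # placeholder OSD port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
s.bind((SRC_IP, 0))       # force the source address instead of letting routing choose
try:
    s.connect((DST_IP, DST_PORT))
    print("reachable from", SRC_IP)
except OSError as e:
    print("connection failed:", e)
finally:
    s.close()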

I'm using separate cluster and public networks, and one thing I did notice is this error:

0 -- private-ip-server-a:0/15329 >> public-ip-server-b:6806/6465 pipe(0x7f9910761000 sd=64 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9910f7e100).fault

This seems to imply that Server A is trying to connect from its cluster IP to Server B's public IP. Could this be the root cause, and if so, how can I prevent it from happening?
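
For context, the relevant part of our ceph.conf looks roughly like this (the subnets shown are placeholders, not our real ones):

[global]
public network  = 10.0.0.0/24      # client- and monitor-facing subnet
cluster network = 192.168.1.0/24   # OSD replication and heartbeat traffic

If it helps, "ceph osd dump" lists the public, cluster, and heartbeat addresses each OSD registered with, so that should show whether the peers of osd.102 advertised the addresses I'd expect.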

Thanks,

Peter


