Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
I have a chunk of logs at debug 20/5; I'm not sure if I should have used just 20... It's pretty hard to catch: we basically have to spot the slow requests and get debug logging set within a 5-10 second window before the OSD stops responding to the admin socket...
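In case it helps anyone reproduce this, here is roughly the kind of watcher we could script to catch that window. A minimal sketch only, assuming Python 3 on the OSD host with the ceph CLI in PATH; the 1-second poll interval and matching on the "slow requests" / "requests are blocked" strings in `ceph health detail` are my assumptions, not something we actually ran:

#!/usr/bin/env python3
# Sketch: poll cluster health and bump a local OSD's debug level as soon as
# slow requests show up, before the daemon stops answering its admin socket.
# Assumes the ceph CLI is in PATH and this runs on the node hosting the OSD.
import subprocess
import time

OSD_ID = "1191"        # OSD we want verbose logs from (assumption)
POLL_INTERVAL = 1      # seconds between health checks (assumption)

def health_detail() -> str:
    # 'ceph health detail' lists blocked/slow request warnings when present.
    return subprocess.run(["ceph", "health", "detail"],
                          capture_output=True, text=True).stdout

def bump_debug(osd_id: str) -> None:
    # Raise debug_osd on the local daemon through its admin socket.
    subprocess.run(["ceph", "daemon", f"osd.{osd_id}",
                    "config", "set", "debug_osd", "20/5"], check=False)

if __name__ == "__main__":
    while True:
        detail = health_detail()
        if "slow requests" in detail or "requests are blocked" in detail:
            bump_debug(OSD_ID)
            break
        time.sleep(POLL_INTERVAL)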
Since networking is almost always the cause of flapping OSDs, we have tested the network quite extensively. It hasn't changed physically since before the Hammer upgrade, and it was performing well. We have run a large number of ping tests and have not seen a single dropped packet between OSD nodes or between OSD nodes and mons.
I don't see any packet errors or drops on the switches either.
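For completeness, the host side can be checked the same way. A minimal sketch, assuming Linux hosts that expose NIC counters under /sys/class/net (interface names will vary per node); it just dumps the error/drop counters so they can be compared across OSD nodes alongside the switch-side numbers:

#!/usr/bin/env python3
# Sketch: print per-NIC error/drop counters from /sys on an OSD node.
# Assumes Linux exposes statistics under /sys/class/net/<iface>/statistics.
import os

COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
            "rx_fifo_errors", "tx_fifo_errors")

def nic_stats(iface: str) -> dict:
    base = f"/sys/class/net/{iface}/statistics"
    stats = {}
    for name in COUNTERS:
        path = os.path.join(base, name)
        if os.path.exists(path):
            with open(path) as f:
                stats[name] = int(f.read().strip())
    return stats

if __name__ == "__main__":
    for iface in sorted(os.listdir("/sys/class/net")):
        if iface == "lo":
            continue
        print(iface, nic_stats(iface))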
Ideas?