I am trying to understand why some OSDs (6 out of 21) went down in my cluster while running a CBT radosbench benchmark. From the logs below, is this a networking problem between the systems, or is it some kind of FileStore problem?

Looking at one crashed OSD's log, I see the following crash error:

    2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed out after 600 seconds.
    ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)

Just before that I see things like:

    2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no reply from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09 21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)

and also:

    2016-09-09 19:03:45.788327 7efc53905700 0 -- 10.0.1.2:6826/58682 >> 10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0 c=0x7efc8bef5b00).connect got RESETSESSION

as well as many warnings about slow requests.

All the other OSDs that died seem to have died with:

    2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2157e65700 time 2016-09-09 19:11:01.660671
    common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

-- Tom Deneau, AMD
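
P.S. In case it helps narrow things down: I have been assuming the 600-second figure above comes from filestore_commit_timeout, and that the suicide asserts come from the per-thread *_suicide_timeout settings checked by HeartbeatMap. I am guessing at the option names for this release, so please correct me if these are not the relevant knobs. My plan is to read them off a running OSD's admin socket with something like:

    # FileStore commit timeout (I believe this defaults to 600 seconds)
    ceph daemon osd.12 config get filestore_commit_timeout

    # list the suicide-timeout settings that can trigger the HeartbeatMap assert
    ceph daemon osd.12 config show | grep suicide_timeout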