Hi Christian,

Do you have a core file? Can you dump the thread stack traces so we can
see if it got hung up on a syscall or somewhere internally (thread apply
all bt)?

Thanks-
sage


On Fri, 11 Nov 2011, Christian Brunner wrote:

> I have upgraded to 0.38 today. After a few hours two of my OSDs
> crashed with "hit suicide timeout". After a restart, they are up
> again.
>
> I've seen this in prior versions, so I don't think it's related to the
> upgrade. I just wanted to report that it's still there.
>
> Here is what I've found in our syslog:
>
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had suicide timed
> out after 180
> Nov 11 17:06:29 os03 osd.015[3641]: common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)', in thread
> '7f0816ee2700'#012common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit
> suicide timeout")
> Nov 11 17:06:29 os03 osd.015[3641]: ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
> [0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
> [0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
> [0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
> [0x7f08171e177d]
> Nov 11 17:06:29 os03 osd.015[3641]: ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
> [0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
> [0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
> [0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
> [0x7f08171e177d]
> Nov 11 17:06:29 os03 osd.015[3641]: *** Caught signal (Aborted) **#012
> in thread 7f0816ee2700
> Nov 11 17:06:29 os03 osd.015[3641]: ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> /usr/bin/ceph-osd() [0x59e2a4]#012 2: (()+0xf490) [0x7f0818a1a490]#012
> 3: (gsignal()+0x35) [0x7f081712e905]#012 4: (abort()+0x175)
> [0x7f08171300e5]#012 5:
> (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f08179e3a7d]#012
> 6: (()+0xbcc06) [0x7f08179e1c06]#012 7: (()+0xbcc33)
> [0x7f08179e1c33]#012 8: (()+0xbcd2e) [0x7f08179e1d2e]#012 9:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x39f) [0x5a05ef]#012 10:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 11:
> (ceph::HeartbeatMap::is_healthy()+0x87) [0x5a7a97]#012 12:
> (ceph::HeartbeatMap::check_touch_file()+0x20) [0x5a7cc0]#012 13:
> (CephContextServiceThread::entry()+0x5f) [0x59ee3f]#012 14:
> (()+0x77e1) [0x7f0818a127e1]#012 15: (clone()+0x6d) [0x7f08171e177d]
>
> Regards,
> Christian
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
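
For reference, a minimal sketch of one way to collect the thread dump
requested at the top of this message (thread apply all bt) without an
interactive gdb session. It assumes gdb is installed and that a core file
exists. The binary path /usr/bin/ceph-osd is taken from the backtrace in
the report, while the core path and both helper function names are
hypothetical placeholders.

```python
import shlex
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb batch invocation that prints a backtrace for every thread."""
    # --batch exits after running the commands; -ex runs one gdb command.
    return ["gdb", "--batch", "-ex", "thread apply all bt", binary, core]


def dump_backtraces(binary, core):
    """Run gdb and return its output (requires gdb and a real core file)."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # "/path/to/core" is a placeholder; substitute the actual core file.
    cmd = gdb_backtrace_command("/usr/bin/ceph-osd", "/path/to/core")
    # Print the equivalent shell command instead of running it.
    print(shlex.join(cmd))
```

Redirecting the output of dump_backtraces() to a file makes it easy to
attach to a reply. Without debug symbols installed, the frames will likely
show only raw addresses, as in the syslog excerpt.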