OSD hit suicide timeout

I upgraded to 0.38 today. A few hours later, two of my OSDs crashed
with "hit suicide timeout". After a restart, both are up again.

I've seen this in prior versions, so I don't think it's related to the
upgrade. I just wanted to report that it's still there.

Here is what I've found in our syslog:

Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
after 60
Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
after 60
Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had suicide timed
out after 180
Nov 11 17:06:29 os03 osd.015[3641]: common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
const char*, time_t)', in thread
'7f0816ee2700'#012common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit
suicide timeout")
Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
(commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
[0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
[0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
[0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
[0x7f08171e177d]
Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
(commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
[0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
[0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
[0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
[0x7f08171e177d]
Nov 11 17:06:29 os03 osd.015[3641]: *** Caught signal (Aborted) **#012
in thread 7f0816ee2700
Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
(commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
/usr/bin/ceph-osd() [0x59e2a4]#012 2: (()+0xf490) [0x7f0818a1a490]#012
3: (gsignal()+0x35) [0x7f081712e905]#012 4: (abort()+0x175)
[0x7f08171300e5]#012 5:
(__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f08179e3a7d]#012
6: (()+0xbcc06) [0x7f08179e1c06]#012 7: (()+0xbcc33)
[0x7f08179e1c33]#012 8: (()+0xbcd2e) [0x7f08179e1d2e]#012 9:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x39f) [0x5a05ef]#012 10:
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x214) [0x5a7774]#012 11:
(ceph::HeartbeatMap::is_healthy()+0x87) [0x5a7a97]#012 12:
(ceph::HeartbeatMap::check_touch_file()+0x20) [0x5a7cc0]#012 13:
(CephContextServiceThread::entry()+0x5f) [0x59ee3f]#012 14:
(()+0x77e1) [0x7f0818a127e1]#012 15: (clone()+0x6d) [0x7f08171e177d]
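
For anyone triaging similar logs, here is a minimal sketch that tallies the heartbeat warnings per thread and flags which thread hit the suicide timeout. It assumes the message wording shown in the excerpt above ("had timed out after N" and "had suicide timed out after N"); the function names `summarize_timeouts` and `unescape_syslog` are hypothetical helpers, not part of Ceph. The `#012` sequences in the backtrace appear to be the syslog daemon's octal escape for a newline, so the second helper simply expands them to make the stack readable.

```python
import re
from collections import defaultdict

# Pattern derived from the log lines quoted in this message (an assumption,
# not an official Ceph log format specification).
TIMEOUT_RE = re.compile(
    r"'(?P<pool>\S+) thread (?P<tid>0x[0-9a-f]+)' had "
    r"(?P<suicide>suicide )?timed out after (?P<secs>\d+)"
)


def summarize_timeouts(log_text):
    """Count ordinary timeout warnings per (pool, thread) and note suicide timeouts."""
    # The mail client wrapped the syslog entries, so collapse all whitespace
    # into single spaces before matching.
    flat = " ".join(log_text.split())
    summary = defaultdict(lambda: {"warnings": 0, "suicide": False})
    for m in TIMEOUT_RE.finditer(flat):
        key = (m.group("pool"), m.group("tid"))
        if m.group("suicide"):
            summary[key]["suicide"] = True
        else:
            summary[key]["warnings"] += 1
    return dict(summary)


def unescape_syslog(text):
    """Expand the '#012' escape (octal 012, line feed) into real newlines."""
    return text.replace("#012", "\n")


if __name__ == "__main__":
    sample = (
        "Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map\n"
        "is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out\n"
        "after 60\n"
        "Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map\n"
        "is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had suicide timed\n"
        "out after 180\n"
    )
    for (pool, tid), info in summarize_timeouts(sample).items():
        print(pool, tid, info)
```

Run against the full excerpt, this should make it easy to see that only one of the two FileStore::op_tp threads reached the 180-second suicide threshold, while the other threads only produced warnings.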

Regards,
Christian