Re: OSD hit suicide timeout

Hi Christian,

Do you have a core file?  If so, can you dump the stack traces of all
threads (in gdb: thread apply all bt) so we can see whether one got hung
up in a syscall or somewhere internally?

Thanks-
sage
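
For anyone following along, the backtrace request above can be scripted
roughly as follows. This is only a sketch: gdb_backtrace_argv is a
made-up helper name, the core file path is a placeholder, and the binary
path is simply the one that appears in the quoted log. It assumes gdb
and matching debug symbols are installed on the OSD host.

```python
import shlex


def gdb_backtrace_argv(binary, core):
    """Build a gdb command line that prints a backtrace for every thread
    in a core file. Hypothetical helper, not part of Ceph."""
    return [
        "gdb",
        "--batch",                     # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
        binary,
        core,
    ]


if __name__ == "__main__":
    # Placeholder core path; use wherever your system writes core files.
    argv = gdb_backtrace_argv("/usr/bin/ceph-osd", "/path/to/core")
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(argv))
```

Redirecting the output of the printed command to a file gives something
that can be attached to a reply on the list.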


On Fri, 11 Nov 2011, Christian Brunner wrote:

> I upgraded to 0.38 today. A few hours later, two of my OSDs crashed
> with "hit suicide timeout". After a restart, they are up again.
> 
> I've seen this in prior versions, so I don't think it's related to the
> upgrade. I just wanted to report that it's still there.
> 
> Here is what I've found in our syslog:
> 
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:05:59 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:04 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:09 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:14 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:19 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:24 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080fed4700' had timed out
> after 60
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had timed out
> after 60
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had suicide timed
> out after 180
> Nov 11 17:06:29 os03 osd.015[3641]: common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)', in thread
> '7f0816ee2700'#012common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit
> suicide timeout")
> Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
> [0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
> [0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
> [0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
> [0x7f08171e177d]
> Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 2: (ceph::HeartbeatMap::is_healthy()+0x87)
> [0x5a7a97]#012 3: (ceph::HeartbeatMap::check_touch_file()+0x20)
> [0x5a7cc0]#012 4: (CephContextServiceThread::entry()+0x5f)
> [0x59ee3f]#012 5: (()+0x77e1) [0x7f0818a127e1]#012 6: (clone()+0x6d)
> [0x7f08171e177d]
> Nov 11 17:06:29 os03 osd.015[3641]: *** Caught signal (Aborted) **#012
> in thread 7f0816ee2700
> Nov 11 17:06:29 os03 osd.015[3641]:  ceph version 0.38
> (commit:b600ec2ac7c0f2e508720f8e8bb87c3db15509b9)#012 1:
> /usr/bin/ceph-osd() [0x59e2a4]#012 2: (()+0xf490) [0x7f0818a1a490]#012
> 3: (gsignal()+0x35) [0x7f081712e905]#012 4: (abort()+0x175)
> [0x7f08171300e5]#012 5:
> (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f08179e3a7d]#012
> 6: (()+0xbcc06) [0x7f08179e1c06]#012 7: (()+0xbcc33)
> [0x7f08179e1c33]#012 8: (()+0xbcd2e) [0x7f08179e1d2e]#012 9:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x39f) [0x5a05ef]#012 10:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x214) [0x5a7774]#012 11:
> (ceph::HeartbeatMap::is_healthy()+0x87) [0x5a7a97]#012 12:
> (ceph::HeartbeatMap::check_touch_file()+0x20) [0x5a7cc0]#012 13:
> (CephContextServiceThread::entry()+0x5f) [0x59ee3f]#012 14:
> (()+0x77e1) [0x7f0818a127e1]#012 15: (clone()+0x6d) [0x7f08171e177d]
> 
> Regards,
> Christian
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
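
The heartbeat_map lines in the quoted syslog repeat every five seconds,
so it can help to tally them per thread before reading further. A
minimal sketch, assuming the log is pasted in the same wrapped and
quoted form as above; summarize_heartbeat and PATTERN are illustrative
names, and the regular expression is based only on the message format
visible in this excerpt.

```python
import re
from collections import Counter

# Matches "'FileStore::op_tp thread 0x7f080f6d3700' had timed out after 60"
# as well as the "had suicide timed out after 180" variant.
PATTERN = re.compile(
    r"'(?P<pool>\S+) thread (?P<tid>0x[0-9a-f]+)' had "
    r"(?P<suicide>suicide )?timed out after (?P<secs>\d+)"
)


def summarize_heartbeat(text):
    """Count heartbeat timeout messages per (pool, thread, kind, seconds)."""
    # Drop the "> " quote prefix, then undo the mail client's line wrapping.
    unquoted = [re.sub(r"^>\s?", "", line) for line in text.splitlines()]
    flat = " ".join(" ".join(unquoted).split())
    counts = Counter()
    for m in PATTERN.finditer(flat):
        kind = "suicide" if m.group("suicide") else "timeout"
        key = (m.group("pool"), m.group("tid"), kind, int(m.group("secs")))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = """\
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'OSD::op_tp thread 0x7f0809dc7700' had timed out after 30
> Nov 11 17:06:29 os03 osd.015[3641]: 7f0816ee2700 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7f080f6d3700' had suicide timed
> out after 180
"""
    for key, n in summarize_heartbeat(sample).items():
        print(key, n)
```

Run against the full excerpt, this shows which thread pool entries
repeat and which thread finally hit the suicide timeout.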

