Re: Hit suicide timeout after adding new osd

Hi,

On 01/17/2013 03:35 PM, Jens Kristian Søgaard wrote:
Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running
with 3 osds on 3 servers.

Then I added an extra osd on one of the servers, using the commands from
the documentation here:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
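
For reference, the sequence I ran was essentially the one from that page.
Roughly this, reconstructed from memory, so the id (3), weight, host name
and paths below are illustrative rather than exact:

  # Allocate a new osd id (the cluster prints the new id; 3 in my case)
  ceph osd create
  # Create the data directory and initialise the osd, including its key
  mkdir -p /var/lib/ceph/osd/ceph-3
  ceph-osd -i 3 --mkfs --mkkey
  # Register the key and place the osd in the CRUSH map
  ceph auth add osd.3 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-3/keyring
  ceph osd crush set 3 osd.3 1.0 pool=default host=server1
  # Start the daemon via the stock init script
  service ceph start osd.3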

Shortly after I did that, two of the existing osds crashed.

I restarted them, and after some hours they were up and running again,
but soon one of them crashed once more, and a third existing osd crashed
as well. I restarted those two and waited some hours for them to come up.
A short while later, one of them crashed again.

I have since restarted that last one and watched the logs closely. The
same pattern repeats itself every time: the osd starts up and does its
normal maintenance before going "up" (which takes a long while). Then it
seems to be running, but it logs the following every 5 seconds:

heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30

After some time it logs:

===================================================
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300

2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x8462bb]
  2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
  4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
  5: /lib64/libpthread.so.0() [0x360de07d14]
  6: (clone()+0x6d) [0x360d6f167d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
  in thread 7f053f149700

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: /usr/bin/ceph-osd() [0x82ea90]
  2: /lib64/libpthread.so.0() [0x360de0efe0]
  3: (gsignal()+0x35) [0x360d635925]
  4: (abort()+0x148) [0x360d6370d8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
===================================================
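
If it would help, I can produce the disassembly that the NOTE asks for.
On Fedora, something like this should do it (assuming the matching ceph
debuginfo package installs cleanly; the output file name is just an
example):

  # Pull in debug symbols for the installed ceph packages (yum-utils)
  debuginfo-install -y ceph
  # Disassemble with relocations and interleaved source, per the NOTE
  objdump -rdS /usr/bin/ceph-osd > ceph-osd.dump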

How can I avoid this? Is it a bug, or have I done something wrong?


I think you are seeing the same issue I reported about two weeks ago: http://www.spinics.net/lists/ceph-devel/msg11328.html

See this issue: http://tracker.newdream.net/issues/3714

I can't find the wip-3714 branch anymore, so it may already have been merged into 'next'.

You might want to try building 'next' yourself, or fetch newer packages from the RPM repos: http://eu.ceph.com/docs/master/install/rpm/
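
Building 'next' from source is just the usual autotools dance; roughly
this (untested from here, so treat it as a sketch and add your own
configure flags as needed):

  # Fetch the tree and switch to the 'next' branch
  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout next
  git submodule update --init
  # Standard autotools build for a 0.56-era tree
  ./autogen.sh
  ./configure
  make -j4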

Wido


I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity have been tested, and
nothing seems to be wrong there.

Thanks in advance for your assistance!


