Hi guys,
I had a functioning Ceph system that reported HEALTH_OK. It was running
with 3 osds on 3 servers.
Then I added an extra osd on 1 of the servers using the commands from
the documentation here:
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
Shortly after I did that, 2 of the existing osds crashed.
I restarted them and after some hours they were up and running again,
but soon one of them crashed again - and a third existing osd crashed as
well. I restarted those two and waited some hours for them to come up. A
short while later one of them crashed again.
I then restarted that last one and watched the logs
closely. The same pattern seems to repeat itself every time. It starts
up doing its normal maintenance before going "up" (takes a long while).
Then it seems to be running, but logs the following every 5 seconds:
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed
out after 30
After some time it logs:
===================================================
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide
timed out after 300
2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x2eb) [0x8462bb]
2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
5: /lib64/libpthread.so.0() [0x360de07d14]
6: (clone()+0x6d) [0x360d6f167d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
in thread 7f053f149700
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: /usr/bin/ceph-osd() [0x82ea90]
2: /lib64/libpthread.so.0() [0x360de0efe0]
3: (gsignal()+0x35) [0x360d635925]
4: (abort()+0x148) [0x360d6370d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
===================================================
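In case it helps anyone looking for the same pattern in their own OSD logs, here is a minimal sketch in Python that counts the "had timed out" warnings per thread and notes any "had suicide timed out" message. The names summarize_heartbeats and HEARTBEAT_RE are made up for illustration. It assumes each heartbeat_map message sits on a single line in the log file (the lines above are only wrapped by the mail client), and the regular expression is based solely on the two messages quoted in this mail.

```python
import re
from collections import Counter

# Pattern derived only from the two messages quoted in this mail, e.g.
#   heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30
#   heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<secs>\d+)"
)


def summarize_heartbeats(lines):
    """Return (warnings, suicides) for an iterable of log lines.

    warnings: Counter mapping (pool, thread) -> number of plain timeout warnings
    suicides: list of ((pool, thread), seconds) for each suicide timeout seen
    """
    warnings = Counter()
    suicides = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat_map timeout message
        key = (match.group("pool"), match.group("thread"))
        if match.group("suicide"):
            suicides.append((key, int(match.group("secs"))))
        else:
            warnings[key] += 1
    return warnings, suicides


# Small in-memory sample using the messages from this mail.
sample = [
    "heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30",
    "heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30",
    "heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300",
]
sample_warnings, sample_suicides = summarize_heartbeats(sample)
print(sample_warnings)
print(sample_suicides)
```

To run it against a real log, pass an open file object instead of the sample list, since iterating over a file yields one line at a time.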
How can I avoid this? Is it a bug, or have I done something wrong?
I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity have been tested and
nothing seems to be wrong there.
Thanks in advance for your assistance!
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@xxxxxxxxxxxxxxxxxxxx,
http://www.mermaidconsulting.com/