osd aborts, sync entry timeout and suicide timeout

"Deneau, Tom" <tom.deneau@xxxxxxx> · Mon, 6 Jul 2015 18:56:11 +0000

I had a small cluster (4 nodes, 19 OSDs) on which I was running a sort
of stress test over the weekend.  Let's call the 4 nodes A, B, C and
D.  (Node A had the monitor running on it.)

Anyway, node C died with a hardware problem and, I think at about
that same time, two of the 5 OSDs on node B aborted with asserts.  The
other 3 OSDs on node B carried on without problem, as did the OSDs on
nodes A and D.  And the client tests continued to run without error.

I have attached the stack traces from the aborting OSDs below.  If
necessary, I can send the full OSD logs (which include the dump of the
10000 most recent events).

I don't know enough about the Ceph internals to know what these aborts
really mean.  Have others seen these kinds of aborts before?  (I would
assume these kinds of aborts are not normal.)  Are they an indication
of some kind of Ceph configuration problem or build problem?  As can
be seen, I am running 9.0.1, which I built from source for the aarch64
platform.

-- Tom Deneau, AMD

Aborting OSD #1
----------------
2015-07-03 20:27:47.013337 3ff7255efd0 -1 FileStore: sync_entry timed out after 600 seconds.
 ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)
 1: (Context::complete(int)+0x1c) [0x6cdbe4]
 2: (SafeTimer::timer_thread()+0x320) [0xbfed58]
 3: (SafeTimerThread::entry()+0x10) [0xc00af8]
 4: (()+0x6ed4) [0x3ff7f456ed4]
 5: (()+0xe08b0) [0x3ff7f0408b0]

2015-07-03 20:27:47.022743 3ff7255efd0 -1 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)' thread 3ff7255efd0 time 2015-07-03 20:27:47.013404
os/FileStore.cc: 3524: FAILED assert(0)

 ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8c) [0xc14b9c]
 2: (SyncEntryTimeout::finish(int)+0xc4) [0x91552c]
 3: (Context::complete(int)+0x1c) [0x6cdbe4]
 4: (SafeTimer::timer_thread()+0x320) [0xbfed58]
 5: (SafeTimerThread::entry()+0x10) [0xc00af8]
 6: (()+0x6ed4) [0x3ff7f456ed4]
 7: (()+0xe08b0) [0x3ff7f0408b0]

Aborting OSD #2
----------------
2015-07-03 20:28:14.693496 3ff6d2eefd0 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 3ff6d2eefd0 time 2015-07-03 20:28:14.665989
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8c) [0xc14b9c]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x328) [0xb5afc0]
 3: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x19c) [0xb5b2b4]
 4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xc06fdc]
 5: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xc07a98]
 6: (()+0x6ed4) [0x3ff85696ed4]
 7: (()+0xe08b0) [0x3ff852808b0]
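
In case it helps anyone compare against their own logs, here is a
minimal sketch of how the "FAILED assert" lines could be pulled out of
an OSD log.  The function name find_failed_asserts and the regular
expression are my own assumptions, based only on the line format seen
in the traces above; this is not a Ceph tool, and other Ceph versions
may format these lines differently.

```python
import re

# Assumed line format, taken from the traces above, e.g.:
#   os/FileStore.cc: 3524: FAILED assert(0)
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return (source_file, line_number, expression) for each FAILED assert line."""
    hits = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw.strip())
        if m:
            hits.append((m.group("file"), int(m.group("line")), m.group("expr")))
    return hits


# Sample input in the same format as the traces above.
sample = """\
os/FileStore.cc: 3524: FAILED assert(0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8c) [0xc14b9c]
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
"""

for source_file, line_number, expr in find_failed_asserts(sample):
    print(source_file, line_number, expr)
```

Stack-frame lines do not match the pattern, so only the two assert
lines are reported for the sample.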
