Hello,
Besides the performance inconsistency (see the other thread titled "poor OSD performance using kernel 3.4"), where I promised some tests (I will run them this afternoon), we spent this weekend stress-testing Ceph: making backups with Bacula onto a 15 TB RBD volume (8 OSD nodes, on 8 physical machines).

Results: it worked like a charm for two days, apart from some btrfs warning messages; then the OSDs began to crash one after another, domino style. This morning, only 2 of the 8 OSDs are left.

One of the physical machines was in a kernel oops state. Nothing was logged remotely, so I don't know what happened; there was no clear stack trace. I suspect btrfs, but I have no proof. This node (osd.7) seems to have been the first one to crash, which triggered reconstruction between the OSDs and then led to the cascade of OSD crashes.

The other physical machines are still up, but with no OSDs running. Here are some traces found in the OSD logs:
-3> 2012-06-03 12:43:32.524671 7ff1352b8700 0 log [WRN] : slow request 30.506952 seconds old, received at 2012-06-03 12:43:01.997386: osd_sub_op(osd.0.0:1842628 2.57 ea8d5657/label5_17606_object7068/head [push] v 191'628 snapset=0=[]:[] snapc=0=[]) v6 currently queued for pg
-2> 2012-06-03 12:44:32.869852 7ff1352b8700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.073136 secs
-1> 2012-06-03 12:44:32.869886 7ff1352b8700 0 log [WRN] : slow request 30.073136 seconds old, received at 2012-06-03 12:44:02.796651: osd_sub_op(osd.6.0:1837430 2.59 97e62059/rb.0.1.0000000a2cdf/head [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) **
in thread 7ff1237f6700
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x708ea9]
2: (()+0xeff0) [0x7ff13af2cff0]
3: (gsignal()+0x35) [0x7ff13950b1b5]
4: (abort()+0x180) [0x7ff13950dfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ff139d9fdc5]
6: (()+0xcb166) [0x7ff139d9e166]
7: (()+0xcb193) [0x7ff139d9e193]
8: (()+0xcb28e) [0x7ff139d9e28e]
9: (std::__throw_length_error(char const*)+0x67) [0x7ff139d39307]
10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x72) [0x7ff139d7ab42]
11: (()+0xa8565) [0x7ff139d7b565]
12: (std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(char const*, unsigned long,
std::allocator<char> const&)+0x1b) [0x7ff139d7b7ab]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
const&)+0x9f) [0x6fdd9f]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3) [0x6eaba3]
16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6ebb02]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6ec378]
18: /usr/bin/ceph-osd() [0x704981]
19: (()+0x68ca) [0x7ff13af248ca]
20: (clone()+0x6d) [0x7ff1395a892d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Two OSDs exhibit similar traces.
---
The four others had traces like this one:
-5> 2012-06-03 13:31:39.393489 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:19.393488)
-4> 2012-06-03 13:31:40.393689 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:20.393687)
-3> 2012-06-03 13:31:41.402873 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:21.402872)
-2> 2012-06-03 13:31:42.363270 7f74f08ac700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.363269)
-1> 2012-06-03 13:31:42.416968 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.416966)
0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
int)+0x1eae) [0x649cce]
2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
const&)+0x2b1) [0x649fc1]
3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x203) [0x660343]
4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6580eb]
5: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
PG::RecoveryCtx*)+0x190) [0x6139d0]
6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
7: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
8: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
9: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
10: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
12: (()+0x68ca) [0x7f75036338ca]
13: (clone()+0x6d) [0x7f7501cb792d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- end dump of recent events ---
2012-06-03 13:36:48.487021 7f74f58b6700 -1 *** Caught signal (Aborted) **
in thread 7f74f58b6700
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x708ea9]
2: (()+0xeff0) [0x7f750363bff0]
3: (gsignal()+0x35) [0x7f7501c1a1b5]
4: (abort()+0x180) [0x7f7501c1cfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f75024aedc5]
6: (()+0xcb166) [0x7f75024ad166]
7: (()+0xcb193) [0x7f75024ad193]
8: (()+0xcb28e) [0x7f75024ad28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940) [0x77d550]
10: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
int)+0x1eae) [0x649cce]
11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
const&)+0x2b1) [0x649fc1]
12: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x203) [0x660343]
13: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x6580eb]
14: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
PG::RecoveryCtx*)+0x190) [0x6139d0]
15: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
16: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
17: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
18: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
19: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
20: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
21: (()+0x68ca) [0x7f75036338ca]
22: (clone()+0x6d) [0x7f7501cb792d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
The root cause may be btrfs... or maybe not. I don't see any btrfs crash or oops, just this:
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479173] Pid: 16875, comm: kworker/7:0 Tainted: G W 3.4.0-dsiun-120521 #108
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479218] Call Trace:
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479243] [<ffffffff81039f1b>] ? warn_slowpath_common+0x7b/0xc0
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278] [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328] [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379] [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415] [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460] [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493] [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543] [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572] [<ffffffff810548a7>] ? process_one_work+0x107/0x460
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479601] [<ffffffff81055a8e>] ? worker_thread+0x14e/0x330
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479628] [<ffffffff81055940>] ? manage_workers.isra.28+0x210/0x210
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479657] [<ffffffff8105a005>] ? kthread+0x85/0x90
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479684] [<ffffffff813be3e4>] ? kernel_thread_helper+0x4/0x10
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479712] [<ffffffff81059f80>] ? kthread_freezable_should_stop+0x60/0x60
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479741] [<ffffffff813be3e0>] ? gs_change+0x13/0x13
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479767] ---[ end trace 303c47aab4b5d025 ]---
Jun 3 00:44:11 chichibu.u14.univ-nantes.prive kernel: [204497.711101] ------------[ cut here ]------------
But this is just a warning (though maybe it could lead to a kernel oops/crash). It seems to have been fixed recently in git kernels.
I can provide all 8 OSD logs, plus the MDS & MON logs, if that helps.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx