Troubleshooting cephadm OSDs aborting start

Dear Ceph-users,

we are having trouble with a Ceph cluster after a full shutdown. A couple of OSDs no longer start; they exit with SIGABRT almost immediately. With debug logs and a lot of work (I find cephadm clusters hard to debug, by the way) we obtained the following stack trace (how we raised the debug level and pulled the logs is sketched after the snippet below):

debug    -16> 2023-04-14T11:52:17.617+0000 7f10ab4d2700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread 7f10ab4d2700 time 2023-04-14T11:52:17.614095+0000

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h: 607: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)


 ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55b2dafc7b7e]

 2: /usr/bin/ceph-osd(+0x56ad98) [0x55b2dafc7d98]

 3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true> >(hobject_t const&, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&, bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0xc19) [0x55b2db1bb6b9]

 4: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xee2) [0x55b2db1adf22]

 5: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&, pg_shard_t)+0x75) [0x55b2db33c165]

 6: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x55b2db37adec]

 7: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x55b2db3a6e65]

 8: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x55b2db18ef6b]

 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x55b2db1839e1]

 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x55b2db0fde5c]

 11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55b2db32d0e6]

 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55b2db0efd48]

 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55b2db7615b4]

 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b2db764254]

 15: /lib64/libpthread.so.0(+0x817f) [0x7f10cef1117f]

 16: clone()

debug    -15> 2023-04-14T11:52:17.618+0000 7f10b64e8700  3 osd.70 72507 handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]

debug    -14> 2023-04-14T11:52:17.619+0000 7f10b64e8700  3 osd.70 72507 handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]

debug    -13> 2023-04-14T11:52:17.619+0000 7f10ac4d4700  5 osd.70 pg_epoch: 72507 pg[18.7( v 64162'106 (0'0,64162'106] local-lis/les=72506/72507 n=14 ec=17104/17104 lis/c=72506/72480 les/c/f=72507/72481/0 sis=72506 pruub=9.160680771s) [70,86,41] r=0 lpr=72506 pi=[72480,72506)/1 crt=64162'106 lcod 0'0 mlcod 0'0 active+wait pruub 12.822580338s@ mbc={}] exit Started/Primary/Active/Activating 0.011269 7 0.000114
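
For reference, this is roughly how we raised the OSD debug level and pulled the log back out of the cephadm/journald setup (a sketch, not a full recipe; osd.70 and <fsid> stand in for our values):

# ceph config set osd.70 debug_osd 20
# ceph config set osd.70 debug_ms 1
# cephadm logs --name osd.70 -- -n 10000
# journalctl -u ceph-<fsid>@osd.70.service

The first two raise the logging for that OSD before the next start attempt; the last two read the container's output back from journald.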

# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 92
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 103
    }
}


Another oddity: commands like `ceph -s`, `ceph osd tree`, `rbd ls`, etc. work fine, but `ceph orch ps` (and really any orch command) hangs forever, apparently stuck in a futex while waiting on a socket to the mons.
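
Since the orchestrator is a module inside the active mgr, the next thing we plan to try for the hanging orch commands is checking the mgr and failing it over, roughly like this (a sketch; we have not confirmed this is related to the OSD problem):

# ceph mgr dump | grep -e active_name -e available
# ceph mgr module ls | grep -i cephadm
# ceph mgr fail

`ceph mgr fail` without an argument makes a standby take over; if the cephadm module was stuck, orch commands may start responding again after the failover.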

If anyone has an idea of how we could get those OSDs back online, I'd be very grateful for any hints. I'm also reachable on Slack.
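
In case more data helps: we could also dump the PG log of one of the affected OSDs offline with ceph-objectstore-tool, roughly like this (a sketch; 18.7 is just the PG that appears near the assert in our log snippet and may not be the affected one, and the OSD has to be stopped first):

# systemctl stop ceph-<fsid>@osd.70.service
# cephadm shell --name osd.70
[inside the shell]
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 --pgid 18.7 --op log > /tmp/pg-18.7-log.json

--op log is read-only; we have not touched any of the destructive ops (export-remove, trim-pg-log, ...) so far.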

Greetings
-- 
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@xxxxxxxxxxxxxxxxxx
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend



