Dear Ceph-users,

in the meantime I found this ticket, which seems to show the same assertion / stack trace but was resolved: https://tracker.ceph.com/issues/44532

Does anyone have an idea how this could still happen in 16.2.7?

Greetings
André

----- On 17 Apr 2023 at 10:30, Andre Gemuend andre.gemuend@xxxxxxxxxxxxxxxxxx wrote:

> Dear Ceph-users,
>
> we are having trouble with a Ceph cluster after a full shutdown. A couple of
> OSDs no longer start; they exit with SIGABRT very quickly. With debug logs and
> a lot of work (I find cephadm clusters hard to debug, by the way) we obtained
> the following stack trace:
>
> debug -16> 2023-04-14T11:52:17.617+0000 7f10ab4d2700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread 7f10ab4d2700 time 2023-04-14T11:52:17.614095+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h: 607: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)
>
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55b2dafc7b7e]
> 2: /usr/bin/ceph-osd(+0x56ad98) [0x55b2dafc7d98]
> 3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true> >(hobject_t const&, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&, bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0xc19) [0x55b2db1bb6b9]
> 4: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xee2) [0x55b2db1adf22]
> 5: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&, pg_shard_t)+0x75) [0x55b2db33c165]
> 6: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x55b2db37adec]
> 7: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x55b2db3a6e65]
> 8: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x55b2db18ef6b]
> 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x55b2db1839e1]
> 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x55b2db0fde5c]
> 11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55b2db32d0e6]
> 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55b2db0efd48]
> 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55b2db7615b4]
> 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b2db764254]
> 15: /lib64/libpthread.so.0(+0x817f) [0x7f10cef1117f]
> 16: clone()
>
> debug -15> 2023-04-14T11:52:17.618+0000 7f10b64e8700 3 osd.70 72507 handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
> debug -14> 2023-04-14T11:52:17.619+0000 7f10b64e8700 3 osd.70 72507 handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
> debug -13> 2023-04-14T11:52:17.619+0000 7f10ac4d4700 5 osd.70 pg_epoch: 72507 pg[18.7( v 64162'106 (0'0,64162'106] local-lis/les=72506/72507 n=14 ec=17104/17104 lis/c=72506/72480 les/c/f=72507/72481/0 sis=72506 pruub=9.160680771s) [70,86,41] r=0 lpr=72506 pi=[72480,72506)/1 crt=64162'106 lcod 0'0 mlcod 0'0 active+wait pruub 12.822580338s@ mbc={}] exit Started/Primary/Active/Activating 0.011269 7 0.000114
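(Note added in this reply, to spell out how I read the failed assert: PGLog::IndexedLog::add() at PGLog.h:607 apparently requires every appended log entry to have a version strictly greater than the current log head, unless the log is still empty, i.e. head.version == 0. The little C++ sketch below is only my simplified illustration of that invariant; ToyVersion, ToyEntry and ToyLog are made-up names and this is not the actual Ceph code. But it shows the shape of the failure: during peering, merge_log() on the stray PG seems to hand add() an entry whose version does not advance the head, and the whole OSD aborts.

    // toy_pglog_invariant.cc: simplified illustration only, not Ceph code.
    // Build with: g++ -std=c++17 toy_pglog_invariant.cc
    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Stand-in for eversion_t: an (epoch, version) pair.
    struct ToyVersion {
        uint64_t epoch = 0;
        uint64_t version = 0;
    };

    // Stand-in for pg_log_entry_t: only the field the assert looks at.
    struct ToyEntry {
        ToyVersion version;
    };

    // Stand-in for PGLog::IndexedLog, reduced to the head/add() invariant.
    struct ToyLog {
        ToyVersion head;                 // newest entry; version == 0 means the log is empty
        std::vector<ToyEntry> entries;

        void add(const ToyEntry& e) {
            // Mirrors the failed check: a new entry must strictly advance
            // the head unless the log is still empty.
            assert(head.version == 0 || e.version.version > head.version);
            entries.push_back(e);
            head = e.version;
        }
    };

    int main() {
        ToyLog log;
        log.add(ToyEntry{ToyVersion{64162, 106}});    // ok: log was empty
        log.add(ToyEntry{ToyVersion{64162, 107}});    // ok: 107 > 106
        std::cout << "log head is now at version " << log.head.version << "\n";
        // log.add(ToyEntry{ToyVersion{64162, 106}}); // would abort: 106 is not > 107
        return 0;
    }

So, if I understand it correctly, the log this stray PG receives during peering contains an entry that does not advance our current head, which is why the tracker issue above looked like the same problem to me.)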
> # ceph versions
> {
>     "mon": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "osd": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 92
>     },
>     "mds": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "rgw": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "overall": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 103
>     }
> }
>
> Another oddity: commands like `ceph -s`, `ceph osd tree`, `rbd ls`, etc. still
> work, but `ceph orch ps` (and in fact any orch command) simply hangs forever,
> apparently in a futex waiting on a socket to the mons.
>
> If anyone has any ideas how we could get those OSDs back online, I'd be very
> grateful for any hints. I'm also on Slack.
>
> Greetings
> --
> André Gemünd, Leiter IT / Head of IT
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend@xxxxxxxxxxxxxxxxxx
> Tel: +49 2241 14-4199
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

--
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@xxxxxxxxxxxxxxxxxx
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx