Re: Troubleshooting cephadm OSDs aborting start

Dear Ceph-users,

In the meantime I found this ticket, which seems to show the same assertion / stack trace but was resolved: https://tracker.ceph.com/issues/44532

Does anyone have an idea how this could still happen in 16.2.7?
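
From the backtrace the assert fires in PeeringState::Stray::react(MLogRec),
i.e. while the OSD, acting as a stray replica, merges the log it received
from the primary, and the failed condition `head.version == 0 ||
e.version.version > head.version` means an incoming log entry does not
strictly advance the local log head.

If we cannot find the root cause, one workaround that is sometimes suggested
for this class of assert is to export and remove the affected PG copy from
the crashing OSD with ceph-objectstore-tool and let it backfill from the
healthy replicas. A rough, untested sketch for a cephadm cluster; <fsid>,
<id> and <pgid> are placeholders, and the affected PG should show up in the
debug log right before the assert. On the host:

# systemctl stop ceph-<fsid>@osd.<id>.service
# cephadm shell --name osd.<id>

Inside the shell, back up the PG copy first, then remove it so the OSD can
boot and backfill (note that /tmp here is inside the shell container, so the
export should really go somewhere persistent):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op export --file /tmp/<pgid>.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op remove --force

Of course that only helps if the remaining replicas of the PG are healthy,
so I would check `ceph pg <pgid> query` first.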

Greetings
André


----- On 17 Apr 2023 at 10:30, Andre Gemuend andre.gemuend@xxxxxxxxxxxxxxxxxx wrote:

> Dear Ceph-users,
> 
> we are having trouble with a Ceph cluster after a full shutdown. A couple
> of OSDs no longer start, exiting with SIGABRT almost immediately. With
> debug logs and a lot of work (I find cephadm clusters hard to debug, by
> the way) we obtained the stack trace below.
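> 
> (For context: raising the log level and pulling a daemon's log on a
> cephadm host works with something like
> 
> # ceph config set osd.70 debug_osd 20
> # cephadm logs --name osd.70
> 
> osd.70 here being the daemon from the trace below.)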
> 
> debug    -16> 2023-04-14T11:52:17.617+0000 7f10ab4d2700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
> In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread
> 7f10ab4d2700 time 2023-04-14T11:52:17.614095+0000
> 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
> 607: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)
> 
> 
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
> 
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158)
> [0x55b2dafc7b7e]
> 
> 2: /usr/bin/ceph-osd(+0x56ad98) [0x55b2dafc7d98]
> 
> 3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true>
> >(hobject_t const&, std::__cxx11::list<pg_log_entry_t,
> mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&,
> bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*,
> DoutPrefixProvider const*)+0xc19) [0x55b2db1bb6b9]
> 
> 4: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&,
> PGLog::LogEntryHandler*, bool&, bool&)+0xee2) [0x55b2db1adf22]
> 
> 5: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&,
> pg_shard_t)+0x75) [0x55b2db33c165]
> 
> 6: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x55b2db37adec]
> 
> 7: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
> boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0xd5) [0x55b2db3a6e65]
> 
> 8: (boost::statechart::state_machine<PeeringState::PeeringMachine,
> PeeringState::Initial, std::allocator<boost::statechart::none>,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x5b) [0x55b2db18ef6b]
> 
> 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
> [0x55b2db1839e1]
> 
> 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
> ThreadPool::TPHandle&)+0x29c) [0x55b2db0fde5c]
> 
> 11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55b2db32d0e6]
> 
> 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28)
> [0x55b2db0efd48]
> 
> 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x55b2db7615b4]
> 
> 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b2db764254]
> 
> 15: /lib64/libpthread.so.0(+0x817f) [0x7f10cef1117f]
> 
> 16: clone()
> 
> debug    -15> 2023-04-14T11:52:17.618+0000 7f10b64e8700  3 osd.70 72507
> handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
> 
> debug    -14> 2023-04-14T11:52:17.619+0000 7f10b64e8700  3 osd.70 72507
> handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
> 
> debug    -13> 2023-04-14T11:52:17.619+0000 7f10ac4d4700  5 osd.70 pg_epoch:
> 72507 pg[18.7( v 64162'106 (0'0,64162'106] local-lis/les=72506/72507 n=14
> ec=17104/17104 lis/c=72506/72480 les/c/f=72507/72481/0 sis=72506
> pruub=9.160680771s) [70,86,41] r=0 lpr=72506 pi=[72480,72506)/1 crt=64162'106
> lcod 0'0 mlcod 0'0 active+wait pruub 12.822580338s@ mbc={}] exit
> Started/Primary/Active/Activating 0.011269 7 0.000114
> 
> # ceph versions
> {
>     "mon": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
>     },
>     "mgr": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "osd": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 92
>     },
>     "mds": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "rgw": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
>     },
>     "overall": {
>         "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 103
>     }
> }
> 
> 
> Another thing: commands like `ceph -s`, `ceph osd tree`, `rbd ls`, etc.
> work, but `ceph orch ps` (and generally any orch command) simply hangs
> forever, seemingly blocked in a futex while waiting on a socket to the
> mons.
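> 
> (As far as I understand, the orch commands are served by the cephadm
> module inside the active mgr, so a failover with
> 
> # ceph mgr fail
> 
> might unstick them, though we have not tried that yet.)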
> 
> If anyone has ideas on how we could get these OSDs back online, I would
> be very grateful for any hints. I'm also on Slack.
> 
> Greetings
> --
> André Gemünd, Leiter IT / Head of IT
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend@xxxxxxxxxxxxxxxxxx
> Tel: +49 2241 14-4199
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@xxxxxxxxxxxxxxxxxx
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



