Dear Ceph-users,
we are having trouble with a Ceph cluster after a full shutdown. A couple of OSDs
no longer start and exit with SIGABRT almost immediately. With debug logging and
a lot of work (I find cephadm clusters hard to debug, btw) we obtained the
following stack trace:
debug -16> 2023-04-14T11:52:17.617+0000 7f10ab4d2700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread
7f10ab4d2700 time 2023-04-14T11:52:17.614095+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
607: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158)
[0x55b2dafc7b7e]
2: /usr/bin/ceph-osd(+0x56ad98) [0x55b2dafc7d98]
3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true>
(hobject_t const&, std::__cxx11::list<pg_log_entry_t,
mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&,
bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*,
DoutPrefixProvider const*)+0xc19) [0x55b2db1bb6b9]
4: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0xee2) [0x55b2db1adf22]
5: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&,
pg_shard_t)+0x75) [0x55b2db33c165]
6: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x55b2db37adec]
7: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xd5) [0x55b2db3a6e65]
8: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x55b2db18ef6b]
9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x55b2db1839e1]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x55b2db0fde5c]
11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55b2db32d0e6]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28)
[0x55b2db0efd48]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55b2db7615b4]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b2db764254]
15: /lib64/libpthread.so.0(+0x817f) [0x7f10cef1117f]
16: clone()
debug -15> 2023-04-14T11:52:17.618+0000 7f10b64e8700 3 osd.70 72507
handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
debug -14> 2023-04-14T11:52:17.619+0000 7f10b64e8700 3 osd.70 72507
handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]
debug -13> 2023-04-14T11:52:17.619+0000 7f10ac4d4700 5 osd.70 pg_epoch:
72507 pg[18.7( v 64162'106 (0'0,64162'106] local-lis/les=72506/72507 n=14
ec=17104/17104 lis/c=72506/72480 les/c/f=72507/72481/0 sis=72506
pruub=9.160680771s) [70,86,41] r=0 lpr=72506 pi=[72480,72506)/1 crt=64162'106
lcod 0'0 mlcod 0'0 active+wait pruub 12.822580338s@ mbc={}] exit
Started/Primary/Active/Activating 0.011269 7 0.000114
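To make clear what the failed assertion checks, here is a minimal sketch in Python. It is a toy model, not Ceph code: the names `EVersion` and `IndexedLogModel` are made up for illustration, and the only thing taken from the trace is the condition `head.version == 0 || e.version.version > head.version`. As far as I read it, an entry may only be appended if the log is empty or its version is strictly greater than the current head version.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EVersion:
    """Hypothetical stand-in for an (epoch, version) pair such as 64162'106."""
    epoch: int
    version: int


@dataclass
class IndexedLogModel:
    """Toy model of a PG log that tracks only its entries and its head."""
    head: EVersion = field(default_factory=lambda: EVersion(0, 0))
    entries: list = field(default_factory=list)

    def add(self, e: EVersion) -> None:
        # Same condition as the failed ceph_assert in the trace:
        #   head.version == 0 || e.version.version > head.version
        if not (self.head.version == 0 or e.version > self.head.version):
            raise AssertionError(
                f"entry version {e.version} is not newer than head {self.head.version}"
            )
        self.entries.append(e)
        self.head = e


if __name__ == "__main__":
    log = IndexedLogModel()
    log.add(EVersion(1, 1))      # accepted: log was empty
    log.add(EVersion(1, 2))      # accepted: 2 > 1
    try:
        log.add(EVersion(2, 2))  # rejected: 2 is not > 2, even with a newer epoch
    except AssertionError as err:
        print("rejected:", err)
```

In this model the check compares only the version counter, not the epoch, so an incoming entry with an equal or older version than the local head trips it. Whether that is what happens on our OSDs during `merge_log` is only my guess.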
# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 92
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 103
    }
}
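For anyone who wants to double-check that output mechanically, here is a small Python sketch. It assumes the JSON has the shape shown above (daemon type, then version string, then count); the function name `summarize_versions` is made up for this example. It confirms that only one version is in use and that the per-daemon counts add up to the "overall" figure.

```python
VERSION = "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)"

# Counts copied from the `ceph versions` output above.
VERSIONS = {
    "mon": {VERSION: 5},
    "mgr": {VERSION: 2},
    "osd": {VERSION: 92},
    "mds": {VERSION: 2},
    "rgw": {VERSION: 2},
    "overall": {VERSION: 103},
}


def summarize_versions(versions: dict) -> dict:
    """Summarize a parsed `ceph versions` document (hypothetical helper)."""
    per_daemon = {k: v for k, v in versions.items() if k != "overall"}
    # Every distinct version string reported by any daemon type.
    distinct = {ver for counts in per_daemon.values() for ver in counts}
    # Total number of daemons, summed over all daemon types.
    total = sum(n for counts in per_daemon.values() for n in counts.values())
    overall = sum(versions.get("overall", {}).values())
    return {
        "distinct_versions": sorted(distinct),
        "daemon_total": total,
        "matches_overall": total == overall,
    }


if __name__ == "__main__":
    print(summarize_versions(VERSIONS))
```

With live data the dictionary could be loaded with `json.loads` from the command output instead of being typed in by hand.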
In addition, commands such as `ceph -s`, `ceph osd tree`, `rbd ls`, etc.
work, but `ceph orch ps` (and apparently every other orch command) hangs forever,
seemingly blocked in a futex, waiting on a socket to the mons.
If anyone has an idea how we could get those OSDs back online, I'd be very
grateful for any hints. I'm also on Slack.
Greetings
--
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@xxxxxxxxxxxxxxxxxx
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend