We are running Rook-Ceph deployed as an operator in Kubernetes, with Rook 1.10.8 and Ceph 17.2.5. The cluster works fine overall, but roughly every 3-4 days an OSD daemon crashes and then restarts without any further problem. We are also seeing flapping OSDs, i.e. OSDs repeatedly being marked up and down.

Recently the daemon crash happened for 2 OSDs at the same time, on different nodes, with the error below in the crash info:

   -305> 2023-12-17T14:50:14.413+0000 7f53b5f91700 -1 *** Caught signal (Aborted) **
   in thread 7f53b5f91700 thread_name:tp_osd_tp

   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
   1: /lib64/libpthread.so.0(+0x12cf0) [0x7f53d93ddcf0]
   2: gsignal()
   3: abort()
   4: /lib64/libc.so.6(+0x21d79) [0x7f53d8025d79]
   5: /lib64/libc.so.6(+0x47456) [0x7f53d804b456]
   6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
   7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]
   8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]
   9: (AsyncConnection::send_message(Message*)+0x18e) [0x55acc167dc4e]
   10: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x2bd) [0x55acc0b4b11d]
   11: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x6c8) [0x55acc0f69368]
   12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x5e7) [0x55acc0f6c907]
   13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55acc0c92ebd]
   14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55acc0cf0295]
   15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55acc0cf78fd]
   16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55acc0b56900]
   17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55acc0e552ad]
   18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55acc0b69dbf]
   19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55acc12c78c5]
   20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55acc12c9fe4]
   21: /lib64/libpthread.so.0(+0x81ca) [0x7f53d93d31ca]
   22: clone()

The OSD log also shows the following error before the crash:

   scrub-queue::*remove_from_osd_queue* removing pg[2.4f0] failed. State was: unregistering

Please help to troubleshoot and fix this issue. I have already posted it on the Ceph tracker, but there has been no reply there for 3-4 days.
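For reference, this is roughly how the crash reports and the flapping OSDs can be inspected from the Rook toolbox pod; the rook-ceph namespace, the rook-ceph-tools deployment, the rook-ceph-osd-<ID> deployment name and the "osd" container name are the Rook defaults, so they may need adjusting for a different install:

   # open a shell in the Rook toolbox pod (namespace/deployment names are the Rook defaults)
   kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

   # list recorded daemon crashes and dump the full report (incl. backtrace) for one of them
   ceph crash ls
   ceph crash info <crash-id>

   # check overall health and which OSDs are being marked down (flapping)
   ceph status
   ceph health detail
   ceph osd tree

   # OSD log of the previous (crashed) container instance, straight from the OSD pod
   kubectl -n rook-ceph logs deploy/rook-ceph-osd-<ID> -c osd --previous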