Hi,
We upgraded a 5-node Ceph cluster from Luminous to Nautilus and the cluster was running fine. Yesterday, when we tried to add one more OSD, the OSD was created in the cluster, but shortly afterwards some of the other OSDs on that node started to crash, and we are now unable to restart any of the OSDs on that node. Because of this we cannot add OSDs on the other nodes and cannot bring the cluster back up.
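If it helps, we can also share the current cluster state. These are the kind of commands we would run (just standard status/inventory commands, not taken from the crash itself):

    ceph -s
    ceph versions      # confirm all daemons report 14.2.2
    ceph osd tree      # the affected OSDs are on node cn5
    ceph health detail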
The log lines shown during the crash are below:
Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
Nov 13 16:26:13 cn5 numactl: 8: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x5649f77ab02c]
Nov 13 16:26:13 cn5 numactl: 9: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
Nov 13 16:26:13 cn5 numactl: 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f) [0x5649f77c60af]
Nov 13 16:26:13 cn5 numactl: 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x5649f76a3467]
Nov 13 16:26:13 cn5 numactl: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
Nov 13 16:26:13 cn5 numactl: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
Nov 13 16:26:13 cn5 numactl: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x5649f7aa5ce3]
Nov 13 16:26:13 cn5 numactl: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited, code=killed, status=6/ABRT
Could you please let us know what the issue might be and how to debug it?
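In the meantime, to gather more detail we were thinking of running something along these lines on cn5 (osd.279 is taken from the log above; the log path assumes the default location, and the crash id would come from "ceph crash ls"):

    journalctl -u ceph-osd@279 --no-pager | tail -n 500 > osd.279-journal.txt
    grep -A 20 'ceph_assert' /var/log/ceph/ceph-osd.279.log   # full assert message and backtrace
    ceph crash ls                                              # list recent daemon crashes
    ceph crash info <crash-id>                                 # details for the matching entry

Please let us know if any of that output (or anything else) would be useful.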