Ceph OSDs crashing repeatedly

Hi,

We upgraded a 5-node Ceph cluster from Luminous to Nautilus, and the cluster ran fine afterwards. Yesterday, when we tried to add one more OSD to the cluster, the OSD was created, but some of the other OSDs on that node suddenly started to crash, and we are now unable to restart any of the OSDs on that node. Because of this, we cannot add the OSDs on the other nodes and cannot bring the cluster back up.
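
For reference, this is roughly what we can run to keep the cluster from reacting further and to capture more detail on the next start attempt (a minimal sketch assuming the standard Nautilus tooling; osd.279 is taken from the log below):

    # keep the cluster from marking the crashed OSDs out and shuffling data
    ceph osd set noout
    ceph osd set norebalance

    # raise logging for the affected OSD before retrying it
    ceph config set osd.279 debug_osd 20

    # retry the daemon and pull the full backtrace from the journal
    systemctl start ceph-osd@279
    journalctl -u ceph-osd@279 -e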

The log output captured during the crash is below:


Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
Nov 13 16:26:13 cn5 numactl: 8: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x5649f77ab02c]
Nov 13 16:26:13 cn5 numactl: 9: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
Nov 13 16:26:13 cn5 numactl: 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f) [0x5649f77c60af]
Nov 13 16:26:13 cn5 numactl: 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x5649f76a3467]
Nov 13 16:26:13 cn5 numactl: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
Nov 13 16:26:13 cn5 numactl: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
Nov 13 16:26:13 cn5 numactl: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x5649f7aa5ce3]
Nov 13 16:26:13 cn5 numactl: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited, code=killed, status=6/ABRT
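
If a fuller dump than the journal snippet above would help, we should also be able to pull one from the crash module (a sketch, assuming the mgr crash module is enabled on this Nautilus cluster; the crash ID placeholder is illustrative):

    ceph crash ls
    ceph crash info <crash-id>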


Could you please let us know what the issue might be and how we can debug this?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
