Re: Ceph osd's crashing repeatedly

"sascha a." <sascha.arthur@xxxxxxxxx> · Wed, 13 Nov 2019 18:18:02 +0100

Hello,
When i posted several days ago a crash nobody respondet as well. So i want to share my thoughts and maybe help you to find it (even im prett new to ceph and its code)

What i would do i your case:

- git checkout ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
from github

- imo your crash is happening to a failed assert close to:

Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]

Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]

Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]

Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]

- search in the code for 
CallClientContexts::finish and assets which could go wrong.
- Try to figure out why its failing and why the given assert could go wrong.

On my way i builded the monitors (src) myself with more debugging information until i was able to solve it.

Hope it helps you out.

Greetings
Sascha

nokia ceph <nokiacephusers@xxxxxxxxx> schrieb am Mi., 13. Nov. 2019, 17:28:
Hi,

We have upgraded a 5 node ceph cluster from Luminous to Nautilus and the cluster was running fine. Yesterday when we tried to add one more osd into the ceph cluster we find that the OSD is created in the cluster but suddenly some of the other OSD's started to crash and we are not able to restart any of the OSD's in that particular node where we found this issue. Due to this we are not able to add the OSD's in other node and we are not able to bring up the cluster.

The logs which are shown during the crash is below.

Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
Nov 13 16:26:13 cn5 numactl: 8: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x5649f77ab02c]
Nov 13 16:26:13 cn5 numactl: 9: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
Nov 13 16:26:13 cn5 numactl: 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f) [0x5649f77c60af]
Nov 13 16:26:13 cn5 numactl: 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x5649f76a3467]
Nov 13 16:26:13 cn5 numactl: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
Nov 13 16:26:13 cn5 numactl: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
Nov 13 16:26:13 cn5 numactl: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x5649f7aa5ce3]
Nov 13 16:26:13 cn5 numactl: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited, code=killed, status=6/ABRT

Could you please let us know what might be the issue and how to debug this?

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com