Re: Ceph osd's crashing repeatedly


Hello,

When I posted about a crash several days ago, nobody responded either. So I want to share my thoughts and maybe help you track it down (even though I'm pretty new to Ceph and its code).


What I would do in your case:

- Check out the source for ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable) from GitHub.

- IMO your crash is caused by a failed assert close to:


Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]

- Search the code for CallClientContexts::finish and the asserts in it that could fail.

- Try to figure out why it is failing and which condition makes that assert trip.
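
To decide what to search for, it can help to pull the frames out of the journal output first. The following is a minimal Python sketch, not part of Ceph: the names FRAME_RE, parse_frames and first_caller_after_assert are hypothetical, and the regular expression only assumes the line layout seen in the excerpt above. It reports the first named frame below the assert helpers, which is the function to look up in the source.

```python
import re

# One backtrace frame as printed in the journal, for example:
# "... numactl: 7: (CallClientContexts::finish(...)+0x6b9) [0x5649f77d5bf9]"
# Assumption: the layout is "<n>: (<symbol>+<offset>) [<address>]".
FRAME_RE = re.compile(
    r"\s(?P<index>\d+): \((?P<symbol>.*)\+(?P<offset>0x[0-9a-f]+|\d+)\)"
    r" \[(?P<address>0x[0-9a-f]+)\]\s*$"
)

# Frames 4 to 7 copied from the crash log in this thread.
SAMPLE = """\
Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
"""


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue  # not a frame line (version banner, NOTE, systemd, ...)
        symbol = match.group("symbol")
        frames.append({
            "index": int(match.group("index")),
            "symbol": symbol,
            # Text before the first "(" is the function name; it is empty
            # for unresolved frames, which are printed as "()".
            "function": symbol.split("(")[0],
            "offset": match.group("offset"),
            "address": match.group("address"),
        })
    return frames


def first_caller_after_assert(frames):
    """Return the first named frame below the __ceph_assert helpers, or None."""
    seen_assert = False
    for frame in frames:
        if "__ceph_assert" in frame["symbol"]:
            seen_assert = True
        elif seen_assert and frame["function"]:
            return frame
    return None


if __name__ == "__main__":
    caller = first_caller_after_assert(parse_frames(SAMPLE))
    if caller is not None:
        print(caller["index"], caller["function"], caller["offset"])
```

The unresolved frame 6 is skipped because it has no symbol name; resolving it would need the executable, as the NOTE at the end of the log says.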


In my case, I built the monitors from source myself with more debugging information until I was able to solve it.

Hope it helps you out.

Greetings
Sascha

nokia ceph <nokiacephusers@xxxxxxxxx> wrote on Wed, 13 Nov 2019, 17:28:
Hi,

We upgraded a 5-node Ceph cluster from Luminous to Nautilus and the cluster was running fine. Yesterday, when we tried to add one more OSD to the cluster, the OSD was created, but suddenly some of the other OSDs started to crash, and we are not able to restart any of the OSDs on the node where we found this issue. Because of this, we are not able to add OSDs on the other nodes or bring up the cluster.

The logs shown during the crash are below.


Nov 13 16:26:13 cn5 numactl: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Nov 13 16:26:13 cn5 numactl: 1: (()+0xf5d0) [0x7f488bb0f5d0]
Nov 13 16:26:13 cn5 numactl: 2: (gsignal()+0x37) [0x7f488a8ff207]
Nov 13 16:26:13 cn5 numactl: 3: (abort()+0x148) [0x7f488a9008f8]
Nov 13 16:26:13 cn5 numactl: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x5649f7348d43]
Nov 13 16:26:13 cn5 numactl: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5649f7348ec2]
Nov 13 16:26:13 cn5 numactl: 6: (()+0x8e7e60) [0x5649f77c3e60]
Nov 13 16:26:13 cn5 numactl: 7: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6b9) [0x5649f77d5bf9]
Nov 13 16:26:13 cn5 numactl: 8: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x5649f77ab02c]
Nov 13 16:26:13 cn5 numactl: 9: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xd57) [0x5649f77c5627]
Nov 13 16:26:13 cn5 numactl: 10: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x9f) [0x5649f77c60af]
Nov 13 16:26:13 cn5 numactl: 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x5649f76a3467]
Nov 13 16:26:13 cn5 numactl: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x695) [0x5649f764f365]
Nov 13 16:26:13 cn5 numactl: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x5649f7489ea9]
Nov 13 16:26:13 cn5 numactl: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5649f77275d2]
Nov 13 16:26:13 cn5 numactl: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x5649f74a6ef4]
Nov 13 16:26:13 cn5 numactl: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x5649f7aa5ce3]
Nov 13 16:26:13 cn5 numactl: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5649f7aa8d80]
Nov 13 16:26:13 cn5 numactl: 18: (()+0x7dd5) [0x7f488bb07dd5]
Nov 13 16:26:13 cn5 numactl: 19: (clone()+0x6d) [0x7f488a9c6ead]
Nov 13 16:26:13 cn5 numactl: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Nov 13 16:26:13 cn5 systemd: ceph-osd@279.service: main process exited, code=killed, status=6/ABRT


Could you please let us know what might be the issue and how to debug this?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com