Hello all,

I'm having some stability issues with my Ceph cluster at the moment. I'm running CentOS 7 and Ceph 12.2.4, and I have OSDs that are segfaulting regularly, roughly every minute or so. It seems to be getting worse, and I'm now seeing cascading failures. Backtraces look like this:

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa3c611) [0x55cb9249c611]
 2: (()+0xf5e0) [0x7eff83b495e0]
 3: (std::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > >::list(std::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > > const&)+0x3e) [0x55cb9225562e]
 4: (ECBackend::send_all_remaining_reads(hobject_t const&, ECBackend::ReadOp&)+0x33b) [0x55cb92243bab]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1074) [0x55cb92245184]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1af) [0x55cb9224fa2f]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55cb921545f0]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x55cb920c004c]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x55cb91f45f69]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55cb921c2b57]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x55cb91f749de]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55cb924e1089]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55cb924e3020]
 14: (()+0x7e25) [0x7eff83b41e25]
 15: (clone()+0x6d) [0x7eff82c3534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

When I start a crashed OSD, it seems to trigger the same crash in other OSDs, with the same backtrace. This is making it hard to keep my placement groups up and active.

A full (start to finish) log file is available here: http://people.cs.ksu.edu/~mozes/ceph-osd.44.log

Anyone have any thoughts or workarounds?

--
Adam
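P.S. In case it's useful, here is roughly how I've been restarting the crashed daemons and trying to symbolize the frame offsets myself. This is just a sketch; it assumes the stock CentOS packaging (binary at /usr/bin/ceph-osd, systemd unit ceph-osd@<id>) and that the matching ceph-debuginfo package is installed:

    # restart a crashed daemon (osd.44 in this example)
    systemctl start ceph-osd@44

    # map the offset from frame 1, (()+0xa3c611), to a function and source line
    addr2line -Cfe /usr/bin/ceph-osd 0xa3c611

    # full annotated disassembly, as suggested by the NOTE in the crash dump
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump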