Hello all,

I'm having some stability issues with my Ceph cluster at the moment. I'm running CentOS 7 and Ceph 12.2.4, and I have OSDs that are segfaulting regularly, roughly every minute or so. It seems to be getting worse, and I'm now seeing cascading failures. Backtraces look like this:

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa3c611) [0x55cb9249c611]
 2: (()+0xf5e0) [0x7eff83b495e0]
 3: (std::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > >::list(std::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > > const&)+0x3e) [0x55cb9225562e]
 4: (ECBackend::send_all_remaining_reads(hobject_t const&, ECBackend::ReadOp&)+0x33b) [0x55cb92243bab]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1074) [0x55cb92245184]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1af) [0x55cb9224fa2f]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55cb921545f0]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x55cb920c004c]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x55cb91f45f69]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55cb921c2b57]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x55cb91f749de]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x55cb924e1089]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55cb924e3020]
 14: (()+0x7e25) [0x7eff83b41e25]
 15: (clone()+0x6d) [0x7eff82c3534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

When I start a crashed OSD, it seems to trigger the same crash in other OSDs, with the same backtrace. This is making it hard to keep my placement groups up and active.

A full (start to finish) log file is available here: http://people.cs.ksu.edu/~mozes/ceph-osd.44.log

Anyone have any thoughts or workarounds?

--
Adam
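P.S. In case it's useful, here is roughly how I've been restarting the crashed daemons and trying to symbolize the frame offsets myself. This is just a sketch; it assumes the stock CentOS packaging (binary at /usr/bin/ceph-osd, systemd unit ceph-osd@<id>) and that the matching ceph-debuginfo package is installed:

    # restart a crashed daemon (osd.44 in this example)
    systemctl start ceph-osd@44

    # map the offset from frame 1, (()+0xa3c611), to a function and source line
    addr2line -Cfe /usr/bin/ceph-osd 0xa3c611

    # full annotated disassembly, as suggested by the NOTE in the crash dump
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump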