Help. I have a 60-node cluster, and most of the OSDs decided to crash themselves at the same time. They won't restart; the crash messages all look like this:
--- begin dump of recent events ---
0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) **
in thread 7f57ab5b7d80 thread_name:ceph-osd
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (()+0xa3c611) [0x556d618bb611]
2: (()+0xf6d0) [0x7f57a885e6d0]
3: (gsignal()+0x37) [0x7f57a787f277]
4: (abort()+0x148) [0x7f57a7880968]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x556d618fa6e4]
6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x556d615c74a2]
7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x556d615ae6c0]
8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
9: (OSD::load_pgs()+0x545) [0x556d61373095]
10: (OSD::init()+0x2169) [0x556d613919d9]
11: (main()+0x2d07) [0x556d61295dd7]
12: (__libc_start_main()+0xf5) [0x7f57a786b445]
13: (()+0x4b53e3) [0x556d613343e3]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
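For what it's worth, I can reproduce the abort on demand by starting one of the dead OSDs in the foreground with debugging turned up, which captures a lot more context around the failed assert. Roughly what I'm running (osd.12 is just an example id, and this assumes a standard package install with systemd units named ceph-osd@<id> and OSD data owned by the ceph user):

  # stop the systemd unit first so the foreground run doesn't collide with it
  systemctl stop ceph-osd@12
  # run the OSD in the foreground with verbose OSD and messenger logging
  sudo -u ceph ceph-osd -f -i 12 --debug-osd 20 --debug-ms 1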
Some hosts have no working OSDs, others seem to have 1 working and 2 dead. The failures are spread all across the cluster, across several different racks. Any ideas on where to look next? The cluster is dead in the water right now.
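For reference, this is roughly how I'm tallying the damage from one of the mon hosts (again, osd.12 is just an example id, and I'm assuming the standard ceph-osd@<id> systemd units):

  # quick overall cluster state
  ceph -s
  # which OSDs are marked down, and where they sit in the CRUSH tree
  ceph osd tree | grep down
  # pull the most recent crash log from one dead OSD's unit
  journalctl -u ceph-osd@12 --no-pager | tail -n 200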