A colleague asked me to take a look at a Ceph cluster that has stopped working. The "ceph -s" command (in fact, any ceph command) just times out. Two of the three monitors are crashing with:

(gdb) bt
#0  0x00007fffee17b7bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffee166535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fffee530983 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007fffee5368c6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007fffee536901 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007fffee536b34 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x555557d2be50, tinfo=0x7fffee61a9b8 <typeinfo for std::out_of_range>, dest=0x7fffee54b680 <std::out_of_range::~out_of_range()>) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00007fffee53286b in std::__throw_out_of_range (__s=__s@entry=0x5555563cce7f "map::at") from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000555555f025be in std::map<int, mds_gid_t, std::less<int>, std::allocator<std::pair<int const, mds_gid_t> > >::at (__k=<synthetic pointer>: <optimized out>, this=0x555557dc7a50) at ./src/mon/MDSMonitor.cc:2010
#8  MDSMap::get_info (m=2, this=0x555557dc7818) at ./src/mds/MDSMap.h:448
#9  MDSMonitor::maybe_resize_cluster (this=0x5555580f4580, fsmap=..., fscid=<optimized out>) at ./src/mon/MDSMonitor.cc:2006
#10 0x0000555555f04d51 in MDSMonitor::tick (this=0x5555580f4580) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#11 0x0000555555eef2f8 in MDSMonitor::on_active (this=0x5555580f4580) at ./src/mon/MDSMonitor.cc:913
#12 0x0000555555e3213d in PaxosService::_active (this=0x5555580f4580) at ./src/mon/PaxosService.cc:356
#13 0x0000555555d29539 in Context::complete (this=0x555556f2d450, r=<optimized out>) at ./src/include/Context.h:99
#14 0x0000555555d54e2d in finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > > (cct=0x555556fe2000,
Python Exception <class 'ValueError'> Cannot find type class std::__cxx11::list<Context*, std::allocator<Context*> >::_Node:
    finished=empty std::__cxx11::list, result=result@entry=0) at /usr/include/c++/8/bits/stl_list.h:971
#15 0x0000555555e296d9 in Paxos::finish_round (this=0x555556fceb00) at ./src/mon/Paxos.cc:1083
#16 0x0000555555e2a7f0 in Paxos::handle_last (this=0x555556fceb00, op=...) at ./src/mon/Paxos.cc:589
#17 0x0000555555e2b42f in Paxos::dispatch (this=0x555556fceb00, op=...) at /usr/include/c++/8/bits/atomic_base.h:295
#18 0x0000555555d27413 in Monitor::dispatch_op (this=0x55555810da00, op=...) at /usr/include/c++/8/bits/atomic_base.h:295
#19 0x0000555555d27cce in Monitor::_ms_dispatch (this=0x55555810da00, m=0x55555816c000) at /usr/include/c++/8/bits/atomic_base.h:295
#20 0x0000555555d55a48 in Monitor::ms_dispatch (m=0x55555816c000, this=0x55555810da00) at ./src/mon/Monitor.h:937
#21 Dispatcher::ms_dispatch2 (this=0x55555810da00, m=...) at ./src/msg/Dispatcher.h:124
#22 0x00007fffef783932 in DispatchQueue::entry() () from /usr/lib/ceph/libceph-common.so.2
#23 0x00007fffef82f5dd in DispatchQueue::DispatchThread::entry() () from /usr/lib/ceph/libceph-common.so.2
#24 0x00007fffee68dfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#25 0x00007fffee23d4cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
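For clarity, frames #6-#8 show std::map::at() throwing std::out_of_range for a missing key, and nothing in the monitor catches it, so it propagates to std::terminate()/abort() (frames #0-#5). A minimal standalone illustration of that failure mode (my own sketch, not Ceph code):

#include <cstdint>
#include <iostream>
#include <map>
#include <stdexcept>

int main() {
    // Shaped like MDSMap::up (rank -> gid), and empty, as in the
    // gdb dump further down.
    std::map<int, std::uint64_t> up;

    try {
        // Key 2 is absent, so at() throws std::out_of_range instead
        // of inserting a default value the way operator[] would.
        std::uint64_t gid = up.at(2);
        std::cout << gid << "\n";
    } catch (const std::out_of_range &e) {
        // e.what() is "map::at", the same string passed to
        // std::__throw_out_of_range in frame #6.
        std::cout << "caught: " << e.what() << "\n";
    }
    // ceph-mon has no handler on this path, so there the exception
    // escapes to std::terminate() and abort() (frames #0-#5).
    return 0;
}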
From this I can tell that at this line of code:

https://github.com/ceph/ceph/blob/pacific/src/mon/MDSMonitor.cc#L1971

in == 3, and thus target == 2 at:

https://github.com/ceph/ceph/blob/pacific/src/mon/MDSMonitor.cc#L2006

which is out of bounds of mds_map. gdb tells me that:

mds_map = @0x555557dc7818: {
  compat = {compat = {mask = 1, names = std::map with 0 elements},
            ro_compat = {mask = 1, names = std::map with 0 elements},
            incompat = {mask = 1919, names = std::map with 9 elements = {
              [1] = "base v0.20", [2] = "client writeable ranges",
              [3] = "default file layouts on dirs",
              [4] = "dir inode in separate object",
              [5] = "mds uses versioned encoding",
              [6] = "dirfrag is stored in omap",
              [8] = "no anchor table", [9] = "file layout v2",
              [10] = "snaprealm v2"}}},
  epoch = 68021, enabled = true, fs_name = "prod", flags = 18,
  last_failure = 0, last_failure_osd_epoch = 9953,
  created = {tv = {tv_sec = 1630197046, tv_nsec = 944134388}},
  modified = {tv = {tv_sec = 1633835801, tv_nsec = 859392387}},
  tableserver = 0, root = 0, session_timeout = 60, session_autoclose = 300,
  max_file_size = 1099511627776,
  required_client_features = {static bits_per_block = 64, _vec = std::vector of length 0, capacity 0},
  data_pools = std::vector of length 1, capacity 1 = {5},
  cas_pool = -1, metadata_pool = 4,
  max_mds = 1, old_max_mds = 1, standby_count_wanted = 1, balancer = "",
  in = std::set with 3 elements = {[0] = 0, [1] = 1, [2] = 2},
  failed = std::set with 0 elements,
  stopped = std::set with 1 element = {[0] = 3},
  damaged = std::set with 0 elements,
  up = std::map with 0 elements,
  mds_info = std::map with 0 elements,
  ever_allowed_features = 32 ' ', explicitly_allowed_features = 32 ' ',
  inline_data_enabled = false, cached_up_features = 0}

Note that in contains three ranks ({0, 1, 2}) while up is an empty map, so rank 2 is "in" but has no "up" entry, and the up.at(2) lookup inside MDSMap::get_info() (frame #8) is what throws.

I observed the metadata servers using strace: they are all in a loop trying to contact the monitors (which are not running) and receiving ECONNREFUSED.

I am not sure what I need to change to get the monitors to start. I did find this thread, which mentions maybe_resize_cluster(), but I am unsure whether it is relevant:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/KQ5A5OWRIUEOJBC7VILBGDIKPQGJQIWN/

(A simplified sketch of how I read the failing code path follows below my signoff.)

Thank you for any help you may be able to provide.
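P.S. In case it helps anyone frame an answer, below is my simplified reading of the failing code path, with the values from the gdb dump substituted in. The structure is paraphrased by me from pacific MDSMonitor.cc / MDSMap.h; it is a standalone illustration, not the actual Ceph code or a proposed patch. Running it aborts the same way the monitors do.

#include <cstdint>
#include <map>
#include <set>

int main() {
    // State reconstructed from the mds_map dump above.
    std::set<int> in = {0, 1, 2};     // three ranks are "in"...
    std::map<int, std::uint64_t> up;  // ...but none of them are "up"
    const int max_mds = 1;

    // maybe_resize_cluster(): more ranks "in" than max_mds allows, so
    // it tries to shrink the cluster by stopping the highest rank.
    if (static_cast<int>(in.size()) > max_mds) {
        int target = static_cast<int>(in.size()) - 1;  // target == 2
        // get_info(target) resolves rank -> gid via up.at(target);
        // with up empty this throws, and the uncaught exception aborts
        // the process exactly as in the backtrace:
        //   terminate called after throwing an instance of
        //   'std::out_of_range'  what():  map::at
        std::uint64_t gid = up.at(target);
        static_cast<void>(gid);  // never reached
    }
    return 0;
}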