On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
> As the subject says... any ceph fs administrative command I try to run hangs forever and kills monitors in the background - sometimes they come back, on a couple of occasions I had to manually stop/restart a suffering mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an error and can also kill a monitor. However, clients can mount the filesystem and read/write data without issue.
>
> Relevant excerpt from logs on an affected monitor, just trying to run 'ceph fs ls':
>
> 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 handle_command mon_command({"prefix": "fs ls"} v 0) v1
> 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : from='client.? 10.10.10.1:0/2771553898' entity='client.admin' cmd=[{"prefix": "fs ls"}]: dispatch
> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 13:20:50.727676
> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != pool_name.end())
>
> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55a8ca0bb642]
> 2: (()+0x48165f) [0x55a8c9f4165f]
> 3: (MDSMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0x1d18) [0x55a8ca047688]
> 4: (MDSMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2a8) [0x55a8ca048008]
> 5: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x700) [0x55a8c9f9d1b0]
> 6: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1f93) [0x55a8c9e63193]
> 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa0e) [0x55a8c9e6a52e]
> 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
> 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
> 12: (()+0x76ba) [0x7fc86b3ac6ba]
> 13: (clone()+0x6d) [0x7fc869bd63dd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> I'm running Luminous. The cluster and FS have been in service since Hammer and have default data/metadata pool names. I discovered the issue after attempting to enable directory sharding.

Well that's not good...

The assertion is because your FSMap is referring to a pool that apparently no longer exists in the OSDMap. This should be impossible in current Ceph (we forbid removing pools if they're in use), but could perhaps have been caused in an earlier version of Ceph when it was possible to remove a pool even if CephFS was referring to it?

Alternatively, perhaps something more severe is going on that's causing your mons to see a wrong/inconsistent view of the world. Has the cluster ever been through any traumatic disaster recovery type activity involving hand-editing any of the cluster maps? What intermediate versions has it passed through on the way from Hammer to Luminous?

Opened a ticket here: http://tracker.ceph.com/issues/21568

John

> Rich
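
For anyone reading the trace: the function that trips is OSDMap::get_pool_name(), which just looks a pool id up in the OSDMap's id-to-name map and asserts that the id exists. The following is a simplified sketch inferred from the assert text in the log, not the exact 12.2.0 source:

    // Rough sketch of the lookup that fails above (simplified):
    // pool_name maps pool id -> pool name for every pool the OSDMap knows about.
    const std::string& get_pool_name(int64_t p) const {
      auto i = pool_name.find(p);
      // This is the "FAILED assert(i != pool_name.end())" from the log:
      // it fires when the requested pool id is not present in the OSDMap at all.
      assert(i != pool_name.end());
      return i->second;
    }

"fs ls" resolves each filesystem's data and metadata pool ids to names through this lookup, so a pool id that lingers in the FSMap after the pool is gone from the OSDMap takes the dispatching monitor down as soon as the command is processed.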