On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
> As the subject says... any ceph fs administrative command I try to run hangs forever and kills monitors in the background - sometimes they come back, on a couple of occasions I had to manually stop/restart a suffering mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an error and can also kill a monitor. However, clients can mount the filesystem and read/write data without issue.
>
> Relevant excerpt from logs on an affected monitor, just trying to run 'ceph fs ls':
>
> 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 handle_command mon_command({"prefix": "fs ls"} v 0) v1
> 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : from='client.? 10.10.10.1:0/2771553898' entity='client.admin' cmd=[{"prefix": "fs ls"}]: dispatch
> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 13:20:50.727676
> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != pool_name.end())
>
> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55a8ca0bb642]
> 2: (()+0x48165f) [0x55a8c9f4165f]
> 3: (MDSMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0x1d18) [0x55a8ca047688]
> 4: (MDSMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2a8) [0x55a8ca048008]
> 5: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x700) [0x55a8c9f9d1b0]
> 6: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1f93) [0x55a8c9e63193]
> 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa0e) [0x55a8c9e6a52e]
> 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
> 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
> 12: (()+0x76ba) [0x7fc86b3ac6ba]
> 13: (clone()+0x6d) [0x7fc869bd63dd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> I'm running Luminous. The cluster and FS have been in service since Hammer and have default data/metadata pool names. I discovered the issue after attempting to enable directory sharding.

Well that's not good...

The assertion is because your FSMap is referring to a pool that apparently no longer exists in the OSDMap. This should be impossible in current Ceph (we forbid removing pools if they're in use), but could perhaps have been caused in an earlier version of Ceph when it was possible to remove a pool even if CephFS was referring to it?

Alternatively, perhaps something more severe is going on that's causing your mons to see a wrong/inconsistent view of the world. Has the cluster ever been through any traumatic disaster recovery type activity involving hand-editing any of the cluster maps? What intermediate versions has it passed through on the way from Hammer to Luminous?

Opened a ticket here: http://tracker.ceph.com/issues/21568

John

> Rich
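
For anyone reading the trace: the function that trips is OSDMap::get_pool_name(), which just looks a pool id up in the OSDMap's id-to-name map and asserts that the id exists. The following is a simplified sketch inferred from the assert text in the log, not the exact 12.2.0 source:

    // Rough sketch of the lookup that fails above (simplified):
    // pool_name maps pool id -> pool name for every pool the OSDMap knows about.
    const std::string& get_pool_name(int64_t p) const {
      auto i = pool_name.find(p);
      // This is the "FAILED assert(i != pool_name.end())" from the log:
      // it fires when the requested pool id is not present in the OSDMap at all.
      assert(i != pool_name.end());
      return i->second;
    }

"fs ls" resolves each filesystem's data and metadata pool ids to names through this lookup, so a pool id that lingers in the FSMap after the pool is gone from the OSDMap takes the dispatching monitor down as soon as the command is processed.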