On Thu, Sep 28, 2017 at 11:51 AM, Richard Hesketh <richard.hesketh@xxxxxxxxxxxx> wrote:
> On 27/09/17 19:35, John Spray wrote:
>> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh
>> <richard.hesketh@xxxxxxxxxxxx> wrote:
>>> On 27/09/17 12:32, John Spray wrote:
>>>> On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
>>>> <richard.hesketh@xxxxxxxxxxxx> wrote:
>>>>> As the subject says... any ceph fs administrative command I try to run hangs forever and kills monitors in the background - sometimes they come back, on a couple of occasions I had to manually stop/restart a suffering mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an error and can also kill a monitor. However, clients can mount the filesystem and read/write data without issue.
>>>>>
>>>>> Relevant excerpt from logs on an affected monitor, just trying to run 'ceph fs ls':
>>>>>
>>>>> 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 handle_command mon_command({"prefix": "fs ls"} v 0) v1
>>>>> 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : from='client.? 10.10.10.1:0/2771553898' entity='client.admin' cmd=[{"prefix": "fs ls"}]: dispatch
>>>>> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 13:20:50.727676
>>>>> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != pool_name.end())
>>>>>
>>>>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55a8ca0bb642]
>>>>> 2: (()+0x48165f) [0x55a8c9f4165f]
>>>>> 3: (MDSMonitor::preprocess_command(boost::intrusive_ptr<MonOpRequest>)+0x1d18) [0x55a8ca047688]
>>>>> 4: (MDSMonitor::preprocess_query(boost::intrusive_ptr<MonOpRequest>)+0x2a8) [0x55a8ca048008]
>>>>> 5: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x700) [0x55a8c9f9d1b0]
>>>>> 6: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1f93) [0x55a8c9e63193]
>>>>> 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa0e) [0x55a8c9e6a52e]
>>>>> 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
>>>>> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
>>>>> 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
>>>>> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
>>>>> 12: (()+0x76ba) [0x7fc86b3ac6ba]
>>>>> 13: (clone()+0x6d) [0x7fc869bd63dd]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> I'm running Luminous. The cluster and FS have been in service since Hammer and have default data/metadata pool names. I discovered the issue after attempting to enable directory sharding.
>>>>
>>>> Well that's not good...
>>>>
>>>> The assertion is because your FSMap is referring to a pool that
>>>> apparently no longer exists in the OSDMap.
>>>> This should be impossible
>>>> in current Ceph (we forbid removing pools if they're in use), but
>>>> could perhaps have been caused in an earlier version of Ceph when it
>>>> was possible to remove a pool even if CephFS was referring to it?
>>>>
>>>> Alternatively, perhaps something more severe is going on that's
>>>> causing your mons to see a wrong/inconsistent view of the world. Has
>>>> the cluster ever been through any traumatic disaster recovery type
>>>> activity involving hand-editing any of the cluster maps? What
>>>> intermediate versions has it passed through on the way from Hammer to
>>>> Luminous?
>>>>
>>>> Opened a ticket here: http://tracker.ceph.com/issues/21568
>>>>
>>>> John
>>>
>>> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually inherited this cluster from a colleague who left shortly after I joined, so unfortunately there is some of its history I cannot fill in.
>>>
>>> Turns out the cluster actually predates Firefly. Looking at dates my suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I inherited it at Hammer, and took it Hammer -> Infernalis -> Jewel -> Luminous myself. I know I did make sure to do the tmap_upgrade step on cephfs but can't remember if I did it at Infernalis or Jewel.
>>>
>>> Infernalis was a tricky upgrade; the attempt was aborted once after the first set of OSDs didn't come back up after upgrade (had to remove/downgrade and readd), and setting sortbitwise as the documentation suggested after a successful second attempt caused everything to break and degrade slowly until it was unset and recovered. Never had disaster recovery involve mucking around with the pools while I was administrating it, but unfortunately I cannot speak for the cluster's pre-Hammer history. The only pools I have removed were ones I created temporarily for testing crush rules/benchmarking.
>>
>> OK, so it sounds like a cluster with an interesting history and some
>> stories to tell :-)
>>
>>> I have hand-edited the crush map (extract, decompile, modify, recompile, inject) at times because I found it more convenient for creating new crush rules than using the CLI tools, but not the OSD map.
>>>
>>> Why would the cephfs have been referring to other pools?
>>
>> A filesystem can have more than one data pool, they're added at
>> runtime with add_data_pool/rm_data_pool commands. In old versions of
>> the code, someone could add a data pool, then delete the pool, and
>> forget to do rm_data_pool.
>>
>> So, next step is to try and actually get the FSMap out of the
>> monitor's store to see if that's really what's happening --
>> unfortunately when checking how to do that I realise we missed
>> updating the human readable output of ceph-monstore-tool when adding
>> multi-filesystem support... so here's how to get it out in binary form
>> and then decode it separately:
>>
>> ceph-monstore-tool /var/lib/ceph/<wherever...> get mdsmap > fsmap.bin
>> ceph-dencoder import fsmap.bin type FSMap decode dump_json
>>
>> If the simple theory is correct then you'll see something referenced
>> in one of the pool/pools fields that doesn't really exist.
>>
>> John
>
> After some effort working out which package provides ceph-monstore-tool in ubuntu xenial (turns out it's ceph-test, but I now have every ceph*[-dbg] package installed on my admin node)... you are indeed correct:
>
> root@vm-ds-01:~/ceph-conf# systemctl stop ceph-mon.target
> root@vm-ds-01:~/ceph-conf# ceph-monstore-tool /var/lib/ceph/mon/ceph-vm-ds-01/ get mdsmap > fsmap.bin
> root@vm-ds-01:~/ceph-conf# systemctl start ceph-mon.target
> root@vm-ds-01:~/ceph-conf# ceph-dencoder import fsmap.bin type FSMap decode dump_json
> ...
> "data_pools": [
>     0,
>     5,
>     7,
>     8
> ],
> ...
> root@vm-ds-01:~/ceph-conf# ceph osd lspools
> 0 data,1 metadata,8 one,22 ssd,24 .rgw.root,25 default.rgw.control,26 default.rgw.data.root,27 default.rgw.gc,28 default.rgw.log,29 default.rgw.users.uid,30 default.rgw.users.keys,31 default.rgw.buckets.index,32 default.rgw.buckets.non-ec,33 default.rgw.buckets.data,
>
> The large gaps in the pool numbers are a good sign someone did a lot of experimenting. Turns out the fs does reference pools that don't exist, plus one that does but shouldn't be in the fs. A quick application of `ceph fs rm_data_pool` and ceph fs commands now return without issue:
>
> root@vm-ds-01:~/ceph-conf# ceph fs rm_data_pool cephfs one
> removed data pool 8 from fsmap
> root@vm-ds-01:~/ceph-conf# ceph fs rm_data_pool cephfs 5
> removed data pool 5 from fsmap
> root@vm-ds-01:~/ceph-conf# ceph fs rm_data_pool cephfs 7
> removed data pool 7 from fsmap
> root@vm-ds-01:~/ceph-conf# ceph fs ls
> name: cephfs, metadata pool: metadata, data pools: [data ]
>
> And I can finally do what I was trying to do three days ago which prompted all this:
>
> root@vm-ds-01:~/ceph-conf# ceph fs set cephfs allow_dirfrags 1
> enabled directory fragmentation

Fantastic, have fun with your huge directories :-D

John

> Thanks!
> Rich
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
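
For anyone hitting the same assert: the manual comparison done in this thread (pool IDs in the decoded FSMap versus the output of `ceph osd lspools`) can be scripted. The following is only a rough sketch, not an official tool. It assumes the comma-separated `lspools` format shown in this thread and that the `ceph-dencoder ... dump_json` output contains one or more `data_pools` lists somewhere in the JSON. The function names and the `SAMPLE_FSMAP` structure are hypothetical, since the thread elides the rest of the dump; the sample pool list is also shortened.

```python
import json


def parse_lspools(text):
    """Parse 'ceph osd lspools' output such as '0 data,1 metadata,8 one,'."""
    pools = {}
    for entry in text.strip().split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the trailing comma
        pool_id, name = entry.split(" ", 1)
        pools[int(pool_id)] = name
    return pools


def find_data_pools(obj):
    """Recursively collect the contents of every 'data_pools' list."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "data_pools" and isinstance(value, list):
                found.extend(value)
            else:
                found.extend(find_data_pools(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_data_pools(item))
    return found


def dangling_pools(fsmap_json, lspools_text):
    """Return pool IDs referenced by the FSMap that the OSD map does not list."""
    existing = parse_lspools(lspools_text)
    referenced = find_data_pools(json.loads(fsmap_json))
    return sorted(p for p in set(referenced) if p not in existing)


# Hypothetical minimal FSMap dump; only the data_pools values come from the thread.
SAMPLE_FSMAP = '{"filesystems": [{"mdsmap": {"data_pools": [0, 5, 7, 8]}}]}'
# Shortened copy of the lspools output quoted in the thread.
SAMPLE_LSPOOLS = "0 data,1 metadata,8 one,22 ssd,24 .rgw.root,"

if __name__ == "__main__":
    print(dangling_pools(SAMPLE_FSMAP, SAMPLE_LSPOOLS))  # prints [5, 7]
```

Note that this only catches references to pools that no longer exist (5 and 7 here). Pool 8 ("one") exists, so deciding that it does not belong in the filesystem still needs a human.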