The nautilus version (14.2.2) of ‘cephfs-data-scan scan_links’ can fix the
snap table, so hopefully it will fix your issue. You don't need to upgrade
the whole cluster: just install nautilus on a temporary machine, or compile
Ceph from source.

On Tue, Aug 13, 2019 at 2:35 PM Adam <adam@xxxxxxxxx> wrote:
>
> Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
> following steps from the disaster recovery page: journal export, dentry
> recovery, journal truncation, mds table wipes (session, snap and inode),
> scan_extents, scan_inodes, scan_links, and cleanup.
>
> Now all three of my MDS servers are crashing due to a failed assert.
> Logs with stacktrace are included (the other two servers have the same
> stacktrace in their logs).
>
> Currently I can't mount cephfs (which makes sense, since there aren't any
> MDS services up for more than a few minutes before they crash). Any
> suggestions on next steps to troubleshoot/fix this?
>
> Hopefully there's some way to recover from this and I don't have to tell
> my users that I lost all the data and we need to go back to the backups.
> It shouldn't be a huge problem if we do, but it'll cost a lot of
> confidence in Ceph and its ability to keep data safe.
>
> Thanks,
> Adam
>
> On 8/8/19 3:31 PM, Adam wrote:
> > I had a machine with insufficient memory and it seems to have corrupted
> > the data on my MDS. The filesystem seems to be working fine, with the
> > exception of accessing specific files.
> >
> > The ceph-mds logs include things like:
> > mds.0.1596621 unhandled write error (2) No such file or directory, force
> > readonly...
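For the temp-machine route, roughly something like the following (a sketch,
not exact commands: the download.ceph.com repo layout is real, but the
distro codename, package split, and keyring path are assumptions you should
adjust for your setup):

```shell
# On a spare machine (or container) that can reach the cluster network,
# install only the nautilus tools; the cluster itself stays on mimic.
wget -q -O- https://download.ceph.com/keys/release.asc | sudo apt-key add -
echo "deb https://download.ceph.com/debian-nautilus/ bionic main" | \
    sudo tee /etc/apt/sources.list.d/ceph-nautilus.list
sudo apt update
sudo apt install -y ceph-common ceph-base  # ceph-base should ship cephfs-data-scan

# Copy /etc/ceph/ceph.conf and an admin keyring over from a cluster node,
# then run the nautilus scan_links against the existing filesystem:
cephfs-data-scan scan_links
```

As I understand it, scan_links only needs RADOS access to the metadata
pool, which is why a one-off nautilus client install is enough here.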
> > dir 0x1000000fb03 object missing on disk; some files may be lost
> > (/adam/programming/bash)
> >
> > I'm using mimic and trying to follow the instructions here:
> > https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> >
> > The punchline is this:
> > cephfs-journal-tool --rank all journal export backup.bin
> > Error ((22) Invalid argument)
> > 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
> >
> > I have a backup (outside of ceph) of all the data which is inaccessible,
> > and I can back up anything which is accessible if need be. There's some
> > more information below, but my main question is: what are my next steps?
> >
> > On a side note, I'd like to get involved with helping with documentation
> > (man pages, the ceph website, usage text, etc.). Where can I get started?
> >
> > Here's the context:
> >
> > cephfs-journal-tool event recover_dentries summary
> > Error ((22) Invalid argument)
> > 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
> > argument
> >
> > This seems like a bug in the documentation, since `--rank` is a "mandatory
> > option" according to the help text. It looks like the rank of this node
> > for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
> > `--rank all` doesn't work either:
> >
> > ceph health detail
> > HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
> > MDS_DAMAGE 1 MDSs report damaged metadata
> >     mdsge.hax0rbana.org(mds.0): Metadata damage detected
> > MDS_READ_ONLY 1 MDSs are read only
> >     mdsge.hax0rbana.org(mds.0): MDS in read-only mode
> >
> > cephfs-journal-tool --rank 0 event recover_dentries summary
> > Error ((22) Invalid argument)
> > 2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
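On the `--rank` errors quoted above: the tool wants the rank qualified with
the filesystem name, i.e. `<fsname>:<rank>` or `<fsname>:all`, which is why
bare `0` and `all` were rejected. A sketch (`myfs` is a placeholder; check
the real name first):

```shell
# Find the filesystem name:
ceph fs ls

# Then qualify the rank with it:
cephfs-journal-tool --rank=myfs:0 journal export backup.bin
cephfs-journal-tool --rank=myfs:all event recover_dentries summary
```

This matches the `--rank=yourfsname:all` form mentioned at the top of the
thread.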
> >
> > The only place I've found this error message is in an unanswered
> > stackoverflow question and in the source code here:
> > https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114
> >
> > It looks like that is trying to read a filesystem map (fsmap), which
> > might be corrupted. Running `rados export` prints part of the help text
> > and then segfaults, which is rather concerning. This is 100% repeatable
> > (outside of gdb; details below). I tried `rados df` and that worked
> > fine, so it's not all rados commands which are having this problem.
> > However, I tried `rados bench 60 seq` and that also printed out the
> > usage text and then segfaulted.
> >
> > Info on the `rados export` crash:
> > rados export
> > usage: rados [options] [commands]
> > POOL COMMANDS
> > <snip>
> > IMPORT AND EXPORT
> >    export [filename]
> >        Serialize pool contents to a file or standard out.
> > <snip>
> > OMAP OPTIONS:
> >    --omap-key-file file      read the omap key from a file
> > *** Caught signal (Segmentation fault) **
> >  in thread 7fcb6bfff700 thread_name:fn_anonymous
> >
> > When running it in gdb:
> > (gdb) bt
> > #0 0x00007fffef07331f in std::_Rb_tree<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >,
> > std::pair<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const, std::map<int, boost::variant<boost::blank,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >, unsigned long, long, double, bool,
> > entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
> > Option::size_t, uuid_d>, std::less<int>, std::allocator<std::pair<int
> > const, boost::variant<boost::blank, std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >, unsigned long, long,
> > double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
> > 1l> >, Option::size_t, uuid_d> > > > >,
> > std::_Select1st<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const, std::map<int,
> > boost::variant<boost::blank, std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >, unsigned long, long,
> > double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
> > 1l> >, Option::size_t, uuid_d>, std::less<int>,
> > std::allocator<std::pair<int const, boost::variant<boost::blank,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >, unsigned long, long, double, bool,
> > entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
> > Option::size_t, uuid_d> > > > >,
> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > >,
> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const, std::map<int,
> > boost::variant<boost::blank, std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >, unsigned long, long,
> > double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
> > 1l> >, Option::size_t, uuid_d>, std::less<int>,
> > std::allocator<std::pair<int const, boost::variant<boost::blank,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >, unsigned long, long, double, bool,
> > entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
> > Option::size_t, uuid_d> > > > > >
> > >> ::find(std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > const&) const () from
> > /usr/lib/ceph/libceph-common.so.0
> > Backtrace stopped: Cannot access memory at address 0x7fffd9ff89f8
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
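P.S. One aside on the `rados export` crash quoted above: `export` operates
on a single pool, so it normally wants a `-p <pool>` argument; printing the
usage text and then segfaulting, instead of a clean "pool not specified"
error, does look like a client-side bug worth reporting. A sketch with a
placeholder pool name:

```shell
# List pools, then export one of them to a local file:
rados lspools
rados -p cephfs_data export cephfs_data.bin
```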