MDS corruption

☣Adam <adam@xxxxxxxxx> · Thu, 8 Aug 2019 15:31:59 -0500

I had a machine with insufficient memory and it seems to have corrupted
data on my MDS.  The filesystem seems to be working fine, with the
exception of accessing specific files.

The ceph-mds logs include things like:
mds.0.1596621 unhandled write error (2) No such file or directory, force
readonly...
dir 0x1000000fb03 object missing on disk; some files may be lost
(/adam/programming/bash)

I'm using mimic and trying to follow the instructions here:
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

The punchline is this:
cephfs-journal-tool --rank all journal export backup.bin
Error ((22) Invalid argument)
2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.

I have a backup (outside of ceph) of all data which is inaccessible and
I can back up anything which is accessible if need be.  There's some more
information below, but my main question is: what are my next steps?

On a side note, I'd like to get involved with helping with documentation
(man pages, the ceph website, usage text, etc). Where can I get started?

Here's the context:

cephfs-journal-tool event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
argument

Seems like a bug in the documentation since `--rank` is a "mandatory
option" according to the help text.  It looks like the rank of this node
for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
`--rank all` doesn't work either:

ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
MDS_DAMAGE 1 MDSs report damaged metadata
    mdsge.hax0rbana.org(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
    mdsge.hax0rbana.org(mds.0): MDS in read-only mode
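
For anyone who wants to pull the rank out of that output programmatically
rather than by eye, here is a minimal sketch. It assumes the daemon lines
always look like the two above (`<name>(mds.<rank>): <message>`); the
function name `mds_ranks` and the regex are my own illustration, not part
of any Ceph tooling.

```python
import re

# Matches indented daemon lines from `ceph health detail`, for example:
#   "    mdsge.hax0rbana.org(mds.0): Metadata damage detected"
# The line format is assumed from the sample output in this post.
MDS_LINE = re.compile(
    r"^\s+(?P<daemon>\S+)\(mds\.(?P<rank>\d+)\):\s*(?P<msg>.*)$"
)


def mds_ranks(health_detail: str) -> dict:
    """Map each MDS rank to the set of messages reported against it."""
    ranks = {}
    for line in health_detail.splitlines():
        match = MDS_LINE.match(line)
        if match:
            rank = int(match.group("rank"))
            ranks.setdefault(rank, set()).add(match.group("msg"))
    return ranks


# Sample input copied from the `ceph health detail` output above.
SAMPLE = """\
HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
MDS_DAMAGE 1 MDSs report damaged metadata
    mdsge.hax0rbana.org(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
    mdsge.hax0rbana.org(mds.0): MDS in read-only mode
"""

if __name__ == "__main__":
    for rank, messages in sorted(mds_ranks(SAMPLE).items()):
        print(rank, sorted(messages))
```

On the sample this reports a single rank, 0, with both messages attached,
which is how I arrived at rank 0 above.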

cephfs-journal-tool --rank 0 event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.

The only place I've found this error message is in an unanswered
Stack Overflow question and in the source code here:
https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114

It looks like that is trying to read a filesystem map (fsmap), which
might be corrupted.  Running `rados export` prints part of the help text
and then segfaults, which is rather concerning.  This is 100% repeatable
(outside of gdb, details below).  I tried `rados df` and that worked
fine, so it's not all rados commands which are having this problem.
However, I tried `rados bench 60 seq` and that also printed out the
usage text and then segfaulted.
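
To check which commands crash without reading the output each time, a
small sketch along these lines could be used. It assumes a POSIX system
and that the process still terminates from SIGSEGV after printing the
"Caught signal" message (I have not verified that for every rados
subcommand); `died_of_segfault` is a hypothetical helper name.

```python
import signal
import subprocess
import sys


def died_of_segfault(argv):
    """Run argv and report whether it was killed by SIGSEGV.

    On POSIX, subprocess reports death-by-signal as a negative
    return code equal to minus the signal number.
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == -signal.SIGSEGV


if __name__ == "__main__":
    # Pass the command to test on the command line, e.g.:
    #   python3 segv_check.py rados df
    command = sys.argv[1:]
    if command:
        print(" ".join(command), "->", died_of_segfault(command))
```

Running it against `rados df`, `rados export` and `rados bench 60 seq`
in turn would give a quick yes/no for each.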

Info on the `rados export` crash:
rados export
usage: rados [options] [commands]
POOL COMMANDS
<snip>
IMPORT AND EXPORT
   export [filename]
       Serialize pool contents to a file or standard out.
<snip>
OMAP OPTIONS:
    --omap-key-file file            read the omap key from a file
*** Caught signal (Segmentation fault) **
 in thread 7fcb6bfff700 thread_name:fn_anonymous

When running it in gdb:
(gdb) bt
#0  0x00007fffef07331f in std::_Rb_tree<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >,
std::pair<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const, std::map<int, boost::variant<boost::blank,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, unsigned long, long, double, bool,
entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
Option::size_t, uuid_d>, std::less<int>, std::allocator<std::pair<int
const, boost::variant<boost::blank, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >, unsigned long, long,
double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
1l> >, Option::size_t, uuid_d> > > > >,
std::_Select1st<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, std::map<int,
boost::variant<boost::blank, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >, unsigned long, long,
double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
1l> >, Option::size_t, uuid_d>, std::less<int>,
std::allocator<std::pair<int const, boost::variant<boost::blank,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, unsigned long, long, double, bool,
entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
Option::size_t, uuid_d> > > > > >,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, std::map<int,
boost::variant<boost::blank, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >, unsigned long, long,
double, bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l,
1l> >, Option::size_t, uuid_d>, std::less<int>,
std::allocator<std::pair<int const, boost::variant<boost::blank,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, unsigned long, long, double, bool,
entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
Option::size_t, uuid_d> > > > > >
>::find(std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&) const () from
/usr/lib/ceph/libceph-common.so.0
Backtrace stopped: Cannot access memory at address 0x7fffd9ff89f8

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com