Re: how to debug (in order to repair) damaged MDS (rank)?

On Tue, Oct 10, 2017 at 10:28 AM, Daniel Baumann <daniel.baumann@xxxxxx> wrote:
> Hi all,
>
> unfortunately I'm still struggling to bring cephfs back up after one of
> the MDS ranks has been marked "damaged" (see messages from Monday).
>
> 1. When I mark the rank as "repaired", this is what I get in the monitor
>    log (leaving aside unrelated leveldb compaction chatter):
>
> 2017-10-10 10:51:23.177865 7f3290710700  0 log_channel(audit) log [INF] : from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd='[{"prefix": "mds repaired", "rank": "6"}]': finished
> 2017-10-10 10:51:23.177993 7f3290710700  0 log_channel(cluster) log [DBG] : fsmap cephfs-9/9/9 up  {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
> [...]
>
> 2017-10-10 10:51:23.492040 7f328ab1c700  1 mon.mon1@0(leader).mds e96186 mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
> [...]
>
> 2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> 2. ...and this is what I get on the mds:
>
> 2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro) _decode error from assimilate_prefetch
> 2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover: Error -22 recovering write_pos

This is probably the root cause: somehow the PurgeQueue (one of the
on-disk metadata structures) has become corrupt.

The purge queue objects for rank 6 will all have names starting "506."
in the metadata pool.

This is probably the result of a bug of some kind, so to give us a
chance of working out what went wrong, let's gather some evidence
first (a rough sketch of these steps follows the list):
 A) Do a "rados -p <metadata pool> ls | grep "^506\." or similar, to
get a list of the objects
 B) Write a short bash loop to do a "rados -p <metadata pool> get" on
each of those objects into a file.
 C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20",
mark the rank repaired, start the MDS again, and then gather the
resulting log (it should end in the same "Error -22 recovering
write_pos", but have much much more detail about what came before).

Because you've hit a serious bug, it's really important to gather all
this and share it, so that we can try to fix it and prevent it
happening again to you or others.

Once you've put all that evidence somewhere safe, you can start
intervening to repair it.  The good news is that this is the best part
of your metadata to damage, because all it does is record the list of
deleted files to purge.

You have two options, depending on how much downtime you can tolerate:
 - carefully remove all the metadata objects that start with "506."
(see the sketch after this list) -- this will cause that MDS rank to
completely forget about purging anything in its queue.  This will
leave some orphan data objects in the data pool that will never get
cleaned up (without doing some more offline repair).
 - inspect the detailed logs from step C of the evidence gathering, to
work out exactly how far the journal loading got before hitting
something corrupt.  Then with some finer-grained editing of the
on-disk objects, we can persuade it to skip over the part that was
damaged.
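
If you go with the first option, the removal could look roughly like
this -- again only a sketch, assuming the metadata pool is called
"cephfs_metadata", and to be run only after the backup from step B
above:

    # remove the rank 6 purge queue objects (only after backing them up!)
    rados -p cephfs_metadata ls | grep '^506\.' | while read obj; do
        rados -p cephfs_metadata rm "$obj"
    done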

John

> (see attachment for the full mds log during the "repair" action)
>
>
> I'm really stuck here and would greatly appreciate any help. How can I
> see what is actually going on / what the problem is? Running
> ceph-mon/ceph-mds with higher debug levels just logs "damaged" as quoted
> above, but doesn't say what is wrong or why it's failing.
>
> Would going back to a single MDS with "ceph fs reset" allow me to access
> the data again?
>
>
> Regards,
> Daniel
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


