Preface: as I mentioned on your other thread about repairing a filesystem,
these tools are not finished yet, and probably won't be comprehensively
documented until they all fit together. So my answers here are for
interest, not an encouragement to treat these tools as "ready to go".

On Tue, Sep 15, 2015 at 8:31 AM, Goncalo Borges
<goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
> 1. The cluster loses more OSDs than the number of configured replicas.
> 2. There is loss of all objects for specific PGs in the data pool.
> 3. There is loss of all objects for specific PGs in the metadata pool.
> 4. The cluster has been recovered mostly by deleting the problematic OSDs
> from the crush map, and by recreating stale PGs.
> 5. MDS has been restarted and CephFS remounted.
>
> My questions are:
>
> a./ If the MDS journal can be replayed, it may be able to recreate some
> of the lost metadata, if it still maintains relevant information in the
> log. Is this correct?

Yes. If metadata is played back from the journal, but is not present in
the metadata pool, then you have a chance to recover from that by letting
the "forward scrub" mechanism detect the missing metadata and repair it.
That mechanism doesn't fully exist yet.

If the journal is not replayable, or the MDS won't start for some other
reason, then the "cephfs-journal-tool recover_dentries" command will do
its best to recover what metadata it can from the journal and inject it
directly into the metadata pool while the MDS is offline.

> b./ If all the metadata for given files is lost, but the files themselves
> have all their objects intact, would we be able to mount cephfs? If yes,
> how would those files appear in the filesystem? With '??? ??? ???' for
> the attributes? And in the same location as before?

If all metadata for files is lost, they will not be visible in client
mounts at all. You will only be aware of their existence if you go and
list the objects in the data pool using rados tools.

Creating new metadata for these files (so that they are once again
accessible) is what "cephfs-data-scan scan_inodes" is for. It enumerates
the objects in the data pool and uses their backtraces to guess where to
insert metadata in the filesystem. If it can't make a guess, it puts them
in lost+found/.

Things like permissions are not recoverable in this kind of scenario:
they're reset to some sensible defaults. The purpose of this mode is
primarily to allow administrators to recover the data in the files -- we
would expect administrators to be inspecting the results and applying
their own idea of what the permissions should be after the disaster
recovery has happened.

> c./ In the situation reported in b./, what would be the proper steps to
> start injecting metadata for the orphan files? Please correct me if I am
> wrong, but I am assuming
>
>  - cephfs-table-tool 0 reset session
>  - cephfs-table-tool 0 reset snap
>  - cephfs-table-tool 0 reset inode
>  - cephfs-journal-tool --rank=0 journal reset
>  - cephfs-data-scan init
>  - cephfs-data-scan scan_extents <data pool>
>  - cephfs-data-scan scan_inodes <data pool>

Yep, pretty much. But I want to emphasize that this is not a
one-size-fits-all procedure, it is not necessarily safe, and it does not
necessarily result in an improvement. Disaster recovery is meant to be
done by identifying what is broken, and cautiously making the minimum
necessary interventions to bring the FS back online. For example,
currently when you do this you are resetting the set of free inodes, which
you probably didn't want to do unless you really had lost the inode table.
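To make the "identify what's broken, then repair minimally" point a bit
more concrete, here is roughly the order in which I would look at things
before running any of the destructive steps above. Treat it as a sketch,
not a recipe: the MDS should be stopped first, "backup.bin" and the
"cephfs_data" pool name are placeholders, and the exact subcommand
spellings may differ between releases, so check each tool's help output on
your version.

  # Look at the journal before deciding that anything needs resetting
  cephfs-journal-tool journal inspect

  # Keep a copy of the journal before any destructive step
  cephfs-journal-tool journal export backup.bin

  # Salvage what dentries we can from the journal into the metadata pool
  cephfs-journal-tool event recover_dentries summary

  # Only if the journal really is unrecoverable:
  cephfs-journal-tool journal reset

  # Reset only the tables that are actually damaged, e.g. just sessions:
  cephfs-table-tool 0 reset session

  # Rebuild metadata for orphaned objects in the data pool
  cephfs-data-scan init
  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes cephfs_data

The first three steps don't throw anything away (recover_dentries writes
salvaged entries into the metadata pool but leaves the journal alone),
whereas "journal reset" and the table resets do discard information, which
is why they come last and only once you have convinced yourself they are
needed.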
> d./ In step c./, what is the exact difference between 'cephfs-table-tool
> 0 reset session', 'cephfs-table-tool 0 reset snap' and 'cephfs-table-tool
> 0 reset inode'? Are there situations where we do not want to use the 3
> cephfs-table-tool reset commands but just one or two?

These are resetting separate metadata structures. It's entirely possible
that one has a problem while the others don't. For example, there might be
a bogus client session in there, but that doesn't mean you would want to
reset the inode table (which stores what inodes are available).

I can sort of see where you're going with this -- couldn't we have a
fewer-steps procedure? Yes, but it would have to be something much smarter
that identified faults and did the minimum amount of repair, rather than a
nuclear "reset everything" hammer that tried to do everything in one go.
In the early days of these tools, they are going to remain pretty low
level and require a lot of expertise to use. In support jargon, these
tools are meant for level 3 -- expert (probably developer) intervention in
rare cases.

> e./ In what circumstances would we do a reset of the filesystem with
> 'ceph fs reset cephfs --yes-i-really-mean-it'?

This is resetting the MDSMap. You might do this if, for example, your map
indicated that there should be several MDS ranks, but you know the
metadata for those ranks is trashed, and you want to recover the
filesystem within a single rank, rather than trying to rebuild consistent
metadata for a multi-MDS state.

Cheers,
John
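P.S. For anyone who finds this thread later: the single-rank recovery in
that last answer would look roughly like the sketch below. It is
illustrative only -- "cephfs" is just a placeholder filesystem name, and
you should be confident the extra ranks really are unrecoverable before
resetting the map.

  # See what the current MDS map claims before changing anything
  ceph mds dump

  # With all MDS daemons stopped, collapse the filesystem to a single rank
  ceph fs reset cephfs --yes-i-really-mean-it

  # Start one MDS again and check that rank 0 comes back up
  ceph mds stat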