We had a hangout on Friday and made some progress. The problem can be
roughly decomposed into two types of checks: a forward, depth-first check
that scans the namespace and verifies everything is valid and reachable,
and a (slow) reverse iteration over all directory (or even file) objects
that verifies that everything is reachable via its backtrace.

Forward scan:

- depth-first search of the namespace.
- run by the mds.
- mds can be online.
- mds can efficiently verify the rstats as it goes.
- we can propagate any 'dirty' rstats up the tree as we find them.
- optionally verify the backtrace values on directories and/or files as
  we go.
  - this can be efficient too: a single backtrace update op can be
    guarded by a check that the backtrace is valid/up to date.
- if we encounter a bad forward link (a primary dentry that references
  a directory whose object does not exist), we add it to the set M.
  - we can store M as a key/value object somewhere.
- if we complete the scan and M is empty, and the rstats/dirstats were
  all clean/correct, we are happy.
- this can presumably be made a parallel operation without too much
  trouble.
- we could put a last_scrubbed_subtree stamp in the dirstat struct, but
  it would mean dirtying directories as we go, which may not be what we
  want in all cases. maybe make it optional.
- i don't think there is any reason why the fs couldn't be online and
  in use while this is going on. and that is what we want, anyway:
  online namespace scrubbing for healthy file systems.

This would probably be the first piece we implement. There would be a
fast mode and a slow mode, depending on whether we verify backtraces.

Backward scan:

The forward scan doesn't help us find lost/orphan directories, and it
only detects lost directories near the root. If M is non-empty, we need
to do the slow part. A backward scan would:

- iterate over directory objects in the metadata pool.
  - note that these are stored in hash(ino) order, so we get them in
    random order.
- for each directory object, look at the backtrace and traverse forward
  in the tree to verify it is reachable.
- if it is not, recreate the missing directory and add in the forward
  link from our backtrace.
  - we can mark that link as 'tentative' or 'recovered' or something,
    along with the (now) partially repopulated directory (which is in
    M).

(A rough sketch of this per-object check appears below.)

Doing the above for *all* items would be sufficient, but horribly
slow... basically O(n log^2 n) or something (assuming some caching).
But if M is smallish:

- only verify backtraces for directories that are descendants of an
  item in M.
- and only verify the portion of the backtrace that is rooted above the
  item from M; ignore the child portion.
- queue a forward scrub starting from that point to see if the
  reattached subtree is complete.

This is probably not so inefficient, even when that includes a lot of
directories, because the cache hit rate will be good. Say we have/had
/a/b/c/(1..10000000), and b was lost/in M. For the first child of c we
hit, we would recreate an empty b and add the forward link from b to c.
Every subsequent child would hit the cached c (with a single in-memory
ino lookup) and/or verify the a/b/c forward dentry/link and we'd move
on.

The grandchild of an item in M could be missing too. So if b and c are
both lost, and we encounter /a/b/c/1, we add the c link and requeue the
forward scan, but immediately find that c is missing too and add it to
M. At that point, we would need to restart the scan. That is probably
not ideal.
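To make that per-object check concrete, here is a minimal sketch in
Python over an in-memory namespace. Everything here (the dict-based
namespace, queue_forward_scrub, the backtrace decoded into root-first
(ino, name) pairs) is an illustrative stand-in for the real MDS/RADOS
structures, not actual code:

    ROOT_INO = 1
    namespace = {ROOT_INO: {'a': 10}, 10: {}}  # 'a' exists; b was lost
    M = set()                    # inos of recreated/incomplete dirs

    def queue_forward_scrub(ino):
        pass                     # stand-in: requeue subtree for forward scan

    def check_backtrace(ino, backtrace):
        # backtrace: [(ancestor_ino, dentry_name), ...], root-first,
        # ending with the object itself, as decoded from its xattr
        parent = ROOT_INO
        for anc_ino, name in backtrace:
            children = namespace[parent]
            if name not in children:
                # missing forward link: recreate the dir (empty) and
                # relink it from the backtrace
                namespace.setdefault(anc_ino, {})
                children[name] = anc_ino
                if anc_ino != ino:
                    # a missing *ancestor*: it is now partially
                    # repopulated, so remember it and requeue a
                    # forward scrub from here
                    M.add(anc_ino)
                    queue_forward_scrub(anc_ino)
            parent = children[name]
        return parent == ino     # object is reachable via its backtrace

    # e.g. for /a/b/c with b lost: recreates b, links it under a and c
    # under it, and adds b to M
    check_backtrace(30, [(10, 'a'), (20, 'b'), (30, 'c')])

For the /a/b/c example above, the first call with b missing recreates an
empty b and adds it to M; every later call that walks through b just
follows the now-present links and falls through.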
To avoid that restart, each item in M could instead carry a range or
interval_set over which it has been scanned. The initial items would be
[], but if we add in c part-way through the pool, we could just continue
our forward scan and keep track of which parts of the namespace have
been checked for that ancestor. After a full pass over the metadata pool
(all dir objects) and/or data pool(s) (file objects), items in M that
were present for the full pass can be marked 'complete' in the fs, and
their ancestors adjusted with the new summation/stat/last_clean values.

The scan could be optimized in a couple of ways (maybe):

- push the O(n) search to the OSD.
  - a class 'filter' thing would examine each object and determine
    whether it is a match.
  - it could parse the backtrace and pull out just the b-and-ancestors
    part, and uniq it over a bunch of objects.
- maintain an actual index on the OSD.
  - add some new rados functionality to maintain a real index over
    select object xattrs, such as backtraces (or more likely individual
    backtrace components), so that the search is O(log n) instead of
    O(n). or something along those lines.

The trick in pushing filtering to the OSD is that it needs to know what
M is. We can do that easily for a small M; otherwise we'd need to use a
bloom filter or something.

The implementation tricks will be:

- making the forward scan parallel/distributed. handing off subtrees to
  scan when they are delegated across nodes is the main thing, i think.
  if there are actual repairs, those would probably appear to be normal
  updates/mutations. the existing cache coherency should make the
  parent's summation/check on a dir after the children complete not
  much different from a normal filelock scatter/gather type operation.
- making the forward scan recoverable if an mds restarts. periodically
  journal our current 'iterator' position(s)? or, if we actually write
  out a last_clean stamp, we would just restart at the beginning and
  skip any recently-scrubbed subtrees.
- allowing forward and backward work to go on at the same time. a
  backward scan will periodically requeue subtrees for forward
  scanning, and a forward scan may add to M.

On that last point, I think deconstructing the depth-first search into
discrete pre- and post- steps (and not recursion) will simplify things
and make it more easily parallel. For example:

    scan(dir):
      - add dir to Q

    worker():
      - pop X from the front of Q
      - if X is a dir:
        - enumerate children
        - push a "done with X" item onto the front of Q
        - then push all children onto the front of Q, ahead of it
      - if X is a missing dir:
        - add it to M
      - if X is a completion ("done with X"):
        - do post(X)

    post(dir):
      - enumerate children
      - recalculate the summation (rstat, dirstat, whatever)
      - verify that the dir inode's summation value is correct

...with something else in there to make us defer the post step for any
items that are, or are ancestors of, items in M. And once repaired items
are removed from M, somehow requeue the post step so they and their
ancestors can validate the summations.
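A minimal runnable version of that loop, over a toy in-memory tree (in
Python; the dict tree, the path-string ancestor test, and the 'deferred'
set are stand-ins for real MDS structures like CDir/CInode):

    from collections import deque

    # toy namespace: dir -> list of children; None marks a dir whose
    # object is missing (a bad forward link)
    tree = {'/': ['/a', '/b'], '/a': ['/a/x'], '/b': [], '/a/x': None}

    M = set()          # bad forward links found during the scan
    deferred = set()   # dirs whose post step is blocked by items in M

    def post(d):
        # recalculate the summation (rstat, dirstat, whatever) from the
        # children and verify the dir inode's value is correct
        print('post', d)

    def blocked(d):
        # d is, or is an ancestor of, an item in M
        prefix = d if d.endswith('/') else d + '/'
        return any(m == d or m.startswith(prefix) for m in M)

    def scan(root):
        Q = deque([('pre', root)])
        while Q:
            kind, x = Q.popleft()
            if kind == 'pre':
                if tree[x] is None:       # missing dir
                    M.add(x)
                    continue
                # push the completion first, then the children in front
                # of it, so every child is handled before post(x)
                Q.appendleft(('post', x))
                for c in reversed(tree[x]):
                    Q.appendleft(('pre', c))
            else:                         # completion ("done with x")
                if blocked(x):
                    deferred.add(x)       # revisit after repair
                else:
                    post(x)

    scan('/')   # prints 'post /b'; / and /a land in deferred

Once a repaired item is dropped from M, the entries in deferred (and
their ancestors) would get requeued as completion items, per the last
paragraph above.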