There are several issues we need to address.

0- Online scrub. We should be able to do a (parallel) pass over the directory tree, while the fs is active, that verifies the metadata is consistent: forward dentry links agree with file and directory backtraces, and rstats and dirstats are correct.

1- Missing/corrupt/incomplete mds journal. If journal replay fails, currently we just bomb out. Instead, we need to do one or more of:
 - throw out the journal(s) and the current mds cluster members, and bring the mds(s) up.
 - scour as much useful metadata out of the remaining bits of the journal first.
 - have a mode/flag (or not?) where we are in 'recovery' or 'unclean' mode and will do some sort of online namespace repair.

2- Missing directory. This is the tricky one, because a catastrophic rados failure like the loss of a PG would mean we lose a random subset of the directory objects. That would effectively prune off random subtrees of the hierarchy, and we'd like to be able to find and reattach them. This is what the backtrace stuff is there for.

3- Corrupt directory object, or corrupt inode metadata. If a directory appears corrupt, we should salvage what we can and try to rebuild the rest. This may end up looking much like the above.

Our previous discussions have focused on how to handle #2, but I'm hoping we can distill this down to a few common recovery behaviors that cover entire classes of inconsistencies.

Handling a missing directory object is probably the hardest piece, so let's start there. The basic idea we discussed before is to build a working list of missing directories, say M. We then scan the objects in the metadata pool for objects that are children of M according to their backtraces, and tentatively link them into place. If we encounter some other forward link to one of those children with a newer version stamp, the newer link wins and the tentative link is removed. Having complete confidence in the recovered links means we need to scan the full namespace.
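To make the tentative-link pass concrete, here is a minimal sketch of the conflict-resolution rule: link a child under a missing parent from its backtrace, but let a forward link with a newer version stamp win. All of the names here (Backtrace, recover_links, the in-memory dicts) are hypothetical stand-ins for illustration, not the on-disk format or any real Ceph API.

```python
from dataclasses import dataclass

@dataclass
class Backtrace:
    ino: int         # inode of the object itself
    parent_ino: int  # immediate parent directory inode
    name: str        # dentry name under that parent
    version: int     # version stamp of this link

def recover_links(missing, backtraces, forward_links):
    """Tentatively link objects whose backtrace parent is in the
    missing set M.  forward_links maps ino -> (version, parent_ino,
    name) for forward dentry links found elsewhere in the namespace;
    a forward link with a newer version wins over the backtrace."""
    recovered = {}
    for bt in backtraces:
        if bt.parent_ino not in missing:
            continue  # parent directory survived; nothing to do
        fwd = forward_links.get(bt.ino)
        if fwd is not None and fwd[0] > bt.version:
            continue  # newer forward link wins; drop tentative link
        recovered[bt.ino] = (bt.version, bt.parent_ino, bt.name)
    return recovered
```

Note that this only decides winners among links it has already seen, which is why complete confidence requires scanning the full namespace for forward links first.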
Searching for children is currently an O(n) operation in RADOS, unless/until we introduce some indexing of objects. That may be feasible with leveldb, but it's not there yet.

We also need to identify lost grandchildren. For /a/b/c, the a and b directory objects may both have been lost, but we only know that a is lost from the broken /a link. If we only search for children of /, the rebuilt /a/ won't include b (which is also lost). This means our tentative links may need to include multiple ancestors, or the recovery may need to be a multiple-pass operation. The possibility of directory renames makes this especially interesting. It may be that we shoot not so much for perfectly relinking subtrees as for exhaustively linking them, and just aggressively push out backtrace updates after renames so that files will reappear somewhere reasonably recent. (This is disaster recovery, after all.)

At a high level, what we are trying to recover is a map of ino -> (version, parent ino, name) for all parent inos in the missing set.

I'm pretty sure we had significantly more insight into what was involved here, but it has been almost two years since we discussed it and I'm having trouble dredging it up. Hopefully this is enough to get people started...

sage
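The multiple-ancestor idea above can be sketched as a single pass over full backtrace chains: since each backtrace records the whole ancestry of an object, every lost intermediate directory (both /a and /a/b in the example) can be relinked from one child's chain, with the newest version winning when chains disagree. The chain representation and rebuild_ancestry are assumptions for illustration, producing exactly the ino -> (version, parent ino, name) map described above.

```python
def rebuild_ancestry(backtrace_chains, known_dirs):
    """backtrace_chains: one list per surviving object, each a list of
    (ino, parent_ino, name, version) entries from the object up toward
    the root.  known_dirs: inos of directories that still exist.
    Returns ino -> (version, parent_ino, name) for everything that
    needs to be relinked, newest version winning on conflict."""
    recovered = {}
    for chain in backtrace_chains:
        for ino, parent_ino, name, version in chain:
            if ino in known_dirs:
                continue  # this ancestor survived; stop relinking it
            best = recovered.get(ino)
            if best is None or version > best[0]:
                recovered[ino] = (version, parent_ino, name)
    return recovered
```

A multiple-pass variant would instead repeat the child search once per newly discovered missing directory; carrying the full chain in each backtrace avoids those extra O(n) scans at the cost of staler ancestor information after renames, which is why aggressively pushing out backtrace updates matters.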