Preface: as I mentioned on your other thread about repairing a filesystem,
these tools are not finished yet, and probably won't be comprehensively
documented until they all fit together. So my answers here are for
interest, not an encouragement to treat these tools as "ready to go".

On Tue, Sep 15, 2015 at 8:31 AM, Goncalo Borges
<goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
> 1. The cluster loses more OSDs than the number of configured replicas.
> 2. There is loss of all objects for specific PGs in the data pool.
> 3. There is loss of all objects for specific PGs in the metadata pool.
> 4. The cluster has been recovered mostly by deleting the problematic OSDs
> from the crush map, and by recreating stale PGs.
> 5. MDS has been restarted and CephFS remounted.
>
> My questions are:
>
> a./ If the MDS journal can be replayed, it may be able to recreate some
> of the lost metadata, if it still maintains relevant information in the
> log. Is this correct?

Yes. If metadata is played back from the journal, but is not present in
the metadata pool, then you have a chance to recover from that by letting
the "forward scrub" mechanism detect the missing metadata and repair it.
That mechanism doesn't fully exist yet.

If the journal is not replayable, or the MDS won't start for some other
reason, then the "cephfs-journal-tool recover_dentries" command will do
its best to recover what metadata it can from the journal and inject it
directly into the metadata pool while the MDS is offline.

> b./ If all the metadata for given files is lost, but the files themselves
> have all their objects intact, would we be able to mount cephfs? If yes,
> how would those files appear in the filesystem? With '??? ??? ???' for
> the attributes? And in the same location as before?

If all metadata for files is lost, they will not be visible in client
mounts at all. You will only be aware of their existence if you go and
list the objects in the data pool using rados tools.

Creating new metadata for these files (so that they are once again
accessible) is what "cephfs-data-scan scan_inodes" is for. It enumerates
the objects in the data pool and uses their backtraces to guess where to
insert metadata in the filesystem. If it can't make a guess, it puts them
in lost+found/.

Things like permissions are not recoverable in this kind of scenario:
they're reset to some sensible defaults. The purpose of this mode is
primarily to allow administrators to recover the data in the files -- we
would expect administrators to be inspecting the results and applying
their own idea of what the permissions should be after the disaster
recovery has happened.

> c./ In the situation reported in b./, what would be the proper steps to
> start injecting metadata for the orphan files? Please correct me if I am
> wrong, but I am assuming
>
>  - cephfs-table-tool 0 reset session
>  - cephfs-table-tool 0 reset snap
>  - cephfs-table-tool 0 reset inode
>  - cephfs-journal-tool --rank=0 journal reset
>  - cephfs-data-scan init
>  - cephfs-data-scan scan_extents <data pool>
>  - cephfs-data-scan scan_inodes <data pool>

Yep, pretty much. But I want to emphasize that this is not a
one-size-fits-all procedure, it is not necessarily safe, and it does not
necessarily result in an improvement. Disaster recovery is meant to be
done by identifying what is broken, and cautiously making the minimum
necessary interventions to bring the FS back online. For example,
currently when you do this you are resetting the set of free inodes, which
you probably didn't want to do unless you really had lost the inode table.
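To make the "identify what's broken, then repair minimally" point a bit
more concrete, here is roughly the order in which I would look at things
before running any of the destructive steps above. Treat it as a sketch,
not a recipe: the MDS should be stopped first, "backup.bin" and the
"cephfs_data" pool name are placeholders, and the exact subcommand
spellings may differ between releases, so check each tool's help output on
your version.

  # Look at the journal before deciding that anything needs resetting
  cephfs-journal-tool journal inspect

  # Keep a copy of the journal before any destructive step
  cephfs-journal-tool journal export backup.bin

  # Salvage what dentries we can from the journal into the metadata pool
  cephfs-journal-tool event recover_dentries summary

  # Only if the journal really is unrecoverable:
  cephfs-journal-tool journal reset

  # Reset only the tables that are actually damaged, e.g. just sessions:
  cephfs-table-tool 0 reset session

  # Rebuild metadata for orphaned objects in the data pool
  cephfs-data-scan init
  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes cephfs_data

The first three steps don't throw anything away (recover_dentries writes
salvaged entries into the metadata pool but leaves the journal alone),
whereas "journal reset" and the table resets do discard information, which
is why they come last and only once you have convinced yourself they are
needed.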
> d./ In step c./, what is the exact difference between 'cephfs-table-tool
> 0 reset session', 'cephfs-table-tool 0 reset snap' and 'cephfs-table-tool
> 0 reset inode'? Are there situations where we do not want to use the 3
> cephfs-table-tool reset commands but just one or two?

These are resetting separate metadata structures. It's entirely possible
that one has a problem while the others don't. For example, there might be
a bogus client session in there, but that doesn't mean you would want to
reset the inode table (which stores what inodes are available).

I can sort of see where you're going with this -- couldn't we have a
fewer-steps procedure? Yes, but it would have to be something much smarter
that identified faults and did the minimum amount of repair, rather than a
nuclear "reset everything" hammer that tried to do everything in one go.
In the early days of these tools, they are going to remain pretty low
level and require a lot of expertise to use. In support jargon, these
tools are meant for level 3 -- expert (probably developer) intervention in
rare cases.

> e./ In what circumstances would we do a reset of the filesystem with
> 'ceph fs reset cephfs --yes-i-really-mean-it'?

This is resetting the MDSMap. You might do this if, for example, your map
indicated that there should be several MDS ranks, but you know the
metadata for those ranks is trashed, and you want to recover the
filesystem within a single rank, rather than trying to rebuild consistent
metadata for a multi-MDS state.

Cheers,
John
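P.S. For anyone who finds this thread later: the single-rank recovery in
that last answer would look roughly like the sketch below. It is
illustrative only -- "cephfs" is just a placeholder filesystem name, and
you should be confident the extra ranks really are unrecoverable before
resetting the map.

  # See what the current MDS map claims before changing anything
  ceph mds dump

  # With all MDS daemons stopped, collapse the filesystem to a single rank
  ceph fs reset cephfs --yes-i-really-mean-it

  # Start one MDS again and check that rank 0 comes back up
  ceph mds stat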