On Tue, Oct 11, 2016 at 12:00 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
> Hey,
>
> After a bright idea to pause 10.2.2 Ceph cluster for a minute to see if it
> will speed up backfill I managed to corrupt my MDS journal (should it happen
> after cluster pause/unpause, or is it some sort of a bug?). I had "Overall
> journal integrity: DAMAGED", etc

Uh, pausing/unpausing your RADOS cluster should never do anything apart from pausing IO. That's DEFINITELY a severe bug if it corrupted objects!

> I was following http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
> and have some questions/feedback:

Caveat: this is a difficult area to document, because the repair tools interfere with internal on-disk structures. If I can use a bad metaphor: it's like being in an auto garage and asking for documentation about the tools; the manual for the wrench doesn't tell you anything about how to fix the car engine. Similarly, it's hard to write useful documentation about the repair tools without also writing a detailed manual for how all the CephFS internals work.

> * It would be great to have some info on when ‘snap’ or ‘inode’ should be reset

You would reset these tables if you knew that for some reason they no longer matched the reality elsewhere in the metadata.

> * It is not clear when MDS start should be attempted

You would start the MDS when you believed that you had done all you could with offline repair. Everything on the "disaster recovery" page is about offline tools.

> * Can scan_extents/scan_inodes be run after MDS is running?

These are meant only for offline use. You could in principle run scan_extents while an MDS was running, as long as you had no data writes going on. scan_inodes writes directly into the metadata pool, so it is certainly not safe to run at the same time as an active MDS.

> * "online MDS scrub" is mentioned in docs. Is it scan_extents/scan_inodes or
> some other command?
That refers to the "forward scrub" functionality inside the MDS, which is invoked with the "scrub_path" or "tag path" commands.

> Now CephFS seems to be working (I have "mds0: Metadata damage detected" but
> scan_extents is currently running), let's see what happens when I finish
> scan_extents/scan_inodes.
>
> Will these actions solve possible orphaned objects in pools? What else
> should I look into?

A full offline scan_extents/scan_inodes run should re-link orphans into a top-level lost+found directory (from which you can subsequently delete them once your MDS is back online).

John

>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
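The guidance in this reply about whether scan_extents and scan_inodes may overlap with a running MDS can be summarised as a small decision function. This is a hypothetical sketch: `Tool` and `safe_to_run` are illustrative names, not part of Ceph, and the function only encodes the rules stated above; it does not inspect a real cluster.

```python
from enum import Enum


class Tool(Enum):
    # Hypothetical labels for the two offline scan passes discussed above.
    SCAN_EXTENTS = "scan_extents"
    SCAN_INODES = "scan_inodes"


def safe_to_run(tool: Tool, mds_active: bool, data_writes: bool) -> bool:
    """Return True if running `tool` is safe under the stated conditions.

    Hypothetical helper that mirrors the mailing-list guidance only.
    """
    if not mds_active:
        # Both passes are meant for offline use, so this is the normal case.
        return True
    if tool is Tool.SCAN_EXTENTS:
        # Could in principle run alongside an MDS, but only with no data writes.
        return not data_writes
    # scan_inodes writes directly into the metadata pool: never with an active MDS.
    return False
```

For example, `safe_to_run(Tool.SCAN_INODES, mds_active=True, data_writes=False)` is `False`, while the same call with `mds_active=False` is `True`.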