Re: CephFS recovery from missing metadata objects questions

> On 7 December 2016 at 20:54, John Spray <jspray@xxxxxxxxxx> wrote:
> 
> 
> On Wed, Dec 7, 2016 at 7:47 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >
> >> On 7 December 2016 at 16:53, John Spray <jspray@xxxxxxxxxx> wrote:
> >>
> >>
> >> On Wed, Dec 7, 2016 at 3:46 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >
> >> >> On 7 December 2016 at 16:38, John Spray <jspray@xxxxxxxxxx> wrote:
> >> >>
> >> >>
> >> >> On Wed, Dec 7, 2016 at 3:28 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> >> >> > (I think John knows the answer, but sending to ceph-users for archival purposes)
> >> >> >
> >> >> > Hi John,
> >> >> >
> >> >> > A Ceph cluster lost a PG with CephFS metadata in there and it is currently doing a CephFS disaster recovery as described here: http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
> >> >>
> >> >> I wonder if this has any relation to your thread about size=2 pools ;-)
> >> >
> >> > Yes, it does!
> >> >
> >> >>
> >> >> > This data pool has 1.4B objects and currently has 16 concurrent scan_extents scans running:
> >> >> >
> >> >> > # cephfs-data-scan --debug-rados=10 scan_extents --worker_n 0 --worker_m 16 cephfs_metadata
> >> >> > # cephfs-data-scan --debug-rados=10 scan_extents --worker_n 1 --worker_m 16 cephfs_metadata
> >> >> > ..
> >> >> > ..
> >> >> > # cephfs-data-scan --debug-rados=10 scan_extents --worker_n 15 --worker_m 16 cephfs_metadata
> >> >> >
> >> >> > According to the source in DataScan.cc:
> >> >> > * worker_n: Worker number
> >> >> > * worker_m: Worker count
> >> >> >
> >> >> > So with the commands above I have 16 workers running, correct? For the scan_inodes I want to scale out to 32 workers to speed up the process even more.
> >> >> >
> >> >> > Just to double-check before I send a new PR to update the docs, this is the right way to run the tool, correct?
> >> >>
> >> >> It looks like you're targeting cephfs_metadata instead of your data pool.
> >> >>
> >> >> scan_extents and scan_inodes operate on data pools, even if your goal
> >> >> is to rebuild your metadata pool (the argument is what you are
> >> >> scanning, not what you are writing to).
> >> >
> >> > That was a typo on my part when typing this e-mail. It is scanning the *data* pool at the moment.
> >> >
> >> > Can you confirm that the worker_n and worker_m arguments are the correct ones?
> >>
> >> Yep, they look right to me.
> >
> > Ok, great. I pushed a PR to update the docs and help. Care to review it?
> >
> > https://github.com/ceph/ceph/pull/12370
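(The invocation pattern confirmed above can be sketched as a small loop. The pool name is an assumption for illustration, and this block only *prints* the per-worker commands rather than running them; pipe the output to a shell, or append '&' to each line, to launch the workers in parallel.)

```shell
# Sketch: generate one scan invocation per worker number 0..worker_m-1.
# Assumption: the data pool is named "cephfs_data" -- substitute your own.
gen_scan_cmds() {
  local phase=$1 pool=$2 m=$3 n
  for n in $(seq 0 $((m - 1))); do
    echo "cephfs-data-scan --debug-rados=10 $phase --worker_n $n --worker_m $m $pool"
  done
}

# The 16-worker scan_extents pass discussed above:
gen_scan_cmds scan_extents cephfs_data 16
```

The same helper generates the 32-worker scan_inodes pass, e.g. `gen_scan_cmds scan_inodes cephfs_data 32`.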
> >
> >>
> >> >>
> >> >> There is also a "scan_frags" command that operates on a metadata pool.
> >> >
> >> > Didn't know that. In this case the metadata pool is missing objects due to that lost PG.
> >> >
> >> > I think the scan_extents and scan_inodes on the *data* pool is the correct way to rebuild the metadata pool if it is missing objects, right?
> >>
> >> In general you'd use both scan_frags (to re-link any directories
> >> that might have been orphaned because they had an ancestor
> >> dirfrag in the lost PG) and then scan_extents+scan_inodes (to re-link
> >> any files that might have been orphaned because their
> >> immediate parent dirfrag was in the lost PG).
> >>
> >> However scan_extents+scan_inodes is generally doing the lion's share
> >> of the work because anything that scan_frags would have caught would
> >> probably also have appeared somewhere in a backtrace path and got
> >> linked in by scan_inodes as a result, so you should probably just skip
> >> scan_frags in this instance.
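(A hedged sketch of the general order John describes: scan_frags first, then the parallel scan_extents and scan_inodes passes over the data pool. The pool name and worker count are assumptions, and this block only *prints* the plan -- the scan_frags invocation in particular may need version-specific arguments -- so review it against the disaster-recovery docs before running anything.)

```shell
# Print the recovery command sequence in the order described above,
# without executing anything. Assumption: data pool named "cephfs_data".
emit_recovery_plan() {
  local data=$1 m=$2 n
  echo "cephfs-data-scan scan_frags"
  for n in $(seq 0 $((m - 1))); do
    echo "cephfs-data-scan scan_extents --worker_n $n --worker_m $m $data"
  done
  for n in $(seq 0 $((m - 1))); do
    echo "cephfs-data-scan scan_inodes --worker_n $n --worker_m $m $data"
  done
}
emit_recovery_plan cephfs_data 16
```

In the lost-metadata-PG case discussed here, John's advice is to skip the scan_frags step entirely.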
> >>
> >> BTW, you've probably already realised this, but be *very* cautious
> >> about using the recovered filesystem: our testing of these tools is
> >> mostly verifying that after recovery we can see and read the files
> >> (i.e. well enough to extract them somewhere else), not that the
> >> filesystem is necessarily working well for writes etc after being
> >> recovered.  If it's possible, then it's always better to recover your
> >> files to a separate location, and then rebuild your filesystem with
> >> fresh pools -- that way you're not risking that there is anything
> >> strange left behind by the recovery process.
> >>
> >
> > I'm aware of this. Currently trying to make the best out of this situation and get the FS up and running.
> >
> > The MDS was running fine for about 24 hours, but then started to assert on missing RADOS objects in the metadata pool. So we had to resort to this scan, which takes a very, very long time.
> >
> > 2016-12-07 08:29:58.852595 7f3d74c96700 -1 log_channel(cluster) log [ERR] : dir 10011a4767b object missing on disk; some files may be lost
> > 2016-12-07 08:29:58.855070 7f3d74c96700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_OpenInoTraverseDir::finish(int)' thread 7f3d74c96700 time 2016-12-07 08:29:58.852637
> > mds/MDCache.cc: 8213: FAILED assert(r >= 0)
> 
> Oops, that's a bug.  These cases are supposed to make the MDS record
> the damage and return EIO to requests for things beneath that path,
> not assert out.  Could you open a ticket please?

Done! http://tracker.ceph.com/issues/18179

Thanks again for the quick responses, very much appreciated!

Wido

> 
> John
> 
> >
> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >
> > Wido
> >
> >> John
> >>
> >> > Wido
> >> >
> >> >>
> >> >> John
> >> >>
> >> >> > If not, before sending the PR and starting scan_inodes on this cluster, what is the correct way to invoke the tool?
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


