On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll <rleimens@xxxxxxxxxxxxxx> wrote:
> Hi All,
>
> We recently experienced a failure with our 12.2.4 cluster running a CephFS
> instance that resulted in some data loss, due to a seemingly problematic OSD
> blocking IO on its PGs. We restarted the (single active) MDS daemon during
> this, which caused damage because the journal did not have a chance to flush
> back. We reset the journal, session table, and fs to bring the filesystem
> back online. We then removed some directories/inodes that were causing the
> cluster to report damaged metadata (and were otherwise visibly broken when
> navigating the filesystem).

This may be over-optimistic of me, but is there any chance you kept a
detailed record of exactly what damage was reported, and of what you did to
the filesystem so far?

It's hard to give any intelligent advice on repairing it when we don't know
exactly what was broken, and a number of unknown repair-ish operations have
already manipulated the metadata behind the scenes.

John

> With that, there are now some paths that seem to have been orphaned (which
> we expected). In the name of getting the system back online ASAP, we did
> not run the ‘cephfs-data-scan’ tool [0]. Now that the filesystem is
> otherwise stable, can we safely initiate a scan_links operation with the
> MDS active?
>
> [0]
> http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
>
> Thanks much,
> Ryan Leimenstoll
> rleimens@xxxxxxxxxxxxxx
> University of Maryland Institute for Advanced Computer Studies
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
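
For reference, the reset steps Ryan describes, and the scan_links pass he is asking about, correspond roughly to the following commands from the linked Luminous disaster-recovery doc. This is a sketch only: <fs_name> is a placeholder, and per that doc the cephfs-data-scan tools are run with the MDS daemons stopped, not against an active MDS.

```shell
# Sketch of the recovery steps described above, per the Luminous
# disaster-recovery documentation. <fs_name> is a placeholder for the
# actual filesystem name; all of these assume the MDS daemons are stopped.

cephfs-journal-tool journal reset                 # discard the damaged MDS journal
cephfs-table-tool all reset session               # wipe the session table
ceph fs reset <fs_name> --yes-i-really-mean-it    # mark rank 0 as recoverable

# scan_links repairs dentry link counts and recovers orphaned inodes;
# the doc runs it offline as part of the same recovery sequence:
cephfs-data-scan scan_links
```

Note that these commands are destructive (journal contents are discarded), which is why the doc recommends taking a backup with `cephfs-journal-tool journal export` before resetting anything.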