Re: cephfs-data-scan safety on active filesystem

On Tue, May 8, 2018 at 8:49 PM, Ryan Leimenstoll
<rleimens@xxxxxxxxxxxxxx> wrote:
> Hi Gregg, John,
>
> Thanks for the warning. It was definitely conveyed that these tools are dangerous, and I assumed that running them online was a bad idea, but I just wanted to verify.
>
> John,
>
> We were mostly operating off what the mds logs reported. After bringing the mds back online and active, we mounted the volume with the kernel driver on one host and started a recursive ls through the root of the filesystem to see what was broken. Two main paths of the tree seemed to be affected initially, both producing errors like the following in the mds log (I’ve swapped out the paths):
>
> Group 1:
> 2018-05-04 12:04:38.004029 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 0x10011125556 object missing on disk; some files may be lost (/cephfs/redacted1/path/dir1)
> 2018-05-04 12:04:38.028861 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 0x1001112bf14 object missing on disk; some files may be lost (/cephfs/redacted1/path/dir2)
> 2018-05-04 12:04:38.030504 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 0x10011131118 object missing on disk; some files may be lost (/cephfs/redacted1/path/dir3)
>
> Group 2:
> 2021-05-04 13:24:29.495892 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 0x1001102c5f6 object missing on disk; some files may be lost (/cephfs/redacted2/path/dir1)
>
> Some of the paths it complained about appeared empty via ls, although trying to rm [-r] them via the mount failed with an error suggesting files still existed in the directory. We removed the dir object in the metadata pool that it was still warning about (rados -p metapool rm 10011125556.0000, for example). This cleaned up the errors on this path. We then did the same for Group 2.
>
> After this, we initiated a recursive scrub with the mds daemon on the root of the filesystem to run over the weekend.
>
> In retrospect, we probably should have done the data scan steps mentioned in the disaster recovery guide before bringing the system online. The cluster is currently healthy (or, rather, reporting healthy) and has been for a while.
>
> My understanding here is that we would need something like the cephfs-data-scan steps to recreate metadata, or at least to identify (for cleanup) objects that may have been stranded in the data pool. Is there any way, likely with another tool, to do this on an active cluster? If not, is this something that can be done with some amount of safety on an offline system? (Not sure how long it would take; the data pool is ~100T with 242 million objects, and downtime is a big pain point for our users with deadlines.)

When you do a forward scrub, there is an option to apply a "tag" (an
arbitrary string) to the data objects of files that are present in the
metadata tree (i.e. non-orphans).  cephfs-data-scan then has a
"--filter-tag" option that skips everything carrying that tag, so a
scan_extents/scan_inodes pass targets only the orphans and recovers
them into a lost+found directory.  If you ultimately just want to
delete them, you can do that from lost+found once the filesystem is
back online.  In that process, the forward scrub can happen while the
cluster is online, but the cephfs-data-scan steps must run while the
MDS is *not* running.
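
For reference, a rough sketch of that sequence, assuming the
Luminous-era "tag path" MDS admin socket command and the --filter-tag
option to cephfs-data-scan; the tag name, MDS id, filesystem name and
pool name below are placeholders:

  # While the MDS is up: forward scrub that tags every data object
  # reachable from the metadata tree ("mytag" is an arbitrary string).
  ceph daemon mds.<id> tag path / mytag

  # Take the filesystem offline before running cephfs-data-scan.
  ceph fs set <fs_name> cluster_down true
  ceph mds fail 0

  # With no MDS running: scan only untagged (orphaned) objects and
  # recover them into lost+found.
  cephfs-data-scan scan_extents --filter-tag mytag <data_pool>
  cephfs-data-scan scan_inodes --filter-tag mytag <data_pool>

  # Bring the filesystem back, then inspect/delete from lost+found.
  ceph fs set <fs_name> cluster_down false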

The alternative is to write yourself a script that recursively lists
the filesystem, keeps an index of which inodes exist, and then scans
through all the objects in the data pool, removing anything whose
inode prefix isn't in that set.  You could do that without stopping
your MDS, although of course you would need to either stop creating
new files, or have your script treat any inode above a certain cutoff
as existing.
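
Something along these lines (untested sketch; the mount point and pool
name are placeholders, and it assumes a GNU userland plus the usual
<inode-hex>.<stripe-index> naming of data pool objects):

  #!/bin/bash
  # Untested sketch: find data-pool objects whose inode no longer
  # exists in the filesystem.  MOUNT and DATA_POOL are placeholders.
  MOUNT=/mnt/cephfs
  DATA_POOL=cephfs_data

  # 1. Every inode currently in the filesystem, as lower-case hex.
  find "$MOUNT" -printf '%i\n' \
      | while read -r ino; do printf '%x\n' "$ino"; done \
      | sort -u > known_inodes.txt

  # 2. Data objects are named <inode hex>.<stripe index>; keep the prefix.
  rados -p "$DATA_POOL" ls | cut -d. -f1 | sort -u > object_inodes.txt

  # 3. In the pool but not in the tree => candidate orphan.
  comm -23 object_inodes.txt known_inodes.txt > orphan_inodes.txt

  # Review orphan_inodes.txt by hand before removing anything; files
  # created while the scan runs must be excluded, e.g. by quiescing
  # writes or treating inodes above a cutoff as existing.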

In any case, I'd strongly recommend you create a separate filesystem
to play with first, to work out your procedure.  You can easily
synthesise the damage from a lost PG by randomly deleting some
percentage of the objects in your metadata pool.
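
For example, something like this (placeholder pool name; be very sure
it points at the test filesystem's metadata pool, not the real one)
knocks out a few hundred random metadata objects:

  # Delete 200 randomly chosen objects from the TEST metadata pool.
  rados -p test_cephfs_metadata ls | shuf | head -n 200 \
      | xargs -r -n 1 rados -p test_cephfs_metadata rm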

John


> Thanks,
>
> Ryan
>
>> On May 8, 2018, at 5:05 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll
>> <rleimens@xxxxxxxxxxxxxx> wrote:
>>> Hi All,
>>>
>>> We recently experienced a failure with our 12.2.4 cluster running a CephFS
>>> instance that resulted in some data loss due to a seemingly problematic OSD
>>> blocking IO on its PGs. We restarted the (single active) mds daemon during
>>> this, which caused damage due to the journal not having the chance to flush
>>> back. We reset the journal, session table, and fs to bring the filesystem
>>> online. We then removed some directories/inodes that were causing the
>>> cluster to report damaged metadata (and were otherwise visibly broken by
>>> navigating the filesystem).
>>
>> This may be over-optimistic of me, but is there any chance you kept a
>> detailed record of exactly what damage was reported, and what you did
>> to the filesystem so far?  It's hard to give any intelligent advice on
>> repairing it, when we don't know exactly what was broken, and a bunch
>> of unknown repair-ish things have already manipulated the metadata
>> behind the scenes.
>>
>> John
>>
>>> With that, there are now some paths that seem to have been orphaned (which
>>> we expected). We did not run the ‘cephfs-data-scan’ tool [0] in the name of
>>> getting the system back online ASAP. Now that the filesystem is otherwise
>>> stable, can we initiate a scan_links operation with the mds active safely?
>>>
>>> [0]
>>> http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
>>>
>>> Thanks much,
>>> Ryan Leimenstoll
>>> rleimens@xxxxxxxxxxxxxx
>>> University of Maryland Institute for Advanced Computer Studies
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



