Re: How often should I scrub the filesystem ?

Hi Milind (or anyone else who can help...)

Reading this thread made me realise I had overlooked cephfs scrubbing, so I tried it on a small 16.2.7 cluster. The normal forward scrub showed nothing. However, "ceph tell mds.0 scrub start ~mdsdir recursive" did find one backtrace error (putting the cluster into HEALTH_ERR). I then ran a repair which, according to the log, did rewrite the inode, and subsequent scrubs have not found it again.
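
For reference, the sequence I ran was roughly the following (the exact invocations are visible in the log excerpt further down):

    # forward scrub of the MDS metadata directory -- this flagged the backtrace error
    ceph tell mds.0 scrub start ~mdsdir recursive

    # scrub again with repair -- the log reports the inode being rewritten
    ceph tell mds.0 scrub start ~mdsdir recursive,repair

    # a further scrub with repair reported nothing more
    ceph tell mds.0 scrub start ~mdsdir recursive,repair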

However the cluster health is still ERR, and the MDS still shows the damage:

ceph@xxxx1:~$ ceph tell mds.0 damage ls
2022-03-12T18:42:01.609+0000 7f1b817fa700  0 client.173985213 ms_handle_reset on v2:192.168.80.121:6824/939134894
2022-03-12T18:42:01.625+0000 7f1b817fa700  0 client.173985219 ms_handle_reset on v2:192.168.80.121:6824/939134894
[
    {
        "damage_type": "backtrace",
        "id": 3308827822,
        "ino": 256,
        "path": "~mds0"
    }
]

What are the right steps from here? Has the error actually been corrected and just needs clearing, or is it still there?
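
My guess is that, once the backtrace really has been fixed on disk, something like the following would clear the stale entry (using the id from the "damage ls" output above), but I would rather check before running it:

    ceph tell mds.0 damage rm 3308827822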

In case it is relevant: there are one active and two standby MDS daemons. The log below is from the node currently hosting rank 0.

From the mds log:

2022-03-12T18:13:41.593+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:41.593+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:13:41.601+0000 7f61cb0b1700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x100 (~mds0) see mds.xxxx1 log and `damage ls` output for details
2022-03-12T18:13:41.601+0000 7f61cb0b1700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ auth v6798 ap=1 snaprealm=0x55d59548
4800 f(v0 10=0+10) n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)/n(v0 rc2019-10-29T10:52:34.302967+0000 11=0+11) (inest lock) (iversion lock) | dirtysca
ttered=0 lock=0 dirfrag=1 openingsnapparents=0 dirty=1 authpin=1 scrubqueue=0 0x55d595486000]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked"
:true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"ch
ecked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 10=0+10)","ondisk_value.rstat":"n(v0 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","mem
ory_value.dirstat":"f(v0 10=0+10)","memory_value.rstat":"n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","error_str":""},"return_code":-61}
2022-03-12T18:13:41.601+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:45.317+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle

2022-03-12T18:13:52.881+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x100(~mds0), rewriting it
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x100 (~mds0)
2022-03-12T18:13:52.881+0000 7f61cb0b1700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ auth v6798 ap=1 snaprealm=0x55d595484800 DIRTYPARENT f(v0 10=0+10) n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)/n(v0 rc2019-10-29T10:52:34.302967+0000 11=0+11) (inest lock) (iversion lock) | dirtyscattered=0 lock=0 dirfrag=1 openingsnapparents=0 dirtyparent=1 dirty=1 authpin=1 scrubqueue=0 0x55d595486000]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 10=0+10)","ondisk_value.rstat":"n(v0 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","memory_value.dirstat":"f(v0 10=0+10)","memory_value.rstat":"n(v1815 rc2022-03-12T16:01:44.218294+0000 b1017620718 375=364+11)","error_str":""},"return_code":-61}
2022-03-12T18:13:52.881+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:13:55.317+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle

2022-03-12T18:14:12.608+0000 7f61d30c1700  1 mds.xxxx1 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive,repair]} (starting...)
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2022-03-12T18:14:12.608+0000 7f61cb0b1700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2022-03-12T18:14:15.316+0000 7f61cf8ba700  0 log_channel(cluster) log [INF] : scrub summary: idle


Thanks, Chris


On 11/03/2022 12:24, Milind Changire wrote:
Here's some answers to your questions:

On Sun, Mar 6, 2022 at 3:57 AM Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:

Hello to everyone :)

Just some question about filesystem scrubbing

In this documentation it is said that scrub will help admin check
consistency of filesystem:

https://docs.ceph.com/en/latest/cephfs/scrub/

So my questions are:

Is filesystem scrubbing mandatory?
How often should I scrub the whole filesystem (i.e. start at /)?
How often should I scrub ~mdsdir?
Should I set up a cron job?
Is filesystem scrubbing considered harmless? Even with recursive force
repair?
Is there any chance of scrubbing overloading the MDS on a big file system
(i.e. like find . -ls)?
What is the difference between "recursive repair" and "recursive force
repair"? Is "force" harmless?
Is there any way to see which file/folder the scrub operation is at? In
fact, is there any better way to see scrub progress than "scrub status",
which doesn't say much?

Sorry for all the questions, but there is not that much documentation about
filesystem scrubbing. And I do think the answers will help a lot of cephfs
administrators :)

Thanks to all

All the best

Arnaud
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


1. Is filesystem scrubbing mandatory?

   As a routine system administration practice, it is good to ensure that
   your file-system is always in a good state. To avoid the file-system
   becoming a bottleneck during work hours, it is a good idea to reserve
   some time to run a recursive forward scrub and use the built-in scrub
   automation to fix any issues it finds. Although you can start the scrub
   at any directory of your choice, it is good practice to start it at the
   file-system root once in a while.

So file-system scrubbing is not mandatory, but it is highly recommended.

Filesystem scrubbing is designed to read CephFS’ metadata and detect
inconsistencies or issues that are generated by bitrot or bugs, just as
RADOS’ pg scrubbing is. In a perfect world without bugs or bit flips it
would be unnecessary, but we don’t live in that world — so a scrub can
detect small issues before they turn into big ones, and the mere act of
reading data can keep it fresh and give storage devices a chance to correct
any media errors while that’s still possible.

We don’t have a specific recommended schedule, and scrub takes up cluster
I/O and compute resources, so its frequency should be tailored to your
workload.
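
As a minimal sketch, assuming a file system named "cephfs" with its active MDS at rank 0 (both placeholders), a recursive forward scrub of the whole tree can be queued and then watched with:

    # queue a recursive forward scrub starting at the file-system root
    ceph tell mds.cephfs:0 scrub start / recursive

    # poll until the scrub summary goes back to "idle"
    ceph tell mds.cephfs:0 scrub status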


2. How often should I scrub the whole filesystem (i.e. start at /)?

   Since you'd always want to have a consistent file-system, it would be
   good to run scrubbing:

   1. before taking a snapshot of the entire file-system, OR

   2. before taking a backup of the entire file-system, OR

   3. after significant metadata activity, e.g. after creating files,
      renaming files, deleting files, changing file attributes, etc.


There's no one-rule-fits-all scenario, so you'll need to follow a
heuristic approach. The type of devices (HDD or SSD) and the amount of
activity wearing the devices are the typical factors to weigh when
deciding how often to scrub a file-system. If you have a window dedicated
to backup activity, then you'd want to run a recursive forward scrub with
repair on the entire file-system before it is snapshotted and used for
backup, as sketched below. Although you can run a scrub alongside active
use of the file-system, it is always recommended to run the scrub on a
quiet file-system so that neither activity gets in the other's way. This
also helps the scrub complete more quickly.
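
A rough pre-backup sequence, assuming the same placeholder file-system name "cephfs" mounted at /mnt/cephfs, might look like:

    # scrub (with repair) before taking the backup snapshot
    ceph tell mds.cephfs:0 scrub start / recursive,repair

    # wait until "scrub status" reports idle, then snapshot the root
    ceph tell mds.cephfs:0 scrub status
    mkdir /mnt/cephfs/.snap/pre-backup-$(date +%F)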


3. How often should I scrub ~mdsdir?

   ~mdsdir is used to collect deleted (stray) entries. So the number of
   file/directory unlinks in a typical workload should be used to come up
   with a heuristic for how often to scrub it. This activity can be taken
   up separately from scrubbing the file-system root.
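
For instance, ~mdsdir on its own can be scrubbed with (same placeholder file-system name as above):

    # scrub only the stray/deleted-entry directory of rank 0
    ceph tell mds.cephfs:0 scrub start ~mdsdir recursive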



4. Should I set up a cron job?

   Yes, you could.
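
A possible cron entry, purely as a sketch (the schedule, file-system name and paths are assumptions to adapt):

    # /etc/cron.d/cephfs-scrub -- example only
    # weekly recursive scrub of the root, Sundays at 02:00
    0 2 * * 0  root  /usr/bin/ceph tell mds.cephfs:0 scrub start / recursive,repair
    # weekly scrub of ~mdsdir, Sundays at 04:00
    0 4 * * 0  root  /usr/bin/ceph tell mds.cephfs:0 scrub start ~mdsdir recursive,repair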


5. Is filesystem scrubbing considered harmless? Even with recursive force
   repair?

   Yes, scrubbing even with repair is harmless.

   Scrubbing with repair does the following things:

   1. Repair backtrace
      If the on-disk and in-memory backtraces don't match, the DIRTYPARENT
      flag is set so that the journal logger thread picks up the inode and
      writes the backtrace to disk.

   2. Repair inode
      If the on-disk and in-memory inode versions don't match, the inode
      is left untouched. Otherwise, if the inode is marked as "free", the
      inode number is removed from active use.

   3. Repair recursive-stats
      If the on-disk and in-memory raw stats don't match, all the stats
      for the leaves in the directory tree are marked dirty and a
      scatter-gather operation is forced to coalesce the raw-stats info.
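
In practice that means a repair pass can simply be queued and the outcome reviewed afterwards, for example (placeholder file-system name again):

    # recursive scrub with automatic repair
    ceph tell mds.cephfs:0 scrub start / recursive,repair

    # review anything that was flagged (or is still flagged)
    ceph tell mds.cephfs:0 damage ls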



6. Is there any chance of scrubbing overloading the MDS on a big file
   system, i.e. like find . -ls?

   Scrubbing on its own should not be able to overload an MDS, but it is
   an additional load on top of whatever client activity the MDS is
   serving, which could exceed the server's capacity. In short: yes, it
   might overload the MDS in sustained high-I/O scenarios.

   The MDS config option mds_max_scrub_ops_in_progress, which defaults to
   5, decides the number of scrub operations running at any given time. So
   there is a small effort at throttling.
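
The throttle can be inspected and, if needed, adjusted with the usual config commands, e.g. (the value 3 is only an illustration):

    # current value (defaults to 5)
    ceph config get mds mds_max_scrub_ops_in_progress

    # lower it on a busy cluster
    ceph config set mds mds_max_scrub_ops_in_progress 3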



7. What is the difference between "recursive repair" and "recursive force
   repair"? Is "force" harmless?

   If the "force" argument is specified, then a dirfrag is scrubbed only
   if:

   1. the dentry version is greater than the last scrub version, AND

   2. the dentry type is a DIR.

If "force" is not specified, then dirfrag scrubbing is skipped. You will
see an MDS log message saying that scrubbing was skipped for the dentry.

The rest of the scrubbing is done as described in Q5 above.
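
So a forced, repairing re-scrub of a subtree would be requested with something like (the path and file-system name are placeholders):

    # include dirfrag scrubbing per the conditions above, repairing as it goes
    ceph tell mds.cephfs:0 scrub start /some/dir recursive,force,repair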


8. Is there any way to see which file/folder the scrub operation is at? In
   fact, is there any better way to see scrub progress than "scrub
   status", which doesn't say much?

   Currently there's no way to see which file/folder is being scrubbed. At
   most we could log a line in the MDS logs about it, but that could
   quickly bloat the logs if the number of entries is large.
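
The controls that do exist are the status/pause/abort commands, e.g. (placeholder file-system name):

    # the only progress view currently available
    ceph tell mds.cephfs:0 scrub status

    # pause/resume or abort the queued scrubs if they get in the way
    ceph tell mds.cephfs:0 scrub pause
    ceph tell mds.cephfs:0 scrub resume
    ceph tell mds.cephfs:0 scrub abort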





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



