Re: Errors when scrub ~mdsdir and lots of num_strays

Hi,

Stray files are created when you have hard links to deleted files, or
snapshots of deleted files.
You need to delete the snapshots, or "reintegrate" the hard links by
recursively listing the relevant files.
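
For example, a rough sketch (assuming the filesystem is mounted at
/mnt/cephfs; the snapshot name and paths below are placeholders to adapt
to your setup):

  # remove an old snapshot
  rmdir /mnt/cephfs/.snap/<snapshot-name>

  # recursively stat the directories that still contain the remaining
  # hard links; the lookups let the MDS reintegrate the strays
  find /mnt/cephfs/dir/with/hardlinks -ls > /dev/null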

BTW, in Pacific there isn't a big problem with accumulating lots of
stray files. (Before Pacific there was a default limit of 1M strays,
but that limit has since been removed.)

Cheers, dan

On Tue, Mar 1, 2022 at 1:04 AM Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:
>
> Hello to everyone
>
> Our Ceph cluster is healthy and everything seems to be going well, but we
> have a very high num_strays count:
>
> ceph tell mds.0 perf dump | grep stray
>         "num_strays": 1990574,
>         "num_strays_delayed": 0,
>         "num_strays_enqueuing": 0,
>         "strays_created": 3,
>         "strays_enqueued": 17,
>         "strays_reintegrated": 0,
>         "strays_migrated": 0,
>
> And num_strays doesn't seem to decrease whatever we do (scrub / or scrub
> ~mdsdir).
> And when we scrub ~mdsdir (force,recursive,repair) we get errors like these
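> (For reference, the scrub is kicked off with something like
> "ceph tell mds.0 scrub start ~mdsdir recursive,force,repair"; the exact
> syntax may vary slightly by release.)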
>
>     {
>         "damage_type": "dir_frag",
>         "id": 3775653237,
>         "ino": 1099569233128,
>         "frag": "*",
>         "path": "~mds0/stray3/100036efce8"
>     },
>     {
>         "damage_type": "dir_frag",
>         "id": 3776355973,
>         "ino": 1099567262916,
>         "frag": "*",
>         "path": "~mds0/stray3/1000350ecc4"
>     },
>     {
>         "damage_type": "dir_frag",
>         "id": 3776485071,
>         "ino": 1099559071399,
>         "frag": "*",
>         "path": "~mds0/stray4/10002d3eea7"
>     },
>
> And just before the end of the ~mdsdir scrub the MDS crashes, and I have to
> run
>
> ceph mds repaired 0
>
> to bring the filesystem back online.
>
> We get a lot of them. Do you have any idea what those errors are and how I
> should handle them?
>
> We have a lot of data in our CephFS cluster (350 TB+), and we take a
> snapshot of / every day and keep them for 1 month (rolling).
>
> Here is our cluster state:
>
> ceph -s
>   cluster:
>     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
>     health: HEALTH_WARN
>             78 pgs not deep-scrubbed in time
>             70 pgs not scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
>     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys:
> ceph-g-112-1.ksojnh
>     mds: 1/1 daemons up, 1 standby
>     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   5 pools, 609 pgs
>     objects: 186.86M objects, 231 TiB
>     usage:   351 TiB used, 465 TiB / 816 TiB avail
>     pgs:     502 active+clean
>              82  active+clean+snaptrim_wait
>              20  active+clean+snaptrim
>              4   active+clean+scrubbing+deep
>              1   active+clean+scrubbing+deep+snaptrim_wait
>
>   io:
>     client:   8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
>
> My questions are about the damage found by the ~mdsdir scrub: should I
> worry about it? What does it mean? It seems to be linked to my issue of the
> high number of strays, is that right? How can I fix it, and how can I
> reduce num_strays?
>
> Thank you all
>
> All the best
>
> Arnaud
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


