Re: Errors when scrub ~mdsdir and lots of num_strays

Dan van der Ster <dvanders@xxxxxxxxx> · Tue, 1 Mar 2022 11:39:13 +0100

Hi,

There was a recent (long) thread about this. It might give you some hints:
   https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW/

And about the crash, it could be related to
https://tracker.ceph.com/issues/51824

Cheers, dan

On Tue, Mar 1, 2022 at 11:30 AM Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:
>
> Hello Dan
>
> Thanks a lot for the answer
>
> I do remove the snapshots every day (I keep them for one month),
> but "num_strays" never seems to decrease.
>
> I know I can do a listing of the folder with "find . -ls".
>
> So my question is: is there a way to find the directory causing the strays so I can run "find . -ls" on it? I would prefer not to do it on my whole cluster, as it would take a long time (several days, and more if I also need to do it on every snapshot) and would certainly overload the MDS.
>
> Please let me know if there is a way to spot the source of the strays, so I can find the folder or snapshot with the most strays.
>
> And what about the scrub of ~mdsdir, which crashes every time with this error:
>
> {
>         "damage_type": "dir_frag",
>         "id": 3776355973,
>         "ino": 1099567262916,
>         "frag": "*",
>         "path": "~mds0/stray3/1000350ecc4"
>     },
>
> Again, thanks for your help; it is really appreciated.
>
> All the best
>
> Arnaud
>
> On Tue, Mar 1, 2022 at 11:02 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> stray files are created when you have hardlinks to deleted files, or
>> snapshots of deleted files.
>> You need to delete the snapshots, or "reintegrate" the hardlinks by
>> recursively listing the relevant files.
>>
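
The "recursively listing" step above amounts to stat'ing every entry under a directory, which is what "find . -ls" does. A minimal Python sketch of that walk follows. The function name stat_tree is made up for illustration, and whether a stat alone is enough to make the MDS reintegrate a stray is an assumption taken from the advice above, not something the code checks.

```python
import os
import tempfile


def stat_tree(root):
    """lstat every entry under root (like `find root -ls`); return the count."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                # Entry vanished or is unreadable; skip it and keep walking.
                pass
    return count


# Small self-contained demo on a throwaway directory (not a CephFS mount).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(tmp, rel), "w").close()
    print(stat_tree(tmp))  # 3 entries: sub, a.txt, sub/b.txt
```

On a real filesystem this would be pointed at one subtree at a time rather than the whole mount, to limit the load on the MDS.
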
>> BTW, in Pacific there isn't a big problem with accumulating lots of
>> stray files. (Before Pacific there was a default limit of 1M strays,
>> but that is now removed.)
>>
>> Cheers, dan
>>
>> On Tue, Mar 1, 2022 at 1:04 AM Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:
>> >
>> > Hello to everyone
>> >
>> > Our Ceph cluster is healthy and everything seems to be going well, but
>> > num_strays is very high:
>> >
>> > ceph tell mds.0 perf dump | grep stray
>> >         "num_strays": 1990574,
>> >         "num_strays_delayed": 0,
>> >         "num_strays_enqueuing": 0,
>> >         "strays_created": 3,
>> >         "strays_enqueued": 17,
>> >         "strays_reintegrated": 0,
>> >         "strays_migrated": 0,
>> >
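
To watch these counters over time without grep, the perf dump JSON can be parsed directly. The following is a minimal sketch, not a supported tool: the sample values are the ones quoted above, the section name "mds_cache" in the sample is an assumption (the code searches every section, so it does not depend on it), and stray_counters is a made-up helper name.

```python
import json

# Sample shaped like `ceph tell mds.0 perf dump` output; values are from this thread.
# ASSUMPTION: the section name "mds_cache" is illustrative only.
SAMPLE = json.dumps({
    "mds_cache": {
        "num_strays": 1990574,
        "num_strays_delayed": 0,
        "num_strays_enqueuing": 0,
        "strays_created": 3,
        "strays_enqueued": 17,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
    }
})


def stray_counters(perf_dump_json):
    """Collect every numeric counter whose name contains 'stray', from any section."""
    found = {}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if "stray" in key and isinstance(value, (int, float)):
                    found[key] = value
                else:
                    walk(value)

    walk(json.loads(perf_dump_json))
    return found


counters = stray_counters(SAMPLE)
for name in sorted(counters):
    print(f"{name}: {counters[name]}")
```

Comparing two such snapshots taken some time apart would show whether num_strays is moving at all and whether strays_reintegrated ever increases.
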
>> > And num_strays doesn't seem to decrease whatever we do (scrub / or scrub
>> > ~mdsdir).
>> > And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
>> >
>> > {
>> >         "damage_type": "dir_frag",
>> >         "id": 3775653237,
>> >         "ino": 1099569233128,
>> >         "frag": "*",
>> >         "path": "~mds0/stray3/100036efce8"
>> >     },
>> >     {
>> >         "damage_type": "dir_frag",
>> >         "id": 3776355973,
>> >         "ino": 1099567262916,
>> >         "frag": "*",
>> >         "path": "~mds0/stray3/1000350ecc4"
>> >     },
>> >     {
>> >         "damage_type": "dir_frag",
>> >         "id": 3776485071,
>> >         "ino": 1099559071399,
>> >         "frag": "*",
>> >         "path": "~mds0/stray4/10002d3eea7"
>> >     },
>> >
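
One detail that can be read off the entries above: the last component of each "path" is the "ino" value written in hexadecimal, and the middle component names the stray directory. A minimal sketch that checks this on the three entries quoted in this thread (the helper names are made up for illustration):

```python
# Damage entries copied from this thread ("ino" is decimal, "path" ends in a hex name).
DAMAGE = [
    {"damage_type": "dir_frag", "id": 3775653237, "ino": 1099569233128,
     "frag": "*", "path": "~mds0/stray3/100036efce8"},
    {"damage_type": "dir_frag", "id": 3776355973, "ino": 1099567262916,
     "frag": "*", "path": "~mds0/stray3/1000350ecc4"},
    {"damage_type": "dir_frag", "id": 3776485071, "ino": 1099559071399,
     "frag": "*", "path": "~mds0/stray4/10002d3eea7"},
]


def stray_name_matches_ino(entry):
    """Return True if the last path component is the inode number in hex."""
    name = entry["path"].rsplit("/", 1)[-1]
    return int(name, 16) == entry["ino"]


def stray_dir(entry):
    """Return the stray directory component of the path, e.g. 'stray3'."""
    return entry["path"].split("/")[1]


for entry in DAMAGE:
    print(stray_dir(entry), hex(entry["ino"]), stray_name_matches_ino(entry))
```

Grouping a longer damage list by stray_dir would show how the damaged entries are spread across the stray directories.
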
>> > There are a lot of these errors. Just before the end of the ~mdsdir scrub
>> > the MDS crashes, and I have to run
>> >
>> > ceph mds repaired 0
>> >
>> > to bring the filesystem back online.
>> >
>> > Do you have any idea what those errors are and how I should handle them?
>> >
>> > We have a lot of data in our CephFS cluster (350 TB+), and we take a
>> > snapshot of / every day and keep them for 1 month (rolling).
>> >
>> > Here is our cluster state:
>> >
>> > ceph -s
>> >   cluster:
>> >     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
>> >     health: HEALTH_WARN
>> >             78 pgs not deep-scrubbed in time
>> >             70 pgs not scrubbed in time
>> >
>> >   services:
>> >     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
>> >     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
>> >     mds: 1/1 daemons up, 1 standby
>> >     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
>> >
>> >   data:
>> >     volumes: 1/1 healthy
>> >     pools:   5 pools, 609 pgs
>> >     objects: 186.86M objects, 231 TiB
>> >     usage:   351 TiB used, 465 TiB / 816 TiB avail
>> >     pgs:     502 active+clean
>> >              82  active+clean+snaptrim_wait
>> >              20  active+clean+snaptrim
>> >              4   active+clean+scrubbing+deep
>> >              1   active+clean+scrubbing+deep+snaptrim_wait
>> >
>> >   io:
>> >     client:   8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
>> >
>> > My questions are about the damage found by the ~mdsdir scrub: should I
>> > worry about it? What does it mean? It seems to be linked to the high
>> > number of strays; is that right? How can I fix it, and how can I reduce
>> > num_strays?
>> >
>> > Thanks for everything
>> >
>> > All the best
>> >
>> > Arnaud
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx