I am using Ceph Pacific (16.2.5). Does anyone have an idea about my issues?

Thanks again to everyone.

All the best

Arnaud

On Tue, 1 Mar 2022 at 01:04, Arnaud M <arnaud.meauzoone@xxxxxxxxx> wrote:

> Hello to everyone,
>
> Our Ceph cluster is healthy and everything seems to be going well, but we
> have a lot of strays:
>
> ceph tell mds.0 perf dump | grep stray
>     "num_strays": 1990574,
>     "num_strays_delayed": 0,
>     "num_strays_enqueuing": 0,
>     "strays_created": 3,
>     "strays_enqueued": 17,
>     "strays_reintegrated": 0,
>     "strays_migrated": 0,
>
> num_strays does not seem to decrease whatever we do (scrub /, or scrub
> ~mdsdir), and when we scrub ~mdsdir (force,recursive,repair) we get these
> errors:
>
>     {
>         "damage_type": "dir_frag",
>         "id": 3775653237,
>         "ino": 1099569233128,
>         "frag": "*",
>         "path": "~mds0/stray3/100036efce8"
>     },
>     {
>         "damage_type": "dir_frag",
>         "id": 3776355973,
>         "ino": 1099567262916,
>         "frag": "*",
>         "path": "~mds0/stray3/1000350ecc4"
>     },
>     {
>         "damage_type": "dir_frag",
>         "id": 3776485071,
>         "ino": 1099559071399,
>         "frag": "*",
>         "path": "~mds0/stray4/10002d3eea7"
>     },
>
> And a lot more like them. Do you have any idea what these errors are and
> how I should handle them?
>
> Just before the end of the ~mdsdir scrub, the MDS crashes and I have to
> run "ceph mds repaired 0" to bring the filesystem back online.
>
> We have a lot of data in our CephFS cluster (350+ TB), and we take a
> snapshot of / every day and keep each one for 1 month (rolling).
>
> Here is our cluster state:
>
> ceph -s
>   cluster:
>     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
>     health: HEALTH_WARN
>             78 pgs not deep-scrubbed in time
>             70 pgs not scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
>     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
>     mds: 1/1 daemons up, 1 standby
>     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   5 pools, 609 pgs
>     objects: 186.86M objects, 231 TiB
>     usage:   351 TiB used, 465 TiB / 816 TiB avail
>     pgs:     502 active+clean
>              82  active+clean+snaptrim_wait
>              20  active+clean+snaptrim
>              4   active+clean+scrubbing+deep
>              1   active+clean+scrubbing+deep+snaptrim_wait
>
>   io:
>     client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
>
> My questions are about the damage found by the ~mdsdir scrub: should I
> worry about it, and what does it mean? It seems to be linked to my issue
> with the high number of strays; is that right? How can I fix it, and how
> can I reduce num_strays?
>
> Thanks for everything.
>
> All the best
>
> Arnaud
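P.S. For anyone following this thread, here is a minimal sketch of how the
stray counter can be watched over time. It assumes the active MDS is rank 0
(as in the commands above) and that jq is installed; adjust both to your
setup:

    # Poll num_strays once a minute; the counter sits in the
    # "mds_cache" section of the MDS perf counters.
    while true; do
        date
        ceph tell mds.0 perf dump | jq '.mds_cache.num_strays'
        sleep 60
    done

The damage entries above can also be listed outside of a scrub, since the
MDS keeps them in its damage table (again assuming rank 0):

    # Dump the damage table as JSON; entries carry the same
    # "damage_type", "id", "ino" and "path" fields as the scrub output.
    ceph tell mds.0 damage ls

Individual entries can be dropped from that table with "damage rm <id>",
but only once the underlying metadata has actually been repaired.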