Hello everyone,

Our Ceph cluster is healthy and everything seems to be going well, but we have a very large number of strays:

# ceph tell mds.0 perf dump | grep stray
    "num_strays": 1990574,
    "num_strays_delayed": 0,
    "num_strays_enqueuing": 0,
    "strays_created": 3,
    "strays_enqueued": 17,
    "strays_reintegrated": 0,
    "strays_migrated": 0,

num_strays does not seem to decrease whatever we do (scrub of / or scrub of ~mdsdir).

When we scrub ~mdsdir (force,recursive,repair) we get a lot of errors like these:

    {
        "damage_type": "dir_frag",
        "id": 3775653237,
        "ino": 1099569233128,
        "frag": "*",
        "path": "~mds0/stray3/100036efce8"
    },
    {
        "damage_type": "dir_frag",
        "id": 3776355973,
        "ino": 1099567262916,
        "frag": "*",
        "path": "~mds0/stray3/1000350ecc4"
    },
    {
        "damage_type": "dir_frag",
        "id": 3776485071,
        "ino": 1099559071399,
        "frag": "*",
        "path": "~mds0/stray4/10002d3eea7"
    },

And just before the end of the ~mdsdir scrub, the MDS crashes and I have to run "ceph mds repaired 0" to bring the filesystem back online.

Do you have any idea what these errors are and how I should handle them?
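In case it helps anyone reproduce this, here is a minimal sketch of how we pull the stray counters out of the perf dump. It assumes the stray counters live under the "mds_cache" section of the JSON that "ceph tell mds.0 perf dump" emits (that is where our cluster reports them); the sample data below is just the numbers quoted above, not live output:

    import json

    def stray_counters(dump_json: str) -> dict:
        """Extract the stray-related counters from an MDS perf dump.

        Assumes the counters sit under the "mds_cache" section of the
        perf dump JSON (true on our cluster; hedged, not guaranteed
        across all releases).
        """
        dump = json.loads(dump_json)
        cache = dump.get("mds_cache", {})
        return {k: v for k, v in cache.items() if "stray" in k}

    if __name__ == "__main__":
        # On a live cluster you would feed it real output, e.g.:
        #   raw = subprocess.check_output(
        #       ["ceph", "tell", "mds.0", "perf", "dump"])
        # Sample built from the numbers quoted above:
        raw = json.dumps({"mds_cache": {
            "num_strays": 1990574,
            "num_strays_delayed": 0,
            "num_strays_enqueuing": 0,
            "strays_created": 3,
            "strays_enqueued": 17,
            "strays_reintegrated": 0,
            "strays_migrated": 0,
        }})
        print(stray_counters(raw))

We run something like this periodically to see whether num_strays ever moves; so far it does not.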
We have a lot of data in our CephFS cluster (350+ TB). We take a snapshot of / every day and keep the snapshots for 1 month (rolling).

Here is our cluster state:

# ceph -s
  cluster:
    id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
    health: HEALTH_WARN
            78 pgs not deep-scrubbed in time
            70 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
    mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
    mds: 1/1 daemons up, 1 standby
    osd: 67 osds: 67 up (since 14m), 67 in (since 7d)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 609 pgs
    objects: 186.86M objects, 231 TiB
    usage:   351 TiB used, 465 TiB / 816 TiB avail
    pgs:     502 active+clean
             82  active+clean+snaptrim_wait
             20  active+clean+snaptrim
             4   active+clean+scrubbing+deep
             1   active+clean+scrubbing+deep+snaptrim_wait

  io:
    client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr

My questions are about the damage found by the ~mdsdir scrub:

- Should I worry about it? What does it mean?
- It seems to be linked to my issue with the high number of strays, is that right?
- How can I fix it, and how can I reduce num_strays?

Thanks for everything,
All the best,
Arnaud
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx