Ceph MDS failing because of corrupted dentries in lost+found after update from 17.2.7 to 18.2.0

After we updated our Ceph cluster from 17.2.7 to 18.2.0, the MDS kept being
marked as damaged and stuck in up:standby, with these errors in the log:

debug    -12> 2024-07-14T21:22:19.962+0000 7f020cf3a700  1
mds.0.cache.den(0x4 1000b3bcfea) loaded already corrupt dentry:
[dentry #0x1/lost+found/1000b3bcfea [head,head] rep@0.0 NULL (dversion
lock) pv=0 v=2 ino=(nil) state=0 0x558ca63b6500]
debug    -11> 2024-07-14T21:22:19.962+0000 7f020cf3a700 10
mds.0.cache.dir(0x4) go_bad_dentry 1000b3bcfea
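
For anyone checking a similar state on their own cluster, these are the
standard commands for seeing the damaged rank and the recorded damage
entries (the daemon name mds.0 below is just an example, not copied from
our setup):

    # Show filesystem / MDS state (rank marked "damaged", daemons stuck in up:standby)
    ceph fs status
    ceph health detail

    # List the damage entries the MDS has recorded (adjust the daemon name)
    ceph tell mds.0 damage ls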

These log lines are repeated many times in our MDS logs, all for dentries
within the lost+found directory. After reading this mailing list
post <https://www.spinics.net/lists/ceph-users/msg77325.html>, we
tried ceph config set mds mds_go_bad_corrupt_dentry false. This seemed to
work around the issue at first, but after a few seconds the MDS crashes.
Our 3 MDS daemons are now stuck in a cycle of active -> crash -> standby ->
back to active, and because of this our CephFS is extremely laggy.
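
As a rough sketch of the commands involved (the mds repaired step is the
usual way to let a previously damaged rank become active again; the fs name
"cephfs" and rank 0 are placeholders, adjust for your cluster):

    # Workaround from the linked thread: don't mark the rank damaged on corrupt dentries
    ceph config set mds mds_go_bad_corrupt_dentry false

    # Allow a rank that was marked damaged to be taken over again (placeholder fs name/rank)
    ceph mds repaired cephfs:0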

We read here <https://docs.ceph.com/en/latest/releases/reef/#cephfs> that
Reef now makes it possible to delete the lost+found directory, which might
solve our problem, but the directory is inaccessible: cd, ls, rm, etc. all fail.
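
To be concrete about "inaccessible", with the filesystem mounted (the mount
point /mnt/cephfs below is just an example) none of the following get
anywhere for us:

    # /mnt/cephfs is an example mount point
    cd /mnt/cephfs/lost+found
    ls /mnt/cephfs/lost+found
    sudo rm -rf /mnt/cephfs/lost+found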

Has anyone seen this type of issue, or does anyone know how to solve it? Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


