Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files

Sebastian Knust <sknust@xxxxxxxxxxxxxxxxxxxxxxx> · Wed, 29 Nov 2023 21:11:15 +0100

Hello Patrick,

On 27.11.23 19:05, Patrick Donnelly wrote:

> I would **really** love to see the debug logs from the MDS. Please
> upload them using ceph-post-file [1]. If you can reliably reproduce,
> turn on more debugging:
>
> ceph config set mds debug_mds 20
> ceph config set mds debug_ms 1
>
> [1] https://docs.ceph.com/en/reef/man/8/ceph-post-file/

I have uploaded the debug log and a core dump; the ceph-post-file ID is
02f78445-7136-44c9-a362-410de37a0b7d.
Unfortunately, we cannot easily shut down normal access to the cluster
for these tests, so there is quite a bit of clutter in the logs. The
logs show three crashes, the last one with core dumping enabled (ulimits
set to unlimited).

A note on reproducibility: To recreate the crash, reading the contents 
of the file prior to removal seems necessary. Simply calling stat on the 
file and then performing the removal also yields an Input/output error 
but does not crash the MDS.
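
In case it helps anyone trying to reproduce this, here is a rough Python
sketch of the two access sequences described above. The helper names are
made up for illustration and this is not the exact procedure I ran; it
only records which step fails with which errno. Note that the
read-then-remove sequence is the one that crashed our MDS, so only point
it at an affected file if you are prepared for that.

```python
import errno
import os


def _try(step, func):
    """Run one step and return (step, None) on success or (step, errno name) on OSError."""
    try:
        func()
        return (step, None)
    except OSError as exc:
        return (step, errno.errorcode.get(exc.errno, str(exc.errno)))


def _read_all(path, chunk_size=1 << 20):
    """Read the whole file in chunks and discard the data."""
    with open(path, "rb") as fh:
        while fh.read(chunk_size):
            pass


def stat_then_remove(path):
    """Sequence that gave us an Input/output error without crashing the MDS."""
    return [
        _try("stat", lambda: os.stat(path)),
        _try("unlink", lambda: os.unlink(path)),
    ]


def read_then_remove(path):
    """Sequence that crashed the MDS here: read the contents, then remove."""
    return [
        _try("read", lambda: _read_all(path)),
        _try("unlink", lambda: os.unlink(path)),
    ]
```

Calling stat_then_remove() on an affected path under the CephFS mount
returns a list of (step, errno name) pairs, with None marking a step
that succeeded; a reported Input/output error would appear as "EIO".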

Interestingly, the MDS_DAMAGE flag is reset on restart of the MDS and 
only comes back once the files in question are accessed (stat call is 
sufficient).
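
To watch the flag come and go around an MDS restart, something like the
following could be used. This is only a sketch: it assumes that the
output of "ceph health detail --format json" has a top-level "checks"
object keyed by health check code, and the function names are
hypothetical.

```python
import json
import subprocess


def current_health_json():
    """Return the cluster health as a JSON string (requires the ceph CLI and a keyring)."""
    result = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def mds_damage_present(health_json):
    """True if MDS_DAMAGE is among the active health checks.

    Assumes the JSON layout {"status": ..., "checks": {"<CODE>": {...}}}.
    """
    health = json.loads(health_json)
    return "MDS_DAMAGE" in health.get("checks", {})
```

Running mds_damage_present(current_health_json()) once after the MDS
restart and again after a stat on one of the affected files should show
the flag flipping, if the behaviour matches what we see here.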

For now, I'll hold off on running first-damage.py to try to remove the
affected files / inodes. Ultimately, however, this seems to be the most
sensible solution to me, at least with regard to cluster downtime.

Cheers
Sebastian