Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files

Hi Sebastian,

On Wed, Nov 29, 2023 at 3:11 PM Sebastian Knust
<sknust@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hello Patrick,
>
> On 27.11.23 19:05, Patrick Donnelly wrote:
> >
> > I would **really** love to see the debug logs from the MDS. Please
> > upload them using ceph-post-file [1]. If you can reliably reproduce,
> > turn on more debugging:
> >
> >> ceph config set mds debug_mds 20
> >> ceph config set mds debug_ms 1
> >
> > [1] https://docs.ceph.com/en/reef/man/8/ceph-post-file/
> >
>
> Uploaded debug log and core dump, see ceph-post-file:
> 02f78445-7136-44c9-a362-410de37a0b7d
> Unfortunately, we cannot easily shut down normal access to the cluster
> for these tests, so there is quite a bit of clutter in the logs. The
> logs show three crashes, the last one with core dumping enabled
> (ulimits set to unlimited).
>
> A note on reproducibility: To recreate the crash, reading the contents
> of the file prior to removal seems necessary. Simply calling stat on the
> file and then performing the removal also yields an Input/output error
> but does not crash the MDS.
>
> Interestingly, the MDS_DAMAGE flag is reset on restart of the MDS and
> only comes back once the files in question are accessed (a stat call
> is sufficient).
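
In other words, the reproducer boils down to roughly the following
(mountpoint and path are illustrative):

    cat /mnt/cephfs/<damaged-file>   # read the contents first
    rm /mnt/cephfs/<damaged-file>    # -> Input/output error, MDS aborts

whereas stat followed by rm returns the same Input/output error but
leaves the MDS running.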

I've not yet fully reviewed the logs, but it seems there is a bug in
the detection logic that causes a spurious abort. This does not
appear to be new damage.

Are you using postgres? If you can share details about your snapshot
workflow and general workloads, that would be helpful (privately, if
desired).

> For now, I'll hold off on running first-damage.py to try to remove the
> affected files / inodes. Ultimately however, this seems to be the most
> sensible solution to me, at least with regards to cluster downtime.

Please give me another day to review, then feel free to use
first-damage.py to clean up. If you see new damage, please upload the
logs.
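
For reference, the first-damage.py workflow documented in the
disaster-recovery notes is roughly the following (pool and fs names
are placeholders; verify the flags against the script's --help before
running):

    ceph fs fail <fs_name>                    # take the fs offline first
    # dry run: scan the metadata pool and record damaged dentries
    python3 first-damage.py --memo run.1 <metadata_pool>
    # second pass with --remove to actually delete the damaged dentries
    python3 first-damage.py --memo run.2 --remove <metadata_pool>
    ceph fs set <fs_name> joinable true       # bring the fs back online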

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



