Re: MDS crashes to damaged metadata

Hi Patrick,

it has been a year now and we have not had a single crash since upgrading to 16.2.13. We still have the 19 corrupted files reported by 'damage ls'. Is it now possible to delete the corrupted files without taking the filesystem offline?
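For reference, the damage table can be inspected (and individual entries cleared) through the MDS tell interface; the commands below are a sketch against a running cluster and assume rank 0 of the default filesystem. Note that `damage rm` only removes the entry from the damage table, it does not repair the underlying metadata.

```shell
# List the damage entries recorded by rank 0; each entry carries a
# numeric "id" field that identifies it in the damage table.
ceph tell mds.0 damage ls

# Clear a single entry by its id ("1234" is a placeholder). This only
# resets the damaged flag; the on-disk metadata is left untouched.
ceph tell mds.0 damage rm 1234
```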

On 22.05.2023 at 20:23, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

Hi Felix,

On Sat, May 13, 2023 at 9:18 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:

Hi Patrick,

we have been running one daily snapshot since December, and our CephFS has crashed 3 times because of this: https://tracker.ceph.com/issues/38452

We currently have 19 files with corrupt metadata found by your first-damage.py script. We have isolated these files from user access and are waiting for a fix before we remove them with your script (or maybe a new way?)

No other fix is anticipated at this time. Probably one will be
developed after the cause is understood.

Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading the MDS servers, cluster health went to ERROR with MDS_DAMAGE. 'ceph tell mds.0 damage ls' is showing me the same files as your script (initially only a part; after a CephFS scrub, all of them).

This is expected. Once the dentries are marked damaged, the MDS won't
allow operations on those files (like those triggering tracker
#38452).
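The scrub that surfaced the remaining entries can be driven through the same tell interface; a minimal sketch, assuming rank 0 and a scrub starting at the filesystem root:

```shell
# Start a recursive forward scrub from the root directory on rank 0;
# this revisits dentries and records any damage it finds.
ceph tell mds.0 scrub start / recursive

# Check progress of the running scrub.
ceph tell mds.0 scrub status
```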

I noticed "mds: catch damage to CDentry's first member before persisting (issue#58482, pr#50781, Patrick Donnelly)" in the changelog for 16.2.13 and would like to ask you the following questions:

a) can we repair the damaged files online now instead of bringing down the whole fs and using the python script?

Not yet.

b) should we set one of the new MDS options in our specific case to avoid our file server crashing because of the wrong snap ids?

Has your MDS crashed, or has it just marked the dentries damaged? If you
can reproduce a crash with detailed logs (debug_mds=20), that would be
incredibly helpful.
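One way to capture such logs is to raise the MDS debug level through the central config database before reproducing the crash; a sketch (the debug_ms setting is an additional, commonly paired assumption, not something Patrick asked for):

```shell
# Raise MDS debug logging for all MDS daemons via the config database.
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Revert to defaults after the crash has been captured, since level 20
# logging is very verbose.
ceph config rm mds debug_mds
ceph config rm mds debug_ms
```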

c) will your patch prevent wrong snap ids in the future?

It will prevent persisting the damage.


--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Kind regards
Felix Stolte

IT-Services
mailto: f.stolte@xxxxxxxxxxxxx
Tel: 02461-619243

---------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Amtsgericht Dueren, No. HR B 3498
Chairman of the Supervisory Board: MinDir Stefan Müller
Board of Directors: Prof. Dr. Astrid Lambrecht (Chair),
Karsten Beneke (Deputy Chair), Dr. Ir. Pieter Jansens
---------------------------------------------------------------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
