Hi Felix,

On Sat, May 13, 2023 at 9:18 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
>
> Hi Patrick,
>
> we have been running one daily snapshot since December and our cephfs crashed 3 times because of this: https://tracker.ceph.com/issues/38452
>
> We currently have 19 files with corrupt metadata found by your first-damage.py script. We have isolated these files from user access and are waiting for a fix before we remove them with your script (or maybe a new way?)

No other fix is anticipated at this time. Probably one will be developed after the cause is understood.

> Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading the MDS servers, cluster health went to ERROR MDS_DAMAGE. 'ceph tell mds.0 damage ls' is showing me the same files as your script (initially only a part; after a cephfs scrub, all of them).

This is expected. Once the dentries are marked damaged, the MDS won't allow operations on those files (like those triggering tracker #38452).

> I noticed "mds: catch damage to CDentry's first member before persisting (issue#58482, pr#50781, Patrick Donnelly)" in the change logs for 16.2.13 and would like to ask you the following questions:
>
> a) can we repair the damaged files online now instead of bringing down the whole fs and using the python script?

Not yet.

> b) should we set one of the new mds options in our specific case to avoid our fileserver crashing because of the wrong snap ids?

Has your MDS crashed, or has it just marked the dentries damaged? If you can reproduce a crash with detailed logs (debug_mds=20), that would be incredibly helpful.

> c) will your patch prevent wrong snap ids in the future?

It will prevent persisting the damage.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
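
For reference, one possible way to gather the detailed logs Patrick asks for is sketched below. This is only a sketch, assuming non-containerized daemons that write logs to /var/log/ceph and an MDS daemon named mds.a (a placeholder; substitute your own daemon names and filesystem name):

    # raise MDS logging to maximum verbosity on all MDS daemons (very noisy)
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

    # reproduce the crash, then check which rank/daemon failed and what is marked damaged
    ceph fs status
    ceph tell mds.0 damage ls

    # collect the log file from the host that ran the crashed daemon, e.g.
    # /var/log/ceph/ceph-mds.a.log  (hypothetical daemon name)

    # restore the default log levels afterwards
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms

Leaving debug_mds at 20 for long periods can fill the log partition quickly, so it is usually set only for the reproduction window.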