On Mon, Jun 5, 2023 at 11:48 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Hi Patrick, hi Dan!
>
> I got the MDS back and I think the issue is connected to the "newly
> corrupt dentry" bug [1]. Even though I couldn't see any particular
> reason for the SIGABRT at first, I then noticed one of these awfully
> familiar stack traces.
>
> I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM
> each (just to make sure it's not that) and then let them do their thing.
> The routine goes as follows: both replay the journal, then rank 4 goes
> into the "resolve" state, but as soon as rank 3 also starts resolving,
> they both crash.
>
> Then I set
>
>     ceph config set mds mds_abort_on_newly_corrupt_dentry false
>     ceph config set mds mds_go_bad_corrupt_dentry false
>
> and this time I was able to recover the ranks, even though "resolve" and
> "clientreplay" took forever. I uploaded a compressed log of rank 3 using
> ceph-post-file [2]. It's a log of several crash cycles, including the
> final successful attempt after changing the settings. The log
> decompresses to 815MB. I didn't censor any paths and they are not
> super-secret, but please don't share.

Probably only

    ceph config set mds mds_go_bad_corrupt_dentry false

was necessary for recovery. You don't have any logs showing it hit
those asserts?

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.

> While writing this, the metadata pool size has reduced from 6TiB back to
> 440GiB. I am starting to think that the fill-ups may also be connected
> to the corruption issue.

Extremely unlikely.

> I also noticed that the ranks 3 and 4 always
> have huge journals. An inspection using cephfs-journal-tool takes forever
> and consumes 50GB of memory in the process. Listing the events in the
> journal is impossible without running out of RAM.
> Ranks 0, 1, and 2
> don't have this problem and this wasn't a problem for ranks 3 and 4
> either before the fill-ups started happening.

So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.

Also, on the off chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:

    ceph config set mds mds_bal_interval 0

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
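
For reference, the three settings mentioned in this thread can be collected into a small dry-run sketch. The helper names `run` and `apply_mds_settings` and the `APPLY` environment variable are hypothetical, made up for illustration; only the option names and values come from the messages above. As noted in the reply, the first setting was probably not needed for recovery.

```shell
#!/bin/sh
# Sketch only: prints the commands unless APPLY=1 is set.
# run and apply_mds_settings are hypothetical helper names.

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

apply_mds_settings() {
    # Set before the ranks recovered; probably not required.
    run ceph config set mds mds_abort_on_newly_corrupt_dentry false
    # Probably the setting that mattered for recovery.
    run ceph config set mds mds_go_bad_corrupt_dentry false
    # Disable the MDS balancer (ephemeral pinning is in use).
    run ceph config set mds mds_bal_interval 0
}

apply_mds_settings
```

Run as-is, the script only echoes the commands so they can be reviewed first; setting `APPLY=1` in the environment would execute them against the cluster.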