Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

On Mon, Jun 5, 2023 at 11:48 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> Hi Patrick, hi Dan!
>
> I got the MDS back and I think the issue is connected to the "newly
> corrupt dentry" bug [1]. Even though I couldn't see any particular
> reason for the SIGABRT at first, I then noticed one of these awfully
> familiar stack traces.
>
> I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM
> each (just to make sure it's not that) and then let them do their thing.
> The routine goes as follows: both replay the journal, then rank 4 goes
> into the "resolve" state, but as soon as rank 3 also starts resolving,
> they both crash.
>
> Then I set
>
> ceph config set mds mds_abort_on_newly_corrupt_dentry false
> ceph config set mds mds_go_bad_corrupt_dentry false
>
> and this time I was able to recover the ranks, even though "resolve" and
> "clientreplay" took forever. I uploaded a compressed log of rank 3 using
> ceph-post-file [2]. It's a log of several crash cycles, including the
> final successful attempt after changing the settings. The log
> decompresses to 815MB. I didn't censor any paths and they are not
> super-secret, but please don't share.

Probably only

ceph config set mds mds_go_bad_corrupt_dentry false

was necessary for recovery. You don't have any logs showing it hit
those asserts?
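
If it helps to check, grepping the MDS logs for the abort message
should turn them up, e.g.

  # a minimal sketch, assuming the default log location; adjust the
  # path for your deployment
  grep -i "corrupt dentry" /var/log/ceph/ceph-mds.*.log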

I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.

> While writing this, the metadata pool size has reduced from 6TiB back to
> 440GiB. I am starting to think that the fill-ups may also be connected
> to the corruption issue.

Extremely unlikely.

> I also noticed that ranks 3 and 4 always
> have huge journals. An inspection using cephfs-journal-tool takes forever
> and consumes 50GB of memory in the process. Listing the events in the
> journal is impossible without running out of RAM. Ranks 0, 1, and 2
> don't have this problem and this wasn't a problem for ranks 3 and 4
> either before the fill-ups started happening.

So clearly (a) an incredible number of journal events are being logged
and (b) trimming is slow or unable to make progress. I'm looking into
why but you can help by running the attached script when the problem
is occurring so I can investigate. I'll need a tarball of the outputs.
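
In case the attachment doesn't make it through the list, here is a
rough sketch of the kind of state that's useful to capture while the
journal is growing (not the actual script; "cephfs" is a placeholder
file system name, and ranks 3/4 are the affected ones):

  mkdir -p mds-debug
  ceph fs status > mds-debug/fs-status.txt
  ceph df detail > mds-debug/df-detail.txt
  for r in 3 4; do
      # mds_log perf counters show how far trimming lags behind new events
      ceph tell mds.cephfs:$r perf dump > mds-debug/perf-dump.$r.json
      # the journal header shows write_pos vs. expire_pos/trimmed_pos
      cephfs-journal-tool --rank=cephfs:$r header get > mds-debug/journal-header.$r.json
  done
  tar czf mds-debug.tar.gz mds-debug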

Also, on the off chance this is related to the MDS balancer, please
disable it since you're using ephemeral pinning:

ceph config set mds mds_bal_interval 0
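
To double-check that it took effect (and that the pins are still what
you expect), something like the following should do, assuming the pins
were set via the ceph.dir.pin.distributed xattr; /mnt/cephfs/home is
just a placeholder path:

  # confirm the balancer is disabled
  ceph config get mds mds_bal_interval
  # confirm the distributed ephemeral pin on a pinned top-level directory
  getfattr -n ceph.dir.pin.distributed /mnt/cephfs/home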

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



