Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Fri, 9 Jun 2023 09:27:27 +0200

Hi Patrick,

> I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
> our ceph-post-file storage has been non-functional since the beginning
> of the lab outage last year. We're looking into it.

I have it here still. Any other way I can send it to you?

> Extremely unlikely.

Okay, taking your word for it. But something seems to be stalling
journal trimming. We had a similar thing yesterday evening, but at a much
smaller scale and without a noticeable pool size increase. I only got an
alert that the ceph_mds_log_ev Prometheus metric had started going up
again for a single MDS. It grew past 1M events, so I restarted it. I also
restarted the other MDS daemons, and they all immediately jumped to above
5M events and stayed there. They are, in fact, still there and have
decreased only very slightly this morning. The pool size is well within
the normal range, though, at 290GiB.
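
For context, the check behind that alert boils down to something like the
following. This is only a rough sketch: it assumes the metric is fetched
through the Prometheus HTTP API as an instant query and that each series
carries a ceph_daemon label. The function name, the threshold constant,
and the sample values are made up for illustration.

```python
from typing import Dict

# Assumed threshold: the 1M-event mark at which the MDS was restarted.
LOG_EV_THRESHOLD = 1_000_000


def mds_over_threshold(response: dict,
                       threshold: float = LOG_EV_THRESHOLD) -> Dict[str, float]:
    """Return {daemon: journal events} for every series above the threshold.

    `response` is the decoded JSON body of a Prometheus instant query for
    ceph_mds_log_ev (resultType "vector").
    """
    flagged: Dict[str, float] = {}
    for series in response.get("data", {}).get("result", []):
        # The label name "ceph_daemon" is an assumption about the exporter.
        daemon = series.get("metric", {}).get("ceph_daemon", "unknown")
        # An instant-vector sample is a [timestamp, "value-as-string"] pair.
        _timestamp, raw_value = series["value"]
        events = float(raw_value)
        if events > threshold:
            flagged[daemon] = events
    return flagged


if __name__ == "__main__":
    # Made-up sample response, not real cluster data.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"ceph_daemon": "mds.a"}, "value": [0, "5200000"]},
                {"metric": {"ceph_daemon": "mds.b"}, "value": [0, "1200"]},
            ],
        },
    }
    for daemon, events in mds_over_threshold(sample).items():
        print(f"{daemon}: {events:.0f} journal events")
```

Keeping the threshold as a parameter makes it easy to tell the 1M mark
from the 5M plateau without touching the parsing.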

> So clearly (a) an incredible number of journal events are being logged
> and (b) trimming is slow or unable to make progress. I'm looking into
> why but you can help by running the attached script when the problem
> is occurring so I can investigate. I'll need a tarball of the outputs.

How do I send it to you if not via ceph-post-file?

> Also, on the off chance this is related to the MDS balancer, please
> disable it since you're using ephemeral pinning:
>
> ceph config set mds mds_bal_interval 0

Done.

Thanks for your help!
Janek

--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de