Re: MDS lost, Filesystem degraded and wont mount

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Mon, 7 Dec 2020 11:51:34 -0800

On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> On 05/12/2020 09:26, Dan van der Ster wrote:
> > Hi Janek,
> >
> > I'd love to hear your standard maintenance procedures. Are you
> > cleaning up those open files outside of "rejoin" OOMs ?
>
> No, of course not. But those rejoin problems happen more often than I'd
> like them to. It has become much better with recent releases, but if one
> of the clients trains a TensorFlow model from files in the CephFS or
> when our Hadoop cluster starts reading from it, the MDS will almost
> certainly crash or at least degrade massively in performance. S3 doesn't
> have these problems at all, obviously.

It sounds like one or a few clients are acquiring too many caps. Have
you checked this? Are there any messages from the OOM killer? What
MDS config changes have you made?
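
One way to check the per-client cap counts is to rank the sessions reported by the MDS. The following is a minimal sketch, not a definitive tool: it assumes the JSON printed by "ceph tell mds.<name> session ls" is a list of session objects carrying "id", "num_caps" and "client_metadata.hostname" fields, which is what recent releases appear to emit; please verify against your version. The helper name top_cap_holders is made up for illustration.

```python
import json


def top_cap_holders(session_ls_json, limit=5):
    """Return (num_caps, client_id, hostname) tuples, largest cap count first.

    session_ls_json is the raw JSON text of an MDS "session ls" dump.
    The field names used here are assumptions about that output.
    """
    sessions = json.loads(session_ls_json)
    rows = []
    for session in sessions:
        metadata = session.get("client_metadata", {})
        rows.append((
            session.get("num_caps", 0),      # caps currently held by this client
            session.get("id"),               # client session id
            metadata.get("hostname", "?"),   # hostname reported by the client, if any
        ))
    # Sort on the cap count only, so missing ids or hostnames cannot break the sort.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

Save the output of the session listing to a file (or capture it some other way), pass its contents to top_cap_holders(), and look at whether one or two clients hold far more caps than the rest.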

I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D