On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> On 05/12/2020 09:26, Dan van der Ster wrote:
> > Hi Janek,
> >
> > I'd love to hear your standard maintenance procedures. Are you
> > cleaning up those open files outside of "rejoin" OOMs ?
>
> No, of course not. But those rejoin problems happen more often than I'd
> like them to. It has become much better with recent releases, but if one
> of the clients trains a Tensorflow model from files in the CephFS or
> when our Hadoop cluster starts reading from it, the MDS will almost
> certainly crash or at least degrade massively in performance. S3 doesn't
> have these problems at all, obviously.

This sounds like there is one or a few clients acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes for the MDS have you made?

I'm hopeful your problems will be addressed by:
https://tracker.ceph.com/issues/47307

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
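
For anyone wanting to check whether one or a few clients hold most of the caps, here is a minimal sketch, not an official tool. It assumes you have saved the JSON output of `ceph tell mds.<name> session ls` and that each session entry carries `id`, `num_caps`, and `client_metadata.hostname`; field names may differ between releases. The helper name `top_cap_holders` and the sample data are made up for illustration.

```python
import json


def top_cap_holders(session_ls_json, limit=5):
    """Return (client id, hostname, num_caps) tuples, largest cap count first.

    Assumes the input is a JSON array of session objects, as printed by
    `ceph tell mds.<name> session ls` (field names are an assumption).
    """
    sessions = json.loads(session_ls_json)
    rows = []
    for session in sessions:
        # client_metadata or hostname may be missing for some sessions
        hostname = session.get("client_metadata", {}).get("hostname", "unknown")
        rows.append((session.get("id"), hostname, session.get("num_caps", 0)))
    # Sort by cap count, descending, so the heaviest clients come first
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Hypothetical sample data, not taken from a real cluster
SAMPLE = json.dumps([
    {"id": 4101, "num_caps": 1200, "client_metadata": {"hostname": "web01"}},
    {"id": 4102, "num_caps": 950000, "client_metadata": {"hostname": "train01"}},
    {"id": 4103, "num_caps": 30},
])

if __name__ == "__main__":
    for client_id, hostname, num_caps in top_cap_holders(SAMPLE, limit=2):
        print(f"client.{client_id} {hostname} caps={num_caps}")
```

A client whose cap count is orders of magnitude above the rest would be the first candidate to look at.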