Just to follow up with an anecdote -- I had asked the question because we had to do a planned failover of one of our MDSs. The intervention went well, and we didn't need to remove the openfiles table objects. We stopped the active mds.0, then the standby took over -- the rejoin step took around 5 minutes, and during that time the MDS memory ballooned to 41GB (10x the configured cache memory limit of 4GB). Thankfully the machine had 64GB, so it didn't go OOM this time.

Best Regards,

Dan

On Thu, Jan 21, 2021 at 4:51 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> During rejoin, an MDS can sometimes go OOM if the openfiles table is too large.
> The workaround has been described by the ceph devs as "rados rm -p
> cephfs_metadata mds0_openfiles.0".
>
> On our cluster we have several such objects for rank 0:
>
> mds0_openfiles.0 exists with size: 199978
> mds0_openfiles.1 exists with size: 153650
> mds0_openfiles.2 exists with size: 40987
> mds0_openfiles.3 exists with size: 7746
> mds0_openfiles.4 exists with size: 413
>
> If we suffer such an OOM, do we need to rm *all* of those objects, or
> only the `.0` object?
>
> Best Regards,
>
> Dan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
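For reference, the workaround quoted above can be generalized to every shard of the rank-0 openfiles table rather than just `.0`. This is only a sketch, not an answer to the question in the thread: it assumes the pool is named `cephfs_metadata` (as in the original command), that the MDS holding rank 0 is stopped first, and that all `mds0_openfiles.*` shards are safe to remove in your situation -- verify against Ceph documentation before running the `rm` step.

```shell
# List every openfiles shard for rank 0 (dry run -- inspect before removing).
# Pool name "cephfs_metadata" is taken from the thread; adjust for your cluster.
rados -p cephfs_metadata ls | grep -E '^mds0_openfiles\.[0-9]+$'

# Remove each matching shard, one object at a time.
# Only do this with the rank-0 MDS stopped, per the workaround described above.
rados -p cephfs_metadata ls | grep -E '^mds0_openfiles\.[0-9]+$' \
  | while read -r obj; do
      rados -p cephfs_metadata rm "$obj"
    done
```

The grep anchors (`^` and `$`) keep the filter from matching objects of other ranks, e.g. `mds1_openfiles.0`, so only rank 0's table is touched.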