On Wed, Oct 2, 2019 at 9:48 AM Stefan Kooman <stefan@xxxxxx> wrote:
> According to [1] there are new parameters in place to make the MDS
> behave more stably. Quoting that blog post: "One of the more recent
> issues we've discovered is that an MDS with a very large cache (64+GB)
> will hang during certain recovery events."
>
> For all of us not (yet) running Nautilus, I wonder what the best
> course of action is to prevent an unstable MDS during recovery
> situations.
>
> Artificially limit "mds_cache_memory_limit" to, say, 32 GB?

Reduce the MDS cache size. The Mimic backport will probably make the
next minor release: https://github.com/ceph/ceph/pull/28452

> I wonder if the number of clients influences whether an MDS gets
> overwhelmed by release messages. Or are a handful of clients (with
> millions of caps) able to overload an MDS?

Just one client with millions of caps could cause issues. (A sketch for
spotting such clients follows at the end of this message.)

> Is there a way, other than unmounting cephfs on the clients, to
> decrease the number of caps the MDS has handed out, before an upgrade
> to a newer Ceph release is undertaken when running Luminous / Mimic?

Incrementally reduce the cache size using a script; see the second
sketch at the end of this message.

> I'm assuming you need to restart the MDS to make
> "mds_cache_memory_limit" effective, is that correct?

No, it is respected at runtime.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat
Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
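A rough sketch for finding cap-heavy clients via the MDS admin socket.
Assumptions: it runs on the MDS host, "mds.a" stands in for your
daemon's name, and the "num_caps" and "client_metadata" fields match
your release's "session ls" output (they do on Luminous-era versions,
but verify against yours):

#!/usr/bin/env python3
"""List CephFS client sessions sorted by cap count, largest first,
so the heaviest cap holders are visible before an upgrade."""

import json
import subprocess

MDS = "mds.a"  # placeholder: substitute your MDS daemon name

# Query the admin socket for all client sessions (JSON output).
out = subprocess.check_output(["ceph", "daemon", MDS, "session", "ls"])
sessions = json.loads(out)

# Sort by the number of caps each client currently holds.
for s in sorted(sessions, key=lambda s: s.get("num_caps", 0),
                reverse=True):
    host = s.get("client_metadata", {}).get("hostname", "unknown")
    print(f"{s.get('num_caps', 0):>10}  client.{s.get('id')}  {host}")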
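And a sketch of the incremental cache reduction. The start, target,
step, and pause values are illustrative, and "mds.a" is again a
placeholder; injectargs applies the change at runtime (matching the
answer above that no restart is needed), though it does not persist
across restarts unless the value is also set in ceph.conf:

#!/usr/bin/env python3
"""Lower mds_cache_memory_limit in steps so clients release caps
gradually instead of all at once."""

import subprocess
import time

MDS = "mds.a"          # placeholder: substitute your MDS daemon name
START = 64 * 2**30     # current limit: 64 GiB
TARGET = 32 * 2**30    # desired limit: 32 GiB
STEP = 4 * 2**30       # shrink by 4 GiB per iteration
PAUSE = 300            # seconds between steps

limit = START
while limit > TARGET:
    limit = max(limit - STEP, TARGET)
    # injectargs changes the setting at runtime on Luminous/Mimic;
    # on Nautilus+ "ceph config set mds ..." works as well.
    subprocess.run(
        ["ceph", "tell", MDS, "injectargs",
         f"--mds_cache_memory_limit={limit}"],
        check=True,
    )
    time.sleep(PAUSE)  # give clients time to release caps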