Re: Reducing RAM usage on production MDS

Dylan McCulloch <dmc@xxxxxxxxxxxxxx> · Wed, 10 Jun 2020 08:08:25 +0000

Thanks very much for the responses and guidance. For some belated closure (and for the archives): we gradually lowered mds_cache_memory_limit a few hundred MB at a time while monitoring the MDS, and everything was fine.
The new mds_cache_memory_limit is 318208819200 bytes (about 318GB).
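
For anyone wanting to script a similar gradual reduction, here is a rough sketch of how the intermediate limits could be generated. It is only an illustration under stated assumptions, not the exact procedure used here: the 500 MB step, the decimal-GB starting value, and the function names are hypothetical, `foo` is a placeholder MDS name, and the `injectargs` form is one of the ways to change a setting at runtime on Luminous. Each step should be followed by a pause to check MDS memory and client behaviour before moving on.

```python
def limit_schedule(start, target, step):
    """Return successively lower limits (bytes) from start down to target.

    The last entry is always exactly `target`; intermediate entries are
    `step` bytes apart. Returns an empty list if start == target.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if target > start:
        raise ValueError("target must not exceed start")
    limits = []
    current = start
    # Keep stepping down while a full step still stays above the target.
    while current - step > target:
        current -= step
        limits.append(current)
    if start != target:
        limits.append(target)
    return limits


def injectargs_command(mds_name, limit_bytes):
    """Build (but do not run) a runtime config change command string."""
    return (
        f"ceph tell mds.{mds_name} injectargs "
        f"'--mds_cache_memory_limit={limit_bytes}'"
    )


if __name__ == "__main__":
    MB = 10**6
    GB = 10**9
    # Assumption: the old 451GB limit is taken as decimal gigabytes.
    for limit in limit_schedule(451 * GB, 318208819200, 500 * MB)[:3]:
        print(injectargs_command("foo", limit))
```

The script only prints the commands; running them one at a time by hand keeps a human in the loop between steps.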

Cheers,
Dylan

>On Wed, May 27, 2020 at 10:09 PM Dylan McCulloch <dmc@xxxxxxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> The single active MDS on one of our Ceph clusters is close to running out of RAM.
>>
>> MDS total system RAM = 528GB
>> MDS current free system RAM = 4GB
>> mds_cache_memory_limit = 451GB
>> current mds cache usage = 426GB
>
>This mds_cache_memory_limit is way too high for the available RAM. We
>normally recommend that your RAM be 150% of your cache limit but we
>lack data for such large cache sizes.
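
The 150% rule of thumb above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are made up for illustration, and GB is assumed to mean 10^9 bytes (the ratio itself does not depend on that choice).

```python
GB = 10**9  # assumption: decimal gigabytes


def recommended_ram(cache_limit, ratio=1.5):
    """RAM suggested for a given cache limit under the 150% rule of thumb."""
    return cache_limit * ratio


def max_cache_limit(total_ram, ratio=1.5):
    """Largest cache limit that still satisfies the rule for this much RAM."""
    return total_ram / ratio


if __name__ == "__main__":
    # Figures from the original post: 451GB limit on a 528GB host.
    print(recommended_ram(451 * GB) / GB)
    print(max_cache_limit(528 * GB) / GB)
```

With the numbers from this thread, a 451GB limit would call for roughly 676GB of RAM, while 528GB of RAM supports a limit of about 352GB under the same rule.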
>
>> Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it’s possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
>>
>> Cluster is Luminous - 12.2.12
>> Running single active MDS with two standby.
>> 890 clients
>> Mix of kernel client (4.19.86) and ceph-fuse.
>> Clients are 12.2.12 (398) and 12.2.13 (3)
>
>v12.2.12 has the changes necessary to throttle MDS cache size
>reduction. You should be able to reduce mds_cache_memory_limit to any
>lower value without destabilizing the cluster.
>
>> The kernel clients have stayed under "mds_max_caps_per_client": "1048576". But the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok.
>> e.g.
>> "num_caps": 1007144398,
>> "num_caps": 1150184586,
>> "num_caps": 1502231153,
>> "num_caps": 1714655840,
>> "num_caps": 2022826512,
>
>This data from the ceph-fuse asok is actually the number of caps ever
>received, not the current number. I've created a ticket for this:
>https://tracker.ceph.com/issues/45749
>
>Look at the data from `ceph tell mds.foo session ls` instead.
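
To pick out the sessions holding the most caps from that output, something like the following could be used. It is a sketch only: it assumes the command prints a JSON array of session objects that each carry `id` and `num_caps` fields, and the function name and sample data are hypothetical.

```python
import json


def top_cap_holders(session_ls_json, limit=5):
    """Return (session id, num_caps) pairs, largest cap holders first.

    `session_ls_json` is the JSON text printed by the session listing.
    Sessions without a `num_caps` field are counted as holding 0 caps.
    """
    sessions = json.loads(session_ls_json)
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [(s.get("id"), s.get("num_caps", 0)) for s in ranked[:limit]]


if __name__ == "__main__":
    # Hypothetical sample; real output contains many more fields.
    sample = '[{"id": 4305, "num_caps": 120}, {"id": 4310, "num_caps": 98000}]'
    for session_id, caps in top_cap_holders(sample):
        print(session_id, caps)
```

In practice the JSON text would come from saving the output of `ceph tell mds.foo session ls` to a file or piping it into the script.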
>
>
>> Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
>
>The MDS won't free up RAM until the cache memory limit is reached.
>
>> What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
>
>Reduce mds_cache_memory_limit.
>
>> I’m concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
>
>That used to be the case in older versions of Luminous but not any longer.
>
>--
>Patrick Donnelly, Ph.D.
>He / Him / His
>Senior Software Engineer
>Red Hat Sunnyvale, CA
>GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx