Re: Reducing RAM usage on production MDS

Hi Dylan,

It looks like you have ~10GB of heap waiting to be released -- try
`ceph tell mds.$(hostname) heap release` to free that up.
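For reference (assuming, as in your commands below, that the MDS
daemon name matches the hostname):

  ceph tell mds.$(hostname) heap stats    # note 'Bytes in page heap freelist'
  ceph tell mds.$(hostname) heap release  # hand those freelist pages back to the OS
  ceph tell mds.$(hostname) heap stats    # the freelist figure should now be near zero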

Otherwise, I've found it safe to incrementally inject decreased
mds_cache_memory_limit values on production MDSs running v12.2.12. I'd
start by decreasing the limit just a few hundred MB at a time while
tailing the MDS log with `debug mds = 2`, or running `watch --color
ceph fs status`, to watch the cache size decrease and stabilize after
each change.
(In my case I decreased from ~16GB caches to ~4GB across 9 active
MDSs -- I moved in roughly 500MB steps per injection and there were no
slow requests or client issues.)
BTW, we also increase `mds_cache_trim_threshold` to allow the MDS to
trim more caps per 5s tick -- if you find the LRU is not trimming
quickly enough you could try 1.5-2x the default value. A rough sketch
of both steps is below.
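Each injection step looked roughly like the following (the values and
step size are only illustrative, and mds_cache_trim_threshold may not
be present on every 12.2.x build, so check `config get` first):

  # step the limit down a few hundred MB, then wait for `ceph fs status`
  # (or the mds log) to show the cache size settle before the next step
  ceph tell mds.$(hostname) injectargs '--mds_cache_memory_limit=450000000000'
  # ...repeat with progressively smaller values until you reach your target

  # optionally allow more trimming per tick; pick ~1.5-2x whatever
  # `ceph daemon mds.$(hostname -s) config get mds_cache_trim_threshold` reports
  ceph tell mds.$(hostname) injectargs '--mds_cache_trim_threshold=131072'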

If things get hairy you could increase mds_beacon_grace (on the mons
and/or the MDS) to tolerate longer missed heartbeats rather than
failing the MDS.
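That can be adjusted at runtime too, e.g. along these lines -- 60s is
purely an illustrative value, and since the mons use it to decide when
to fail an MDS, inject it there as well as on the MDS:

  ceph tell mon.* injectargs '--mds_beacon_grace=60'   # or per mon.<id>
  ceph tell mds.$(hostname) injectargs '--mds_beacon_grace=60'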

Cheers, Dan


On Thu, May 28, 2020 at 7:09 AM Dylan McCulloch <dmc@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> The single active MDS on one of our Ceph clusters is close to running out of RAM.
>
> MDS total system RAM = 528GB
> MDS current free system RAM = 4GB
> mds_cache_memory_limit = 451GB
> current mds cache usage = 426GB
>
> Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it’s possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
>
> Cluster is Luminous - 12.2.12
> Running single active MDS with two standby.
> 890 clients
> Mix of kernel client (4.19.86) and ceph-fuse.
> Clients are 12.2.12 (398) and 12.2.13 (3)
>
> The kernel clients have stayed under "mds_max_caps_per_client": "1048576". But the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok.
> e.g.
> "num_caps": 1007144398,
> "num_caps": 1150184586,
> "num_caps": 1502231153,
> "num_caps": 1714655840,
> "num_caps": 2022826512,
>
> Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
> What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
> I’m concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
> We also considered setting a reduced mds_cache_memory_limit on both standby MDSs. Would a subsequent failover to an MDS with a lower cache limit be safe?
> Some more details below and I’d be more than happy to provide additional logs.
>
> Thanks,
> Dylan
>
>
> # free -b
>               total        used        free      shared  buff/cache   available
> Mem:    540954992640 535268749312  4924698624   438284288   761544704  3893182464
> Swap:             0           0           0
>
> # ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
> {
>     "mds_cache_memory_limit": "450971566080"
> }
>
> # ceph daemon mds.$(hostname -s) cache status
> {
>     "pool": {
>         "items": 10593257843,
>         "bytes": 425176150288
>     }
> }
>
> # ceph daemon mds.$(hostname -s) dump_mempools | grep -A2 "mds_co\|anon"
>     "buffer_anon": {
>         "items": 3935,
>         "bytes": 4537932
> --
>     "mds_co": {
>         "items": 10595391186,
>         "bytes": 425255456209
>
> # ceph daemon mds.$(hostname -s) perf dump | jq '.mds_mem.rss'
> 520100552
>
> # ceph tell mds.$(hostname) heap stats
> tcmalloc heap stats:------------------------------------------------
> MALLOC:   496040753720 (473061.3 MiB) Bytes in use by application
> MALLOC: +  11085479936 (10571.9 MiB) Bytes in page heap freelist
> MALLOC: +  22568895888 (21523.4 MiB) Bytes in central cache freelist
> MALLOC: +        31744 (    0.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     34186296 (   32.6 MiB) Bytes in thread cache freelists
> MALLOC: +   2802057216 ( 2672.2 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: = 532531404800 (507861.5 MiB) Actual memory used (physical + swap)
> MALLOC: +   1315700736 ( 1254.8 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: = 533847105536 (509116.3 MiB) Virtual address space used
> MALLOC:
> MALLOC:       44496459              Spans in use
> MALLOC:             22              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
>
>
> # ceph fs status
> hpc_projects - 890 clients
> ============
> +------+--------+----------------+---------------+-------+-------+
> | Rank | State  |      MDS       |    Activity   |  dns  |  inos |
> +------+--------+----------------+---------------+-------+-------+
> |  0   | active | mds1-ceph2-qh2 | Reqs:  304 /s |  167M |  167M |
> +------+--------+----------------+---------------+-------+-------+
> +--------------------+----------+-------+-------+
> |        Pool        |   type   |  used | avail |
> +--------------------+----------+-------+-------+
> |   hpcfs_metadata   | metadata | 17.4G | 1893G |
> |     hpcfs_data     |   data   | 1014T |  379T |
> |   test_nvmemeta    |   data   |    0  | 1893G |
> | hpcfs_data_sandisk |   data   |  312T |  184T |
> +--------------------+----------+-------+-------+
>
> +----------------+
> |  Standby MDS   |
> +----------------+
> | mds3-ceph2-qh2 |
> | mds2-ceph2-qh2 |
> +----------------+
> MDS version: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx