Hi Dylan,

It looks like you have ~10GB of heap that can be released -- try `ceph tell mds.$(hostname) heap release` to free that up.

Otherwise, I've found it safe to incrementally inject decreased mds_cache_memory_limit values on production MDSs running v12.2.12. I'd start by decreasing the limit just a few hundred MB at a time, while tailing the MDS log with `debug mds = 2` or running `watch --color ceph fs status`, to confirm the cache size decreases and stabilizes after each change. (In my case I decreased from ~16GB caches to ~4GB across 9 active MDSs -- I moved in roughly 500MB steps per injection and saw no slow requests or client issues.) There's a rough sketch of that kind of injection loop at the bottom of this mail, below your quoted message.

BTW, we also increase `mds cache trim threshold` to allow the MDS to trim more caps per 5s tick -- if you find the LRU is not trimming quickly enough, you could try 1.5-2x the default value.

If things get hairy, you could increase mds_beacon_grace (on the mons and/or the MDSs) to tolerate longer missed heartbeats rather than failing the MDS.

Cheers, Dan

On Thu, May 28, 2020 at 7:09 AM Dylan McCulloch <dmc@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> The single active MDS on one of our Ceph clusters is close to running out of RAM.
>
> MDS total system RAM = 528GB
> MDS current free system RAM = 4GB
> mds_cache_memory_limit = 451GB
> current mds cache usage = 426GB
>
> Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it's possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
>
> Cluster is Luminous - 12.2.12
> Running single active MDS with two standbys.
> 890 clients
> Mix of kernel client (4.19.86) and ceph-fuse.
> Clients are 12.2.12 (398) and 12.2.13 (3)
>
> The kernel clients have stayed under "mds_max_caps_per_client": "1048576", but the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok, e.g.:
> "num_caps": 1007144398,
> "num_caps": 1150184586,
> "num_caps": 1502231153,
> "num_caps": 1714655840,
> "num_caps": 2022826512,
>
> Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
> What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
> I'm concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
> We also considered setting a reduced mds_cache_memory_limit on both of the standby MDSs. Would a subsequent failover to an MDS with a lower cache limit be safe?
> Some more details below, and I'd be more than happy to provide additional logs.
>
> Thanks,
> Dylan
>
>
> # free -b
>                total          used         free      shared  buff/cache    available
> Mem:    540954992640  535268749312   4924698624   438284288   761544704   3893182464
> Swap:              0             0            0
>
> # ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
> {
>     "mds_cache_memory_limit": "450971566080"
> }
>
> # ceph daemon mds.$(hostname -s) cache status
> {
>     "pool": {
>         "items": 10593257843,
>         "bytes": 425176150288
>     }
> }
>
> # ceph daemon mds.$(hostname -s) dump_mempools | grep -A2 "mds_co\|anon"
>     "buffer_anon": {
>         "items": 3935,
>         "bytes": 4537932
> --
>     "mds_co": {
>         "items": 10595391186,
>         "bytes": 425255456209
>
> # ceph daemon mds.$(hostname -s) perf dump | jq '.mds_mem.rss'
> 520100552
>
> # ceph tell mds.$(hostname) heap stats
> tcmalloc heap stats:------------------------------------------------
> MALLOC:   496040753720 (473061.3 MiB) Bytes in use by application
> MALLOC: +  11085479936 ( 10571.9 MiB) Bytes in page heap freelist
> MALLOC: +  22568895888 ( 21523.4 MiB) Bytes in central cache freelist
> MALLOC: +        31744 (     0.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     34186296 (    32.6 MiB) Bytes in thread cache freelists
> MALLOC: +   2802057216 (  2672.2 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: = 532531404800 (507861.5 MiB) Actual memory used (physical + swap)
> MALLOC: +   1315700736 (  1254.8 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: = 533847105536 (509116.3 MiB) Virtual address space used
> MALLOC:
> MALLOC:       44496459              Spans in use
> MALLOC:             22              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
>
>
> # ceph fs status
> hpc_projects - 890 clients
> ============
> +------+--------+----------------+---------------+-------+-------+
> | Rank | State  |      MDS       |    Activity   |  dns  |  inos |
> +------+--------+----------------+---------------+-------+-------+
> |  0   | active | mds1-ceph2-qh2 | Reqs:  304 /s |  167M |  167M |
> +------+--------+----------------+---------------+-------+-------+
> +--------------------+----------+-------+-------+
> |        Pool        |   type   |  used | avail |
> +--------------------+----------+-------+-------+
> |   hpcfs_metadata   | metadata | 17.4G | 1893G |
> |     hpcfs_data     |   data   | 1014T |  379T |
> |   test_nvmemeta    |   data   |   0   | 1893G |
> | hpcfs_data_sandisk |   data   |  312T |  184T |
> +--------------------+----------+-------+-------+
>
> +----------------+
> |  Standby MDS   |
> +----------------+
> | mds3-ceph2-qh2 |
> | mds2-ceph2-qh2 |
> +----------------+
> MDS version: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
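P.S. Here is the kind of injection loop I mean, as a rough sketch only rather than something battle-tested. It assumes you run it on the active MDS host, that the MDS daemon name matches `hostname -s` (as in your `ceph daemon` commands), and the 4GiB target, ~500MB step and 5-minute pause are placeholders you'd tune while watching `ceph fs status` and the MDS log for slow requests:

#!/bin/bash
# Rough sketch only: step mds_cache_memory_limit down gradually on a live MDS.
# MDS_ID, TARGET, STEP and the sleep interval are placeholders to adjust.

MDS_ID=$(hostname -s)         # assumes the MDS name matches the short hostname
TARGET=$((4 * 1024 ** 3))     # example final limit: 4GiB -- pick your own
STEP=$((500 * 1024 ** 2))     # ~500MB reduction per injection

# Release the tcmalloc page heap freelist first (~10GB per your heap stats).
ceph tell mds.${MDS_ID} heap release

CURRENT=$(ceph daemon mds.${MDS_ID} config get mds_cache_memory_limit | jq -r '.mds_cache_memory_limit')

while [ "${CURRENT}" -gt "${TARGET}" ]; do
    CURRENT=$((CURRENT - STEP))
    if [ "${CURRENT}" -lt "${TARGET}" ]; then
        CURRENT=${TARGET}
    fi
    echo "injecting mds_cache_memory_limit=${CURRENT}"
    ceph tell mds.${MDS_ID} injectargs "--mds_cache_memory_limit=${CURRENT}"
    # Give the cache time to trim and settle before the next step; check
    # cache status and watch for slow requests before continuing.
    sleep 300
    ceph daemon mds.${MDS_ID} cache status
done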