MDS cache always increasing

Sake Ceph <ceph@xxxxxxxxxxx> · Fri, 30 Aug 2024 15:21:46 +0200 (CEST)

I hope someone can help us with a MDS caching problem.

Ceph version 18.2.4 with cephadm container deployment.

Question 1:
It's not clear to me how much cache/memory should be allocated for the MDS. Is this based on the number of open files, caps, or something else?

Question 2/Problem:
At the moment we have MDS nodes with 32 GB of memory and a configured cache limit of 20 GB. There are 4 MDS nodes: 2 active and 2 in standby-replay mode (with max_mds set to 2, of course). We pinned the top-level directories to specific ranks, so the balancer isn't used.
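For reference, here is a rough sketch of the commands that correspond to this kind of setup. It only builds and prints the command strings, it does not run them. The file system name "cephfs" and the mount point "/mnt/cephfs" are placeholders, not our real names, and I'm assuming the 20 GB limit is meant as 20 GiB.

```python
# Sketch only: builds the CLI strings for the setup described above.
# Assumptions (not from our real cluster): fs name "cephfs",
# mount point "/mnt/cephfs", and 20 GB interpreted as 20 GiB.

GIB = 1024 ** 3


def cache_limit_bytes(gib):
    """mds_cache_memory_limit is given in bytes."""
    return gib * GIB


def config_commands(limit_gib, fs_name, max_mds):
    """Commands for the cache limit, active MDS count and standby-replay."""
    return [
        f"ceph config set mds mds_cache_memory_limit {cache_limit_bytes(limit_gib)}",
        f"ceph fs set {fs_name} max_mds {max_mds}",
        f"ceph fs set {fs_name} allow_standby_replay true",
    ]


def pin_command(directory, rank):
    """Pin a directory to an MDS rank via the ceph.dir.pin xattr."""
    return f"setfattr -n ceph.dir.pin -v {rank} {directory}"


if __name__ == "__main__":
    for cmd in config_commands(20, "cephfs", 2):
        print(cmd)
    # Hypothetical path; the rank is just an example value.
    print(pin_command("/mnt/cephfs/app2", 1))
```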
Memory usage mostly keeps increasing, with an occasional small dip when a couple of hundred MB are freed. Once all the memory is consumed, swap gets used. This frees a couple of hundred MB of memory, but not much. When the swap eventually runs out and the memory is full, the MDS service stops and the cluster logs show:
1. no beacon from mds
2. marking mds up:active laggy
3. replacing mds
4. MDS daemon <daemon> is removed because it is dead or otherwise unavailable

For example: we have the top folders app2 and app4, which are pinned to rank 1. Folder app2 is always accessed by 4 clients (application servers), and the same goes for folder app4. Folder app2 is 3 times larger than folder app4 (last time I checked, I don't want to run a du at the moment).
After a couple of hours the memory usage of the MDS server stays around 18% (Grafana shows a flatline for 7 hours).
At night a 9th client connects and first makes a backup with rsync of the latest snapshot folder of app2, and afterwards does the same for folder app4 after a 5-minute pause.
When the backup starts, the memory usage increases to 70% and stays at 70% after the backup of app2 is completed. 5 minutes later the memory usage starts increasing again when the backup of folder app4 starts. When that backup is done, it's at 78% and stays there for the rest of the day.
Why isn't the memory usage decreasing after the rsync is completed?
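To put those percentages in absolute terms, here is a quick calculation. It assumes the Grafana percentages refer to the full 32 GB of memory on the MDS node (host-level usage, not only the MDS process), which may not be exactly what the dashboard measures.

```python
# Rough conversion of the observed percentages to GB.
# Assumption: percentages are relative to the node's full 32 GB.

NODE_MEMORY_GB = 32
CACHE_LIMIT_GB = 20

observed_percent = {
    "steady state (flat for 7 hours)": 18,
    "after the app2 backup": 70,
    "after the app4 backup": 78,
}


def used_gb(percent, total_gb=NODE_MEMORY_GB):
    """Convert a usage percentage to GB of node memory."""
    return total_gb * percent / 100


if __name__ == "__main__":
    for label, pct in observed_percent.items():
        gb = used_gb(pct)
        relation = "above" if gb > CACHE_LIMIT_GB else "below"
        print(f"{label}: {pct}% = {gb:.2f} GB "
              f"({relation} the {CACHE_LIMIT_GB} GB cache limit)")
```

Under that assumption the node sits at roughly 5.8 GB before the backup, 22.4 GB after the app2 backup and about 25 GB after the app4 backup, so both post-backup values are above the configured 20 GB cache limit.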

Is there a memory leak in the MDS service?

PS: I have some small log files and Grafana screenshots, but I'm not sure how to share them.

Kind regards,
Sake
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx