On Fri, Aug 30, 2024 at 9:22 PM Sake Ceph <ceph@xxxxxxxxxxx> wrote:
>
> I hope someone can help us with an MDS caching problem.
>
> Ceph version 18.2.4 with cephadm container deployment.
>
> Question 1:
> It's not clear to me how much cache/memory you should allocate for the MDS. Is this based on the number of open files, caps, or something else?
>
> Question 2/Problem:
> At the moment we have MDS nodes with 32 GB of memory and a configured cache limit of 20 GB. There are 4 MDS nodes: 2 active and 2 in standby-replay mode (with max_mds set to 2, of course). We pinned the top directories to specific ranks, so the balancer isn't used.
> Memory usage is for the most part increasing, with the occasional small dip of a couple hundred MB freed. After all the memory is consumed, swap gets used. That frees a couple hundred MB, but not much. When the swap eventually runs out and memory is full, the MDS service stops and the cluster logs show:
> 1. no beacon from mds
> 2. marking mds up:active laggy
> 3. replacing mds
> 4. MDS daemon <daemon> is removed because it is dead or otherwise unavailable
>
> For example: we have the top folders app2 and app4, which are pinned to rank 1. Folder app2 is always accessed by 4 clients (application servers), and the same goes for folder app4. Folder app2 is 3 times larger than folder app4 (as of the last time I checked; I don't want to run a du at the moment).
> After a couple of hours the memory usage of the MDS server settles around 18% (Grafana shows a flatline for 7 hours).
> At night a 9th client connects and first makes an rsync backup of the latest snapshot folder of app2; afterwards the same happens for folder app4, with a 5-minute pause in between.
> When the backup starts, memory usage increases to 70% and stays at 70% after the backup of app2 is completed. 5 minutes later it starts increasing again with the start of the backup of folder app4. When that backup is done, usage is at 78% and stays there for the rest of the day.
> Why isn't the memory usage decreasing after the rsync is completed?
>
> Is there a memory leak in the MDS service?
>
> P.S. I have some small log files/Grafana screenshots; I'm not sure how to share them.
>
> Kind regards,
> Sake

Hello Sake,

There was a known memory leak in MDS standby-replay: https://tracker.ceph.com/issues/48673. However, it is supposedly fixed in version 18.2.4, which you are running. I don't have any non-test cluster on Reef yet, so I can't tell for sure whether this is true. Therefore, what you see might be a different issue.

Could you please disable standby-replay and retest whether the memory leak still exists? If it does, that would be sufficient proof that it is not the same issue:

ceph fs set cephfs allow_standby_replay 0

--
Alexander Patrakov
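
P.S. While you retest, it may also help to watch where the memory actually goes, so the result is more than a yes/no. A minimal set of commands worth running against the affected rank (substitute your actual daemon name for the mds.<name> placeholder; the heap commands assume the daemon is built with tcmalloc, which the stock packages normally are):

    # How much of mds_cache_memory_limit the cache itself accounts for
    ceph tell mds.<name> cache status

    # Per-client sessions; the num_caps field shows whether the backup
    # client is still holding a large number of caps after rsync finishes,
    # which would keep the corresponding inodes pinned in the MDS
    ceph tell mds.<name> session ls

    # Allocator-level view of the heap, then ask tcmalloc to return freed
    # pages to the OS; if RSS drops a lot after the release, the growth is
    # unreturned/fragmented allocator memory rather than a leak
    ceph tell mds.<name> heap stats
    ceph tell mds.<name> heap release

If cache status stays near your 20 GB limit while RSS keeps climbing, the overhead is outside the cache accounting, which would be useful data to attach to a tracker ticket along with your Grafana screenshots.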