On Fri, Aug 30, 2024 at 9:22 PM Sake Ceph <ceph@xxxxxxxxxxx> wrote:
>
> I hope someone can help us with an MDS caching problem.
>
> Ceph version 18.2.4 with cephadm container deployment.
>
> Question 1:
> It's not clear to me how much cache/memory you should allocate for the MDS. Is this based on the number of open files, caps, or something else?
>
> Question 2/Problem:
> At the moment we have MDS nodes with 32 GB of memory and a configured cache limit of 20 GB. There are 4 MDS nodes: 2 active and 2 in standby-replay mode (with max_mds set to 2, of course). We pinned the top directories to specific ranks, so the balancer isn't used.
> Memory usage is for the most part increasing, with the occasional small dip of a couple hundred MB freed. After all the memory is consumed, swap gets used. That frees a couple hundred MB, but not much. When the swap eventually runs out and memory is full, the MDS service stops and the cluster logs show:
> 1. no beacon from mds
> 2. marking mds up:active laggy
> 3. replacing mds
> 4. MDS daemon <daemon> is removed because it is dead or otherwise unavailable
>
> For example: we have the top folders app2 and app4, which are pinned to rank 1. Folder app2 is always accessed by 4 clients (application servers), and the same goes for folder app4. Folder app2 is 3 times larger than folder app4 (as of the last time I checked; I don't want to run a du at the moment).
> After a couple of hours the memory usage of the MDS server settles around 18% (Grafana shows a flatline for 7 hours).
> At night a 9th client connects and first makes an rsync backup of the latest snapshot folder of app2; afterwards the same happens for folder app4, with a 5-minute pause in between.
> When the backup starts, memory usage increases to 70% and stays at 70% after the backup of app2 is completed. 5 minutes later it starts increasing again with the start of the backup of folder app4. When that backup is done, usage is at 78% and stays there for the rest of the day.
> Why isn't the memory usage decreasing after the rsync is completed?
>
> Is there a memory leak in the MDS service?
>
> P.S. I have some small log files/Grafana screenshots; I'm not sure how to share them.
>
> Kind regards,
> Sake

Hello Sake,

There was a known memory leak in MDS standby-replay: https://tracker.ceph.com/issues/48673. However, it is supposedly fixed in version 18.2.4, which you are running. I don't have any non-test cluster on Reef yet, so I can't tell for sure whether this is true. Therefore, what you see might be a different issue.

Could you please disable standby-replay and retest whether the memory leak still exists? If it does, that would be sufficient proof that it is not the same issue:

ceph fs set cephfs allow_standby_replay 0

--
Alexander Patrakov
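
P.S. While you retest, it may also help to watch where the memory actually goes, so the result is more than a yes/no. A minimal set of commands worth running against the affected rank (substitute your actual daemon name for the mds.<name> placeholder; the heap commands assume the daemon is built with tcmalloc, which the stock packages normally are):

    # How much of mds_cache_memory_limit the cache itself accounts for
    ceph tell mds.<name> cache status

    # Per-client sessions; the num_caps field shows whether the backup
    # client is still holding a large number of caps after rsync finishes,
    # which would keep the corresponding inodes pinned in the MDS
    ceph tell mds.<name> session ls

    # Allocator-level view of the heap, then ask tcmalloc to return freed
    # pages to the OS; if RSS drops a lot after the release, the growth is
    # unreturned/fragmented allocator memory rather than a leak
    ceph tell mds.<name> heap stats
    ceph tell mds.<name> heap release

If cache status stays near your 20 GB limit while RSS keeps climbing, the overhead is outside the cache accounting, which would be useful data to attach to a tracker ticket along with your Grafana screenshots.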