You could try some of the steps here, Frank:
https://docs.ceph.com/en/quincy/cephfs/troubleshooting/#avoiding-recovery-roadblocks

mds_heartbeat_reset_grace is probably the only one really relevant to your
scenario.

On Fri, Jan 10, 2025 at 1:30 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> we seem to have a serious issue with our file system, ceph version is
> pacific latest. After a large cleanup operation we had an MDS rank with
> 100Mio stray entries (yes, one hundred million). Today we restarted this
> daemon, which cleans up the stray entries. It seems that this leads to a
> restart loop due to OOM. The rank becomes active and then starts pulling in
> DNS and INOS entries until all memory is exhausted.
>
> I have no idea if there is at least progress removing the stray items or
> if it starts from scratch every time. If it needs to pull as many DNS/INOS
> into cache as there are stray items, we don't have a server at hand with
> enough RAM.
>
> Q1: Is the MDS at least making progress in every restart iteration?
> Q2: If not, how do we get this rank up again?
> Q3: If we can't get this rank up soon, can we at least move directories
> away from this rank by pinning it to another rank?
>
> Currently, the rank in question reports .mds_cache.num_strays=0 in perf
> dump.
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
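
On Q1, one rough way to see whether anything changes between restart iterations is to save the perf dump output at two points in time and compare the stray-related counters in the mds_cache section. The following is only a minimal sketch under assumptions: the helper names are made up for illustration, the two samples are hypothetical values rather than output from this cluster, and it assumes the dump was saved as JSON text (for example from "ceph daemon mds.<name> perf dump" on the MDS host). It does not tell you by itself whether the MDS resumes or starts over.

```python
import json


def stray_counters(perf_dump_text):
    """Return the counters in the mds_cache section whose name contains 'stray'."""
    dump = json.loads(perf_dump_text)
    cache = dump.get("mds_cache", {})
    return {name: value for name, value in cache.items() if "stray" in name}


def stray_delta(before, after):
    """Per-counter change between two samples (after minus before)."""
    return {name: after[name] - before[name] for name in before if name in after}


# Hypothetical samples, NOT taken from the cluster discussed in this thread.
sample_1 = '{"mds_cache": {"num_strays": 1000, "strays_enqueued": 50}, "mds_mem": {"ino": 7, "dn": 9}}'
sample_2 = '{"mds_cache": {"num_strays": 400, "strays_enqueued": 650}, "mds_mem": {"ino": 7, "dn": 9}}'

delta = stray_delta(stray_counters(sample_1), stray_counters(sample_2))
print(delta)  # {'num_strays': -600, 'strays_enqueued': 600}
```

A falling num_strays together with a rising strays_enqueued between two samples would suggest that purging is moving forward; identical values across several restarts would suggest it is not.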