Re: Help needed, ceph fs down due to large stray dir

Spencer Macphee <spencerofsydney@xxxxxxxxx> · Fri, 10 Jan 2025 18:07:12 -0400

You could try some of the steps here, Frank:
https://docs.ceph.com/en/quincy/cephfs/troubleshooting/#avoiding-recovery-roadblocks

mds_heartbeat_reset_grace is probably the only one really relevant to your
scenario.
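
A minimal sketch of how that option could be set, assuming the usual
`ceph config set` form. The value 3600 (seconds) and the helper names are
only illustrative assumptions, and it is worth checking with
`ceph config help mds_heartbeat_reset_grace` that the option exists in your
release before applying anything. The script only prints the command unless
dry_run is turned off.

```python
import shlex
import subprocess


def heartbeat_grace_cmd(seconds=3600):
    """Build the argv that raises mds_heartbeat_reset_grace for all MDS daemons.

    3600 is an assumed example value, not a recommendation from this thread.
    """
    if seconds <= 0:
        raise ValueError("grace period must be a positive number of seconds")
    return ["ceph", "config", "set", "mds",
            "mds_heartbeat_reset_grace", str(seconds)]


def apply(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing on the cluster is changed.
    apply(heartbeat_grace_cmd())
```

Remember to remove the override again (`ceph config rm mds
mds_heartbeat_reset_grace`) once the rank has recovered.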

On Fri, Jan 10, 2025 at 1:30 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> We seem to have a serious issue with our file system. The Ceph version is
> the latest Pacific release. After a large cleanup operation we had an MDS
> rank with 100 million stray entries. Today we restarted this daemon, which
> cleans up the stray entries. It seems that this leads to a restart loop due
> to OOM. The rank becomes active and then starts pulling in DNS and INOS
> entries until all memory is exhausted.
>
> I have no idea if there is at least progress removing the stray items or
> if it starts from scratch every time. If it needs to pull as many DNS/INOS
> into cache as there are stray items, we don't have a server at hand with
> enough RAM.
>
> Q1: Is the MDS at least making progress in every restart iteration?
> Q2: If not, how do we get this rank up again?
> Q3: If we can't get this rank up soon, can we at least move directories
> away from this rank by pinning them to another rank?
>
> Currently, the rank in question reports .mds_cache.num_strays=0 in perf
> dump.
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
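
Regarding Q1, one rough way to see whether anything changes between restarts
is to record the .mds_cache.num_strays counter from the quoted message each
time the rank comes up. Below is a small sketch that pulls that field out of
perf dump JSON (for example the output of `ceph daemon mds.<name> perf dump`
saved to a file). The function name and the sample input are hypothetical;
only the mds_cache / num_strays path is taken from the message above.

```python
import json


def num_strays(perf_dump_text):
    """Return mds_cache.num_strays from MDS perf dump JSON, or None if absent."""
    data = json.loads(perf_dump_text)
    return data.get("mds_cache", {}).get("num_strays")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed perf dump; a real one has many more fields.
    sample = '{"mds_cache": {"num_strays": 0}}'
    print(num_strays(sample))
```

A counter that stays at 0 while the daemon is still loading its cache may
just mean the strays have not been counted yet, so treat a single reading
with caution.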