Re: Help needed, ceph fs down due to large stray dir

Hi Patrick and others,

thanks for your fast reply. The problem we are in comes from a forward scrub whose memory use ballooned, and the overuse did not go away even after aborting the scrub. The "official" way to evaluate the strays, which I got from Neha, was to restart the rank.
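
For reference, the scrub was aborted and the rank restarted with commands along these lines (file system name and rank are placeholders for our actual ones):

    ceph tell mds.<fs_name>:0 scrub abort
    ceph mds fail <fs_name>:0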

I did not expect that the MDS needs to load the entire stray buckets into RAM just to process them. I had assumed this situation was considered during development.

To answer your question about when the problem occurs: yes, the MDS goes active and then starts loading an incredible number of entries.
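
For completeness, we are watching it pull entries in with something like the following (the daemon name is a placeholder for our actual one):

    ceph daemon mds.<daemon_name> cache status
    ceph daemon mds.<daemon_name> perf dump mds_mem

where the mds_mem section contains the dn/ino counters.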

I can go and give it swap, but I might try with RAM only first (we have 512G machines; I just need to stop the OSDs on the server).
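
Roughly, the plan would be something like this (the swap size and path are placeholders, not tested values):

    systemctl stop ceph-osd.target    # free the RAM held by this host's OSDs
    fallocate -l 256G /swapfile       # only if RAM alone turns out not to be enough
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile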

I will report back what happens.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: Friday, January 10, 2025 6:57 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Help needed, ceph fs down due to large stray dir

Hi Frank,

On Fri, Jan 10, 2025 at 12:31 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> we seem to have a serious issue with our file system; the Ceph version is the latest Pacific. After a large cleanup operation we had an MDS rank with 100 million stray entries (yes, one hundred million). Today we restarted this daemon, which cleans up the stray entries.

... why would you restart the daemon? I can't stress this question
enough. Usually when CephFS has a "meltdown", the trigger was "I
restarted the MDS" hoping that "X relatively minor problem" would go
away.

> It seems that this leads to a restart loop due to OOM. The rank becomes active and then starts pulling in DNS and INOS entries until all memory is exhausted.
>
> I have no idea if there is at least progress removing the stray items or if it starts from scratch every time. If it needs to pull as many DNS/INOS into cache as there are stray items, we don't have a server at hand with enough RAM.

Some strays may not be eligible for removal due to hard links or snapshots.

> Q1: Is the MDS at least making progress in every restart iteration?

Probably not.

> Q2: If not, how do we get this rank up again?

I don't see an easy way to circumvent this problem with any type of
hacks/configs. One option you have is to allocate a suitably large
swap file for the MDS node to see if it can chew through the stray
directories. (More RAM would be better...)

> Q3: If we can't get this rank up soon, can we at least move directories away from this rank by pinning it to another rank?

Afraid not. You cannot migrate strays and it wouldn't take effect in
time anyway.

> Currently, the rank in question reports .mds_cache.num_strays=0 in perf dump.

That's probably out-of-date.
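
If you want to re-check it live, something like this should work (the daemon name is a placeholder):

    ceph daemon mds.<daemon_name> perf dump mds_cache | grep num_strays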

Checking: this MDS runs out of memory shortly after becoming active, right?

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



