Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

I guess the mailing list didn't preserve the embedded image. Here's an Imgur link: https://imgur.com/a/WSmAOaG

I checked the logs as far back as we have them. The issue started appearing only after my last Ceph upgrade on 2 May, which introduced the new corruption assertion.


On 06/06/2023 09:16, Janek Bevendorff wrote:
I checked our Prometheus logs, and the number of log events on individual MONs does indeed start to increase dramatically at seemingly random times. I attached a picture of the curves.

The first incident you see there was when our metadata store filled up entirely. The second, smaller one was the more controlled fill-up. The last instance, with only one runaway MDS, is what I have just reported.

My unqualified wild guess is that the new safeguard preventing the MDS from committing corrupt dentries is holding up the queue, so events suddenly start piling up until the store is full.




On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also grown slowly from 900MB to 5.4GB. But we also have a few remapped PGs right now; I'm not sure whether that has an influence.
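For anyone who wants to check the same thing on their cluster, the store size can be watched roughly like this (the path assumes the default non-containerized layout, and mon.a is just an example):

du -sh /var/lib/ceph/mon/*/store.db    # store.db size per MON on each monitor host
ceph tell mon.a compact                # manual compaction, in case the store only grows and never shrinks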


On 05/06/2023 17:48, Janek Bevendorff wrote:
Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly corrupt dentry" bug [1]. At first I couldn't see any particular reason for the SIGABRT, but then I noticed one of those awfully familiar stack traces.

I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM each (just to make sure memory isn't the issue) and then let them do their thing. The routine goes as follows: both replay the journal, then rank 4 goes into the "resolve" state, but as soon as rank 3 also starts resolving, they both crash.
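In case it isn't clear what I mean by watching them cycle through the states, something like this is enough to follow along (no special tooling involved):

watch -n 5 ceph fs status         # shows the state of each rank (replay, resolve, clientreplay, active, ...)
ceph health detail | grep -i mds  # plus any MDS-related warnings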

Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" and "clientreplay" took forever. I uploaded a compressed log of rank 3 using ceph-post-file [2]. It's a log of several crash cycles, including the final successful attempt after changing the settings. The log decompresses to 815MB. I didn't censor any paths; they are not super-secret, but please don't share the log.

While I was writing this, the metadata pool size went back down from 6TiB to 440GiB. I am starting to think that the fill-ups may also be connected to the corruption issue. I also noticed that ranks 3 and 4 always have huge journals. An inspection using cephfs-journal-tool takes forever and consumes 50GB of memory in the process; listing the events in the journal is impossible without running out of RAM. Ranks 0, 1, and 2 don't have this problem, and neither did ranks 3 and 4 before the fill-ups started happening.
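For reference, the journal inspection I'm talking about is along these lines ("cephfs" stands in for our file system name):

cephfs-journal-tool --rank=cephfs:3 journal inspect    # integrity check; already takes very long and ~50GB of RAM here
cephfs-journal-tool --rank=cephfs:3 event get summary  # per-type event counts; a full "event get list" runs out of memory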

Hope that helps with getting to the bottom of this. I have reset the guardrail settings in the meantime.
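For completeness, resetting the guardrails is just a matter of removing the overrides again (the defaults are true):

ceph config rm mds mds_abort_on_newly_corrupt_dentry
ceph config rm mds mds_go_bad_corrupt_dentry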

Cheers
Janek


[1] "Newly corrupt dentry" ML link: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3

[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again: the MDS were constantly reporting slow metadata I/O and the pool was slowly growing. Hence I restarted the MDS, and now ranks 4 and 5 don't come up again.

Every time, they get to the resolve stage, then crash with a SIGABRT and no error message (not even at debug_mds = 20). Any idea what the reason could be? I checked whether they have enough RAM, which seems to be the case (unless they try to allocate tens of GB in a single allocation).
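In case anyone wants to look at the aborts themselves: provided the crash module picked them up, the backtraces should be visible via the crash list (the ID below is a placeholder):

ceph crash ls                 # recent daemon crashes
ceph crash info <crash-id>    # full metadata and backtrace for a single crash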

Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:
Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn’t really an option for our directory structure, so we’re using ephemeral pins.

Janek


On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster@xxxxxxxxx> wrote:

Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
<janek.bevendorff@xxxxxxxxxxxxx> wrote:
I checked our logs from yesterday: the PG scaling only started today, perhaps triggered by the snapshot trimming. I disabled it, but that didn't change anything.
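For reference, disabling it amounts to roughly the following (the pool name is from the ceph osd pool ls detail output quoted below; pinning pg_num to its current value is just an example):

ceph osd pool set cephfs.storage.meta pg_autoscale_mode off
ceph osd pool set cephfs.storage.meta pg_num 495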

What did change something was restarting the MDS one by one; they had got far behind with trimming their caches and had a bunch of stuck ops. After restarting them, the pool size quickly decreased to 600GiB. I noticed the same behaviour yesterday, though it was more extreme then: restarting the MDS took about an hour and I had to increase the heartbeat timeout. This time it took only half a minute per MDS, probably because things weren't as extreme yet and I had reduced the maximum cache size. It still looks like a bug to me.
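In case it's unclear what restarting them one by one means in practice: it is just a rolling restart of the daemons, waiting for each rank to become active again before touching the next one (the cephadm-style names below are placeholders; classic deployments would use systemctl instead, and the beacon grace is one of the timeouts that may need raising while an MDS takes long to come back):

ceph orch ps --daemon-type mds                 # find the daemon names
ceph orch daemon restart mds.<fs>.<host>.<id>  # restart one at a time
systemctl restart ceph-mds@<id>                # equivalent on non-cephadm hosts
ceph config set global mds_beacon_grace 120    # example value for the heartbeat/beacon timeout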


On 31/05/2023 11:18, Janek Bevendorff wrote:
Another thing I just noticed is that the auto-scaler is trying to scale the pool down to 128 PGs. That could also result in large fluctuations, but this big?? In any case, it looks like a bug to me. Whatever is happening here, there should be safeguards with regard to the pool's capacity.

Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
expected_num_objects 3000000 recovery_op_priority 5 recovery_priority
2 application cephfs

Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:
Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:
Hi,

Perhaps this is a known issue and I was simply too dumb to find it, but we are having problems with our CephFS metadata pool filling up overnight.

Our cluster has a small SSD pool of around 15TB which hosts our CephFS metadata pool. Usually, that's more than enough. The normal size of the pool ranges between 200 and 800GiB (which is quite a lot of fluctuation already). Yesterday, the pool suddenly filled up entirely, and the only way to fix it was to add more capacity. I increased the pool size to 18TB by adding more SSDs, which resolved the problem. After a couple of hours of reshuffling, the pool size finally went back to 230GiB.

But then we had another fill-up tonight to 7.6TiB. Luckily, I had adjusted the weights so that not all disks could fill up entirely like last time, so it ended there.
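Something along these lines is what I mean by adjusting the weights (OSD ID and value are examples only):

ceph osd df                  # check per-OSD utilization
ceph osd reweight 42 0.85    # temporarily shift data away from a nearly full OSD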

I wasn't really able to identify the problem yesterday, but under the more controlled scenario today, I was able to check the MDS logs at debug_mds=10, and to me it seems like the problem is caused by snapshot trimming. The logs contain a lot of snapshot-related messages for paths that haven't been touched in a long time. The messages all look something like this:

May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 ...

May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
7f0e6a6ca700 10 mds.0.cache | |______ 3     rep [dir
0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
tempexporting=0 0x5607759d9600]

May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
7f0e6a6ca700 10 mds.0.cache | | |____ 4     rep [dir
0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0
request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0
waiter=0 authpin=0 tempexporting=0 0x56034ed25200]

May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 'monthly_20230201'
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24
0x10000000000 'monthly_20230401'  ...) len=384

May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
2023-05-31T09:25:36.076+0200 7f0e6becd700 10
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving
realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101'  ...)

The daily_*, monthly_*, etc. names are the names of our regular snapshots.
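These are ordinary CephFS snapshots living in the .snap directories, so on a mounted client they can be listed like this (the mount point is an example):

ls /mnt/cephfs/storage/.snap    # shows which daily_* / monthly_* snapshots still exist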

I posted a larger log file snippet using ceph-post-file with the ID:
da0eb93d-f340-4457-8a3f-434e8ef37d36

Is it possible that the MDS are trimming old snapshots without
taking care not to fill up the entire metadata pool?

Cheers
Janek

--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



