Hi Janek,

A few questions and suggestions:

- Do you have multi-active MDS? In my experience back in Nautilus, if something went wrong with MDS exports between MDSes, the MDS log/journal could grow unbounded, like you observed, until that export work was done. Static pinning could help if you are not using it already.
- You definitely should disable the pg autoscaling on the MDS metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. (Example commands for both of these are at the very bottom of this mail.)
- Which version are you running? You said Nautilus but wrote 16.2.12, which is Pacific... If you're running Nautilus v14, then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging.

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com

On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> I checked our logs from yesterday: the PG scaling only started today, perhaps triggered by the snapshot trimming. I disabled it, but it didn't change anything.
>
> What did make a difference was restarting the MDSes one by one; they had fallen far behind with trimming their caches and had a bunch of stuck ops. After restarting them, the pool size quickly decreased to 600GiB. I noticed the same behaviour yesterday, though yesterday it was more extreme: restarting the MDSes took about an hour and I had to increase the heartbeat timeout. This time it took only half a minute per MDS, probably because it wasn't that extreme yet and I had reduced the maximum cache size. Still looks like a bug to me.
>
>
> On 31/05/2023 11:18, Janek Bevendorff wrote:
> > Another thing I just noticed is that the auto-scaler is trying to scale the pool down to 128 PGs. That could also result in large fluctuations, but this big?? In any case, it looks like a bug to me. Whatever is happening here, there should be safeguards with regard to the pool's capacity.
> >
> > Here's the current state of the pool in ceph osd pool ls detail:
> >
> > pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 expected_num_objects 3000000 recovery_op_priority 5 recovery_priority 2 application cephfs
> >
> > Janek
> >
> >
> > On 31/05/2023 10:10, Janek Bevendorff wrote:
> >> Forgot to add: We are still on Nautilus (16.2.12).
> >>
> >>
> >> On 31/05/2023 09:53, Janek Bevendorff wrote:
> >>> Hi,
> >>>
> >>> Perhaps this is a known issue and I was simply too dumb to find it, but we are having problems with our CephFS metadata pool filling up overnight.
> >>>
> >>> Our cluster has a small SSD pool of around 15TB which hosts our CephFS metadata pool. Usually, that's more than enough. The normal size of the pool ranges between 200 and 800GiB (which is quite a lot of fluctuation already). Yesterday, the pool suddenly filled up entirely and the only way to fix it was to add more capacity. I increased the pool size to 18TB by adding more SSDs, which resolved the problem. After a couple of hours of reshuffling, the pool size finally went back to 230GiB.
> >>>
> >>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had adjusted the weights so that not all disks could fill up entirely like last time, so it ended there.
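As a general reference, the current usage of the metadata pool and what the autoscaler wants to do with it can be checked with something like the following (the pool name is the one from the ceph osd pool ls detail output quoted above -- adjust it to your own setup):

    # Current usage of the CephFS metadata pool
    ceph df detail | grep cephfs.storage.meta

    # What the autoscaler currently targets for the pool (PG_NUM vs. NEW PG_NUM)
    ceph osd pool autoscale-status | grep cephfs.storage.meta

    # pg_num / pg_num_target / autoscale_mode as shown in the quoted output
    ceph osd pool ls detail | grep cephfs.storage.meta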
> >>>
> >>> I wasn't really able to identify the problem yesterday, but under the more controllable scenario today, I could check the MDS logs at debug_mds=10, and to me it seems like the problem is caused by snapshot trimming. The logs contain a lot of snapshot-related messages for paths that haven't been touched in a long time. The messages all look something like this:
> >>>
> >>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 ...
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x5607759d9600]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 tempexporting=0 0x56034ed25200]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 'monthly_20230201' 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 0x10000000000 'monthly_20230401' ...) len=384
> >>>
> >>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' ...)
> >>>
> >>> The daily_*, monthly_*, etc. names are the names of our regular snapshots.
> >>>
> >>> I posted a larger log file snippet using ceph-post-file with the ID: da0eb93d-f340-4457-8a3f-434e8ef37d36
> >>>
> >>> Is it possible that the MDS are trimming old snapshots without taking care not to fill up the entire metadata pool?
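For reference, the debug_mds log level used for the snippets above can be raised and later restored cluster-wide with something like this (level 10 is very verbose, so only keep it for a short debugging window):

    # Raise the log level on all MDS daemons via the centralized config
    ceph config set mds debug_mds 10

    # ... reproduce the problem and collect the logs ...

    # Remove the override so the MDSes fall back to the default level
    ceph config rm mds debug_mds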
> >>>
> >>> Cheers
> >>> Janek
>
> --
>
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
>
> Phone: +49 3643 58 3577
> www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
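For completeness, the two changes suggested at the top of this mail can be applied along these lines. The pool name is the one from this thread; the mount point, directory names, and MDS ranks are only placeholders:

    # 1) Turn off the PG autoscaler for the CephFS metadata pool
    ceph osd pool set cephfs.storage.meta pg_autoscale_mode off

    # Optionally make "off" the default for newly created pools as well
    ceph config set global osd_pool_default_pg_autoscale_mode off

    # 2) Statically pin top-level directories to fixed MDS ranks instead of
    #    relying on the dynamic balancer (run against a mounted CephFS;
    #    a value of -1 removes the pin again)
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home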