Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn’t really an option for our directory structure, so we’re using ephemeral pins.
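
In case the details matter: the ephemeral pinning is just the distributed-pin xattr on our top-level directories, roughly like this (the mount point below is a placeholder, and depending on the release the mds_export_ephemeral_distributed option may also need to be enabled):

    # distribute the immediate children of this directory across the active MDS ranks
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/storage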

Janek


> On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster@xxxxxxxxx> wrote:
> 
> Hi Janek,
> 
> A few questions and suggestions:
> - Do you have multi-active MDS? In my experience back in Nautilus, if
> something went wrong with MDS exports between MDSes, the MDS
> log/journal could grow unbounded like you observed until that export
> work was done. Static pinning could help if you are not using it
> already.
> - You definitely should disable the pg autoscaling on the mds metadata
> pool (and other pools imho) -- decide the correct number of PGs for
> your pools and leave it.
> - Which version are you running? You said Nautilus but wrote 16.2.12,
> which is Pacific... If you're running Nautilus (v14), then I recommend
> disabling PG autoscaling completely -- IIRC it does not have a fix for
> the OSD memory growth "pg dup" issue, which can occur during PG
> splitting/merging.
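> 
> For reference, the pinning and autoscaler suggestions above boil down
> to something like the following (paths, pool names, and PG counts are
> placeholders):
> 
>     # pin a subtree statically to MDS rank 0 (use -v -1 to remove the pin)
>     setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/<subtree>
> 
>     # freeze the PG count of the metadata pool and stop the autoscaler
>     ceph osd pool set <cephfs-metadata-pool> pg_autoscale_mode off
>     ceph osd pool set <cephfs-metadata-pool> pg_num 512
> 
>     # confirm which release the daemons are actually running
>     ceph versions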
> 
> Cheers, Dan
> 
> ______________________________
> Clyso GmbH | https://www.clyso.com
> 
> 
> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>> 
>> I checked our logs from yesterday; the PG scaling only started today,
>> perhaps triggered by the snapshot trimming. I disabled it, but it
>> didn't change anything.
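>> 
>> For reference, checking and disabling it is roughly the following,
>> where cephfs.storage.meta is our metadata pool:
>> 
>>     ceph osd pool autoscale-status
>>     ceph osd pool set cephfs.storage.meta pg_autoscale_mode off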
>> 
>> What did change something was restarting the MDS daemons one by one;
>> they had fallen far behind with trimming their caches and had a bunch
>> of stuck ops. After restarting them, the pool size decreased quickly
>> to 600GiB. I noticed the same behaviour yesterday, though yesterday it
>> was more extreme: restarting the MDS daemons took about an hour and I
>> had to increase the heartbeat timeout. This time, it took only half a
>> minute per MDS, probably because it wasn't that extreme yet and I had
>> reduced the maximum cache size. It still looks like a bug to me.
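>> 
>> A sketch of what that amounts to in commands (daemon/unit names are
>> placeholders and the cache value is just an example):
>> 
>>     # check stuck ops and cache pressure on one MDS
>>     ceph daemon mds.<name> ops
>>     ceph daemon mds.<name> cache status
>>     # lower the cache limit, then restart one MDS at a time
>>     ceph config set mds mds_cache_memory_limit 8589934592
>>     systemctl restart ceph-mds@<name>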
>> 
>> 
>> On 31/05/2023 11:18, Janek Bevendorff wrote:
>>> Another thing I just noticed is that the auto-scaler is trying to
>>> scale the pool down to 128 PGs. That could also result in large
>>> fluctuations, but this big?? In any case, it looks like a bug to me.
>>> Whatever is happening here, there should be safeguards with regard to
>>> the pool's capacity.
>>> 
>>> Here's the current state of the pool in ceph osd pool ls detail:
>>> 
>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
>>> expected_num_objects 3000000 recovery_op_priority 5 recovery_priority
>>> 2 application cephfs
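>>> 
>>> For what it's worth, pg_num 495 / pgp_num 471 versus the *_target
>>> values of 128 means the merge down to 128 PGs is still in flight; it
>>> can be watched by simply re-running the command, e.g.:
>>> 
>>>     watch -n 30 'ceph osd pool ls detail | grep cephfs.storage.meta'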
>>> 
>>> Janek
>>> 
>>> 
>>> On 31/05/2023 10:10, Janek Bevendorff wrote:
>>>> Forgot to add: We are still on Nautilus (16.2.12).
>>>> 
>>>> 
>>>> On 31/05/2023 09:53, Janek Bevendorff wrote:
>>>>> Hi,
>>>>> 
>>>>> Perhaps this is a known issue and I was simply too dumb to find it,
>>>>> but we are having problems with our CephFS metadata pool filling up
>>>>> overnight.
>>>>> 
>>>>> Our cluster has a small SSD pool of around 15TB which hosts our
>>>>> CephFS metadata pool. Usually, that's more than enough. The normal
>>>>> size of the pool ranges between 200 and 800GiB (which is quite a lot
>>>>> of fluctuation already). Yesterday, the pool suddenly filled up
>>>>> entirely, and the only way to fix it was to add more capacity. I
>>>>> increased the pool size to 18TB by adding more SSDs and
>>>>> could resolve the problem. After a couple of hours of reshuffling,
>>>>> the pool size finally went back to 230GiB.
>>>>> 
>>>>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
>>>>> adjusted the weights so that not all disks could fill up entirely
>>>>> like last time, so it ended there.
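>>>>> 
>>>>> For illustration, capping individual OSDs like that can be done with
>>>>> a temporary reweight (the OSD ID and factor are placeholders):
>>>>> 
>>>>>     ceph osd reweight <osd-id> 0.9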
>>>>> 
>>>>> I wasn't really able to identify the problem yesterday, but under
>>>>> the more controllable scenario today, I could check the MDS logs at
>>>>> debug_mds=10 and to me it seems like the problem is caused by
>>>>> snapshot trimming. The logs contain a lot of snapshot-related
>>>>> messages for paths that haven't been touched in a long time. The
>>>>> messages all look something like this:
>>>>> 
>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
>>>>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
>>>>> cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1
>>>>> b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
>>>>> 0x10000000000 ...
>>>>> 
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
>>>>> 7f0e6a6ca700 10 mds.0.cache | |______ 3     rep [dir
>>>>> 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0
>>>>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
>>>>> tempexporting=0 0x5607759d9600]
>>>>> 
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
>>>>> 7f0e6a6ca700 10 mds.0.cache | | |____ 4     rep [dir
>>>>> 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0
>>>>> request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0
>>>>> waiter=0 authpin=0 tempexporting=0 0x56034ed25200]
>>>>> 
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
>>>>> 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
>>>>> snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
>>>>> 0x10000000000 'monthly_20230201'
>>>>> 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
>>>>> 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24
>>>>> 0x10000000000 'monthly_20230401'  ...) len=384
>>>>> 
>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
>>>>> 2023-05-31T09:25:36.076+0200 7f0e6becd700 10
>>>>> mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving
>>>>> realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
>>>>> snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
>>>>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
>>>>> 'monthly_20230101'  ...)
>>>>> 
>>>>> The daily_*, monthly_*, etc. names are the names of our regular
>>>>> snapshots.
>>>>> 
>>>>> I posted a larger log file snippet using ceph-post-file with the ID:
>>>>> da0eb93d-f340-4457-8a3f-434e8ef37d36
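>>>>> 
>>>>> For reference, the extra verbosity is just debug_mds raised at
>>>>> runtime, e.g. cluster-wide or via ceph tell for a single daemon
>>>>> (the daemon name is a placeholder):
>>>>> 
>>>>>     ceph config set mds debug_mds 10
>>>>>     ceph tell mds.<name> config set debug_mds 10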
>>>>> 
>>>>> Is it possible that the MDS are trimming old snapshots without
>>>>> taking care not to fill up the entire metadata pool?
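>>>>> 
>>>>> In case it is useful for debugging: the snapshots themselves are
>>>>> listed in the .snap directory, and PGs that are still busy trimming
>>>>> show up with a snaptrim state (the mount point is a placeholder):
>>>>> 
>>>>>     ls /mnt/cephfs/storage/.snap
>>>>>     ceph pg dump pgs_brief | grep -c snaptrim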
>>>>> 
>>>>> Cheers
>>>>> Janek
>> 
>> --
>> 
>> Bauhaus-Universität Weimar
>> Bauhausstr. 9a, R308
>> 99423 Weimar, Germany
>> 
>> Phone: +49 3643 58 3577
>> www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



