Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn't. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn't really an option for our directory structure, so we're using ephemeral pins.

Janek

> On 31. May 2023, at 18:44, Dan van der Ster <dan.vanderster@xxxxxxxxx> wrote:
>
> Hi Janek,
>
> A few questions and suggestions:
> - Do you have multi-active MDS? In my experience back in Nautilus, if something went wrong with MDS exports between MDSs, the MDS log/journal could grow unbounded, like you observed, until that export work was done. Static pinning could help if you are not using it already.
> - You should definitely disable PG autoscaling on the MDS metadata pool (and other pools, imho) -- decide the correct number of PGs for your pools and leave it.
> - Which version are you running? You said Nautilus but wrote 16.2.12, which is Pacific... If you're running Nautilus (v14), then I recommend disabling PG autoscaling completely -- IIRC it does not have a fix for the OSD memory growth ("pg dup") issue, which can occur during PG splitting/merging.
>
> Cheers, Dan
>
> ______________________________
> Clyso GmbH | https://www.clyso.com
>
>
> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>> I checked our logs from yesterday; the PG scaling only started today, perhaps triggered by the snapshot trimming. I disabled it, but it didn't change anything.
>>
>> What did change something was restarting the MDS one by one, which had got far behind with trimming their caches and had a bunch of stuck ops. After restarting them, the pool size decreased quickly to 600GiB. I noticed the same behaviour yesterday, though yesterday it was more extreme: restarting the MDS took about an hour and I had to increase the heartbeat timeout. This time, it took only half a minute per MDS, probably because it wasn't that extreme yet and I had reduced the maximum cache size. Still looks like a bug to me.
>>
>>
>> On 31/05/2023 11:18, Janek Bevendorff wrote:
>>> Another thing I just noticed is that the auto-scaler is trying to scale the pool down to 128 PGs. That could also result in large fluctuations, but this big?? In any case, it looks like a bug to me. Whatever is happening here, there should be safeguards with regard to the pool's capacity.
>>>
>>> Here's the current state of the pool in ceph osd pool ls detail:
>>>
>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 expected_num_objects 3000000 recovery_op_priority 5 recovery_priority 2 application cephfs
>>>
>>> Janek
>>>
>>>
>>> On 31/05/2023 10:10, Janek Bevendorff wrote:
>>>> Forgot to add: We are still on Nautilus (16.2.12).
>>>>
>>>>
>>>> On 31/05/2023 09:53, Janek Bevendorff wrote:
>>>>> Hi,
>>>>>
>>>>> Perhaps this is a known issue and I was simply too dumb to find it, but we are having problems with our CephFS metadata pool filling up overnight.
>>>>>
>>>>> Our cluster has a small SSD pool of around 15TB which hosts our CephFS metadata pool. Usually, that's more than enough. The normal size of the pool ranges between 200 and 800GiB (which is quite a lot of fluctuation already).
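As a concrete illustration of the pinning and autoscaler suggestions quoted above -- a rough sketch only; the pool name is the one from this thread, while the paths, the MDS rank, and the pg_num value are placeholders to adapt to your own cluster:

    # Turn off the autoscaler for the metadata pool and fix its PG count
    ceph osd pool set cephfs.storage.meta pg_autoscale_mode off
    ceph osd pool set cephfs.storage.meta pg_num 512   # pick a value that fits your cluster

    # Static pinning: bind a directory subtree to one MDS rank via an xattr
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/storage/some-subtree

    # Ephemeral distributed pinning (Pacific and later): spread the immediate
    # children of a directory across the active MDS ranks
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/storage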
>>>>> Yesterday, we suddenly had the pool fill up entirely, and the only way to fix it was to add more capacity. I increased the pool size to 18TB by adding more SSDs and could resolve the problem. After a couple of hours of reshuffling, the pool size finally went back to 230GiB.
>>>>>
>>>>> But then we had another fill-up tonight, to 7.6TiB. Luckily, I had adjusted the weights so that not all disks could fill up entirely like last time, so it ended there.
>>>>>
>>>>> I wasn't really able to identify the problem yesterday, but under the more controllable scenario today, I could check the MDS logs at debug_mds=10, and to me it seems like the problem is caused by snapshot trimming. The logs contain a lot of snapshot-related messages for paths that haven't been touched in a long time. The messages all look something like this:
>>>>>
>>>>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 ...
>>>>>
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x5607759d9600]
>>>>>
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 tempexporting=0 0x56034ed25200]
>>>>>
>>>>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 'monthly_20230201' 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 0x10000000000 'monthly_20230401' ...) len=384
>>>>>
>>>>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' ...)
>>>>>
>>>>> The daily_*, monthly_*, etc. names are the names of our regular snapshots.
>>>>>
>>>>> I posted a larger log file snippet using ceph-post-file with the ID: da0eb93d-f340-4457-8a3f-434e8ef37d36
>>>>>
>>>>> Is it possible that the MDS are trimming old snapshots without taking care not to fill up the entire metadata pool?
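For anyone retracing the steps described above, a minimal sketch of the commands involved -- <mds-name> is a placeholder, the values are only examples, and it is an assumption that mds_beacon_grace is the "heartbeat timeout" referred to:

    # Raise MDS debug logging to capture the snapshot/cache messages shown above
    ceph config set mds debug_mds 10

    # Check cache pressure and stuck operations on a single MDS
    ceph tell mds.<mds-name> cache status
    ceph tell mds.<mds-name> ops

    # Cap the MDS cache size (example: 16 GiB) and give the daemons a longer
    # grace period while they restart and trim their caches
    ceph config set mds mds_cache_memory_limit 17179869184
    ceph config set global mds_beacon_grace 300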
>>>>>
>>>>> Cheers
>>>>> Janek
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> --
>>
>> Bauhaus-Universität Weimar
>> Bauhausstr. 9a, R308
>> 99423 Weimar, Germany
>>
>> Phone: +49 3643 58 3577
>> www.webis.de
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
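For readers hitting the same symptom, a few commands that can show whether snapshot trimming and stray removal are what is churning the metadata pool -- again only a sketch, with <mds-name> as a placeholder; the PG state and counter names are given to the best of my knowledge:

    # Watch pool usage and the autoscaler's pg_num targets
    ceph df detail
    ceph osd pool autoscale-status

    # PGs in snaptrim / snaptrim_wait states indicate snapshot trimming in progress
    ceph pg dump pgs_brief | grep snaptrim

    # Size of the MDS stray queue (deleted/renamed entries awaiting purge)
    ceph tell mds.<mds-name> perf dump | grep num_strays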