Hi Janek,

A few questions and suggestions:

- Do you have multi-active MDS? In my experience back in Nautilus, if something went wrong with MDS exports between MDSes, the MDS log/journal could grow unbounded, like you observed, until that export work was done. Static pinning could help if you are not using it already.
- You definitely should disable the pg autoscaling on the MDS metadata pool (and other pools imho) -- decide the correct number of PGs for your pools and leave it. (Example commands for both of these are at the very bottom of this mail.)
- Which version are you running? You said Nautilus but wrote 16.2.12, which is Pacific... If you're running Nautilus v14, then I recommend disabling pg autoscaling completely -- IIRC it does not have a fix for the OSD memory growth "pg dup" issue which can occur during PG splitting/merging.

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com

On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
> I checked our logs from yesterday: the PG scaling only started today, perhaps triggered by the snapshot trimming. I disabled it, but it didn't change anything.
>
> What did make a difference was restarting the MDSes one by one; they had fallen far behind with trimming their caches and had a bunch of stuck ops. After restarting them, the pool size quickly decreased to 600GiB. I noticed the same behaviour yesterday, though yesterday it was more extreme: restarting the MDSes took about an hour and I had to increase the heartbeat timeout. This time it took only half a minute per MDS, probably because it wasn't that extreme yet and I had reduced the maximum cache size. Still looks like a bug to me.
>
>
> On 31/05/2023 11:18, Janek Bevendorff wrote:
> > Another thing I just noticed is that the auto-scaler is trying to scale the pool down to 128 PGs. That could also result in large fluctuations, but this big?? In any case, it looks like a bug to me. Whatever is happening here, there should be safeguards with regard to the pool's capacity.
> >
> > Here's the current state of the pool in ceph osd pool ls detail:
> >
> > pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 expected_num_objects 3000000 recovery_op_priority 5 recovery_priority 2 application cephfs
> >
> > Janek
> >
> >
> > On 31/05/2023 10:10, Janek Bevendorff wrote:
> >> Forgot to add: We are still on Nautilus (16.2.12).
> >>
> >>
> >> On 31/05/2023 09:53, Janek Bevendorff wrote:
> >>> Hi,
> >>>
> >>> Perhaps this is a known issue and I was simply too dumb to find it, but we are having problems with our CephFS metadata pool filling up overnight.
> >>>
> >>> Our cluster has a small SSD pool of around 15TB which hosts our CephFS metadata pool. Usually, that's more than enough. The normal size of the pool ranges between 200 and 800GiB (which is quite a lot of fluctuation already). Yesterday, the pool suddenly filled up entirely and the only way to fix it was to add more capacity. I increased the pool size to 18TB by adding more SSDs, which resolved the problem. After a couple of hours of reshuffling, the pool size finally went back to 230GiB.
> >>>
> >>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had adjusted the weights so that not all disks could fill up entirely like last time, so it ended there.
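As a general reference, the current usage of the metadata pool and what the autoscaler wants to do with it can be checked with something like the following (the pool name is the one from the ceph osd pool ls detail output quoted above -- adjust it to your own setup):

    # Current usage of the CephFS metadata pool
    ceph df detail | grep cephfs.storage.meta

    # What the autoscaler currently targets for the pool (PG_NUM vs. NEW PG_NUM)
    ceph osd pool autoscale-status | grep cephfs.storage.meta

    # pg_num / pg_num_target / autoscale_mode as shown in the quoted output
    ceph osd pool ls detail | grep cephfs.storage.meta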
> >>>
> >>> I wasn't really able to identify the problem yesterday, but under the more controllable scenario today, I could check the MDS logs at debug_mds=10, and to me it seems like the problem is caused by snapshot trimming. The logs contain a lot of snapshot-related messages for paths that haven't been touched in a long time. The messages all look something like this:
> >>>
> >>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 ...
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir 0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x5607759d9600]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir 0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 tempexporting=0 0x56034ed25200]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 7f0e6becd700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 0x10000000000 'monthly_20230201' 2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000 'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 0x10000000000 'monthly_20230401' ...) len=384
> >>>
> >>> May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 2023-05-31T09:25:36.076+0200 7f0e6becd700 10 mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201' 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000 'monthly_20230101' ...)
> >>>
> >>> The daily_*, monthly_*, etc. names are the names of our regular snapshots.
> >>>
> >>> I posted a larger log file snippet using ceph-post-file with the ID: da0eb93d-f340-4457-8a3f-434e8ef37d36
> >>>
> >>> Is it possible that the MDS are trimming old snapshots without taking care not to fill up the entire metadata pool?
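For reference, the debug_mds log level used for the snippets above can be raised and later restored cluster-wide with something like this (level 10 is very verbose, so only keep it for a short debugging window):

    # Raise the log level on all MDS daemons via the centralized config
    ceph config set mds debug_mds 10

    # ... reproduce the problem and collect the logs ...

    # Remove the override so the MDSes fall back to the default level
    ceph config rm mds debug_mds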
> >>>
> >>> Cheers
> >>> Janek
>
> --
>
> Bauhaus-Universität Weimar
> Bauhausstr. 9a, R308
> 99423 Weimar, Germany
>
> Phone: +49 3643 58 3577
> www.webis.de
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
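For completeness, the two changes suggested at the top of this mail can be applied along these lines. The pool name is the one from this thread; the mount point, directory names, and MDS ranks are only placeholders:

    # 1) Turn off the PG autoscaler for the CephFS metadata pool
    ceph osd pool set cephfs.storage.meta pg_autoscale_mode off

    # Optionally make "off" the default for newly created pools as well
    ceph config set global osd_pool_default_pg_autoscale_mode off

    # 2) Statically pin top-level directories to fixed MDS ranks instead of
    #    relying on the dynamic balancer (run against a mounted CephFS;
    #    a value of -1 removes the pin again)
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home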