Hi Nicola,

it's not noise. Even though the modules look disabled and the pool flags are set to false, they still linger in the background and interfere. See the recent thread
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/WST6K5A4UQGGISBFGJEZS4HFL2VVWW32/

With all the settings you already have, the last one would be

  ceph config set mgr target_max_misplaced_ratio 1

after which the balancer and scaling modules will just do what you tell them, assuming you know what you are doing. I restored the default behaviour of instant application of changes this way and don't have any problems with it.
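For reference, roughly what I would run (pool name taken from your output below; on Nautilus and later the pool details also list the *_target values, although the exact fields can differ between releases):

# ceph osd pool ls detail | grep wizard_data
shows pg_num/pgp_num together with pg_num_target/pgp_num_target, so you can watch the mgr stepping pgp_num towards the target.

# ceph config get mgr target_max_misplaced_ratio
the default is 0.05, i.e. the 5% threshold you see in the balancer output and the point at which the mgr bumps pgp_num again.

# ceph config set mgr target_max_misplaced_ratio 1
with this the mgr applies the whole pgp_num change in one go instead of a few PGs at a time.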
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>
Sent: 07 October 2022 17:16:49
To: Nicola Mori
Cc: ceph-users
Subject: Re: Iinfinite backfill loop + number of pgp groups stuck at wrong value

As of Nautilus, when you set pg_num it actually sets pg(p)_num_target internally, and then slowly increases (or decreases, if you're merging) pg_num and then pgp_num until it reaches the target. The amount of backfill scheduled into the system is controlled by target_max_misplaced_ratio.

Josh

On Fri, Oct 7, 2022 at 3:50 AM Nicola Mori <mori@xxxxxxxxxx> wrote:
>
> The situation resolved itself, since there probably was no error. I
> manually increased the number of PGs and PGPs to 128 some days ago, and
> the PGP count was being updated step by step. After a bump from
> 5% to 7% in the count of misplaced objects I noticed that the number of
> PGPs was updated to 126, and after a last bump it is now at 128 with
> ~4% of misplaced objects, currently decreasing.
> Sorry for the noise,
>
> Nicola
>
> On 07/10/22 09:15, Nicola Mori wrote:
> > Dear Ceph users,
> >
> > my cluster has been stuck for several days with some PGs backfilling. The
> > number of misplaced objects slowly decreases down to 5%, and at that
> > point jumps up again to about 7%, and so on. I found several possible
> > reasons for this behavior. One is related to the balancer, which anyway
> > I think is not operating:
> >
> > # ceph balancer status
> > {
> >     "active": false,
> >     "last_optimize_duration": "0:00:00.000938",
> >     "last_optimize_started": "Thu Oct 6 16:19:59 2022",
> >     "mode": "upmap",
> >     "optimize_result": "Too many objects (0.071539 > 0.050000) are misplaced; try again later",
> >     "plans": []
> > }
> >
> > (the last optimize result is from yesterday when I disabled it, and
> > since then the backfill loop has happened several times).
> > Another possible reason seems to be a mismatch between the PG and PGP numbers.
> > Indeed I found such a mismatch on one of my pools:
> >
> > # ceph osd pool get wizard_data pg_num
> > pg_num: 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > but I cannot fix it:
> >
> > # ceph osd pool set wizard_data pgp_num 128
> > set pool 3 pgp_num to 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > The autoscaler is off for that pool:
> >
> > POOL         SIZE   TARGET SIZE  RATE                RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> > wizard_data  8951G               1.3333333730697632  152.8T        0.0763                                 1.0   128                 off        False
> >
> > so I don't understand why the PGP number is stuck at 123.
> > Thanks in advance for any help,
> >
> > Nicola
>
> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> mori@xxxxxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx