As of Nautilus+, when you set pg_num, it actually internally sets pg(p)_num_target, and then slowly increases (or decreases, if you're merging) pg_num and then pgp_num until it reaches the target. The amount of backfill scheduled into the system is controlled by target_max_misplaced_ratio.

Josh

On Fri, Oct 7, 2022 at 3:50 AM Nicola Mori <mori@xxxxxxxxxx> wrote:
>
> The situation resolved by itself, since there was probably no error. I
> manually increased the number of PGs and PGPs to 128 some days ago, and
> the PGP count was being updated step by step. After a bump from 5% to 7%
> in the count of misplaced objects I noticed that the number of PGPs had
> been updated to 126, and after a last bump it is now at 128 with ~4% of
> misplaced objects, currently decreasing.
> Sorry for the noise,
>
> Nicola
>
> On 07/10/22 09:15, Nicola Mori wrote:
> > Dear Ceph users,
> >
> > my cluster has been stuck for several days with some PGs backfilling.
> > The number of misplaced objects slowly decreases down to 5%, and at
> > that point jumps up again to about 7%, and so on. I found several
> > possible reasons for this behavior. One is related to the balancer,
> > which in any case I think is not operating:
> >
> > # ceph balancer status
> > {
> >     "active": false,
> >     "last_optimize_duration": "0:00:00.000938",
> >     "last_optimize_started": "Thu Oct 6 16:19:59 2022",
> >     "mode": "upmap",
> >     "optimize_result": "Too many objects (0.071539 > 0.050000) are
> > misplaced; try again later",
> >     "plans": []
> > }
> >
> > (the last optimize result is from yesterday when I disabled it, and
> > since then the backfill loop has happened several times).
> > Another possible reason seems to be an imbalance of the PG and PGP
> > numbers. Indeed I found such an imbalance on one of my pools:
> >
> > # ceph osd pool get wizard_data pg_num
> > pg_num: 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > but I cannot fix it:
> >
> > # ceph osd pool set wizard_data pgp_num 128
> > set pool 3 pgp_num to 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > The autoscaler is off for that pool:
> >
> > POOL         SIZE   TARGET SIZE  RATE                RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> > wizard_data  8951G               1.3333333730697632  152.8T        0.0763                                 1.0   128                 off        False
> >
> > so I don't understand why the PGP number is stuck at 123.
> > Thanks in advance for any help,
> >
> > Nicola
>
> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> mori@xxxxxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
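The stepped behavior Josh describes can be pictured with a toy model. This is not Ceph source code: the function name, the step size of 1, and the hard-coded 0.05 default are illustrative assumptions; the real mgr logic chooses its own step sizes. It only shows why pgp_num can sit at an intermediate value (here 123) until the misplaced ratio falls back under target_max_misplaced_ratio, and then creep toward the target.

```python
# Toy sketch (not Ceph code): stepping pgp_num toward pgp_num_target,
# pausing while the misplaced ratio exceeds target_max_misplaced_ratio.

TARGET_MAX_MISPLACED_RATIO = 0.05  # mgr default in recent releases

def next_pgp_num(pgp_num, pgp_num_target, misplaced_ratio):
    """Return pgp_num for the next step, or the current value if
    backfill from earlier steps still exceeds the misplaced cap."""
    if misplaced_ratio >= TARGET_MAX_MISPLACED_RATIO:
        return pgp_num          # wait for backfill to drain first
    if pgp_num < pgp_num_target:
        return pgp_num + 1      # splitting: step upward
    if pgp_num > pgp_num_target:
        return pgp_num - 1      # merging: step downward
    return pgp_num              # already at target

# Matching the thread: stuck at 123 while ~7% is misplaced, then
# stepping toward 128 once the ratio drops below 5%.
print(next_pgp_num(123, 128, 0.07))  # -> 123 (holds)
print(next_pgp_num(123, 128, 0.04))  # -> 124 (steps)
```

This also explains the observation in the thread that `ceph osd pool set ... pgp_num 128` appeared to have no effect: the command updates the target, while the effective pgp_num follows later, throttled by the misplaced-ratio cap.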