Dear Ceph users,
my cluster has been stuck for several days with some PGs backfilling. The
number of misplaced objects slowly decreases down to 5%, then jumps back
up to about 7%, and so on. I found several possible reasons for this
behavior. One is related to the balancer, which however I believe is not
operating:
# ceph balancer status
{
    "active": false,
    "last_optimize_duration": "0:00:00.000938",
    "last_optimize_started": "Thu Oct 6 16:19:59 2022",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.071539 > 0.050000) are misplaced; try again later",
    "plans": []
}
(the last optimize result is from yesterday, when I disabled the
balancer; since then the backfill loop has happened several times).
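As a side note, the 0.050000 threshold in the optimize_result seems to
match the mgr's target_max_misplaced_ratio option (default 0.05, if I am
not mistaken), which can be read with:
# ceph config get mgr target_max_misplaced_ratio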
Another possible reason seems to be a mismatch between the pg_num and
pgp_num values. Indeed, I found such a mismatch on one of my pools:
# ceph osd pool get wizard_data pg_num
pg_num: 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123
but I cannot fix it:
# ceph osd pool set wizard_data pgp_num 128
set pool 3 pgp_num to 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123
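In case it helps with the diagnosis, my understanding (which may well be
wrong) is that recent releases apply pgp_num changes gradually through a
pgp_num_target, so the set command may have been accepted as a target
while the actual value catches up during backfill; if so, this should be
visible with:
# ceph osd pool ls detail | grep wizard_data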
The autoscaler is off for that pool, according to ceph osd pool
autoscale-status:
POOL         SIZE   TARGET SIZE  RATE                RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
wizard_data  8951G               1.3333333730697632  152.8T        0.0763                                 1.0   128                 off        False
so I don't understand why the PGP number is stuck at 123.
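For completeness, the per-pool setting can also be read directly (my
assumption being that the AUTOSCALE column reflects the pg_autoscale_mode
pool option):
# ceph osd pool get wizard_data pg_autoscale_mode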
Thanks in advance for any help,
Nicola