Dear Ceph users,
my cluster has been stuck for several days with some PGs backfilling. The
number of misplaced objects slowly decreases down to 5%, then jumps back
up to about 7%, and so on. I found several possible reasons for this
behavior. One is related to the balancer, which however I believe is not
operating:
# ceph balancer status
{
    "active": false,
    "last_optimize_duration": "0:00:00.000938",
    "last_optimize_started": "Thu Oct 6 16:19:59 2022",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.071539 > 0.050000) are misplaced; try again later",
    "plans": []
}
(the last optimize result is from yesterday, when I disabled the
balancer; since then the backfill loop has happened several times).
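As a side note, the 0.050000 threshold in the optimize_result seems to
match the mgr's target_max_misplaced_ratio option (default 0.05, if I am
not mistaken), which can be read with:
# ceph config get mgr target_max_misplaced_ratio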
Another possible reason seems to be a mismatch between the pg_num and
pgp_num values. Indeed, I found such a mismatch on one of my pools:
# ceph osd pool get wizard_data pg_num
pg_num: 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123
but I cannot fix it:
# ceph osd pool set wizard_data pgp_num 128
set pool 3 pgp_num to 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123
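In case it helps with the diagnosis, my understanding (which may well be
wrong) is that recent releases apply pgp_num changes gradually through a
pgp_num_target, so the set command may have been accepted as a target
while the actual value catches up during backfill; if so, this should be
visible with:
# ceph osd pool ls detail | grep wizard_data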
The autoscaler is off for that pool, according to ceph osd pool
autoscale-status:
POOL         SIZE   TARGET SIZE  RATE                RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
wizard_data  8951G               1.3333333730697632  152.8T        0.0763                                 1.0   128                 off        False
so I don't understand why the PGP number is stuck at 123.
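For completeness, the per-pool setting can also be read directly (my
assumption being that the AUTOSCALE column reflects the pg_autoscale_mode
pool option):
# ceph osd pool get wizard_data pg_autoscale_mode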
Thanks in advance for any help,
Nicola