Infinite backfill loop + pgp_num stuck at wrong value

Dear Ceph users,

my cluster has been stuck for several days with some PGs backfilling. The number of misplaced objects slowly decreases to about 5%, then jumps back up to about 7%, and so on. I found several possible causes for this behavior. One is the balancer, which however I believe is not active:

# ceph balancer status
{
    "active": false,
    "last_optimize_duration": "0:00:00.000938",
    "last_optimize_started": "Thu Oct  6 16:19:59 2022",
    "mode": "upmap",
"optimize_result": "Too many objects (0.071539 > 0.050000) are misplaced; try again later",
    "plans": []
}

(the last optimize result is from yesterday, when I disabled it, and the backfill loop has happened several times since then). Another possible cause seems to be a mismatch between pg_num and pgp_num. Indeed, I found such a mismatch on one of my pools:

# ceph osd pool get wizard_data pg_num
pg_num: 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123

but I cannot fix it:
# ceph osd pool set wizard_data pgp_num 128
set pool 3 pgp_num to 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123
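
If it helps, this is what I was planning to check next. I'm assuming here (not sure it applies to my release) that the mgr raises pgp_num gradually towards a separate pgp_num_target rather than applying the new value at once, in which case the target should be visible in the osdmap:

# ceph osd dump | grep wizard_data    # on recent releases the pool line should also show pgp_num_target when it differs from pgp_num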

The autoscaler is off for that pool:

POOL         SIZE   TARGET SIZE  RATE                RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
wizard_data  8951G               1.3333333730697632  152.8T        0.0763                                 1.0   128                 off        False

so I don't understand why pgp_num is stuck at 123.
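
One more thing I intend to look at is the 0.050000 threshold from the balancer output, since (if I understand correctly) the same 5% misplaced-objects limit can also throttle other changes that would move data around; I'm assuming the option name below is right for my release:

# ceph config get mgr target_max_misplaced_ratio    # should print 0.050000 unless it was changed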
Thanks in advance for any help,

Nicola
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
