Balancer blocked as autoscaler not acting on scaling change

bc10@xxxxxxxxxxxx · Fri, 22 Sep 2023 09:22:39 -0000

Hi Folks,

We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full but the average is 58% full. However, the balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (default.rgw.buckets.data pool). However, the autoscaler has been working on this for the last 20 days, it works through a list of objects that are misplaced but when it gets close to the end, more objects get added to the list.

This morning I observed the list get down to c. 7,000 objects misplaced with 2 PGs active+remapped+backfilling, one PG completed the backfilling then the list shot up to c. 70,000 objects misplaced with 3 PGs active+remapped+backfilling.

Has anyone come across this behaviour before? If so, what was your remediation?

Thanks in advance for sharing.
Bruno

Cluster details:
3,068 OSDs when all running, c. 60 per storage node
OS: Ubuntu 20.04
Ceph: Pacific 16.2.13 from Ubuntu Cloud Archive

Use case:
S3 storage and OpenStack backend, all pools three-way replicated
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx