Hi,
we have often seen strange behavior and questionable pg targets from
pg_autoscaler over the last few years.
That's why we disable it globally.
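For example, roughly like this (a sketch: the cluster-wide noautoscale
flag only exists in newer releases, and <pool> is a placeholder):
ceph osd pool set noautoscale                                    # cluster-wide off switch (newer releases)
ceph config set global osd_pool_default_pg_autoscale_mode off    # default for newly created pools
ceph osd pool set <pool> pg_autoscale_mode off                   # per existing pool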
The commands:
ceph osd reweight-by-utilization
ceph osd test-reweight-by-utilization
date from before the upmap balancer was introduced and do not solve the
problem in the long run on an active cluster.
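Switching to the upmap balancer looks roughly like this (assuming all
clients are luminous or newer):
ceph osd set-require-min-compat-client luminous    # upmap requires luminous+ clients
ceph balancer mode upmap
ceph balancer on
ceph balancer status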
I have seen the balancer skip a pool during rebalancing more than once.
Why pg_autoscaler behaves this way would have to be analyzed in more
detail. As mentioned above, we normally turn it off.
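A first look for such an analysis would be the standard status commands
(pool name taken from the thread below):
ceph osd pool autoscale-status                        # target vs. actual pg_num per pool
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd pool get default.rgw.buckets.data pgp_num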
Eugen's idea would help once the pool rebalancing is done.
Joachim
___________________________________
ceph ambassador DACH
ceph consultant since 2012
Clyso GmbH - Premier Ceph Foundation Member
https://www.clyso.com/
On 22.09.23 at 11:22, bc10@xxxxxxxxxxxx wrote:
Hi Folks,
We are currently running with one nearfull OSD and 15 nearfull pools. The most full OSD is about 86% full, but the average is 58% full. However, the balancer is skipping a pool on which the autoscaler is trying to complete a pg_num reduction from 131,072 to 32,768 (the default.rgw.buckets.data pool). The autoscaler has been working on this for the last 20 days: it works through a list of misplaced objects, but when it gets close to the end, more objects get added to the list.
This morning I observed the list get down to c. 7,000 misplaced objects with 2 PGs active+remapped+backfilling; one PG completed its backfill, then the list shot up to c. 70,000 misplaced objects with 3 PGs active+remapped+backfilling.
Has anyone come across this behaviour before? If so, what was your remediation?
Thanks in advance for sharing.
Bruno
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx