Re: [ceph-users] jj's "improved" ceph balancer

Hi,

I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases.

But anyway, I wonder if this sort of higher-order balancing could/should be added as a "part two" to the mgr balancer. The existing code does quite a good job in many (dare I say most?) cases, e.g. it even balances empty clusters perfectly.
But once it cannot find a further optimization, maybe a heuristic like yours could refine the placement further...
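
To make that concrete, here's a toy sketch of what such a second pass could do (invented numbers, nothing to do with the actual mgr balancer code): once PG counts per OSD are equal, keep swapping differently sized PGs between the fullest and the emptiest OSD as long as that narrows the byte spread.

    # Toy refinement pass (made-up numbers, not the mgr balancer's code):
    # PG counts are already equal (3 per OSD), but PG sizes differ, so the
    # byte-level spread can still be narrowed by swapping PGs between the
    # fullest and the emptiest OSD.

    osd_pgs = {                     # hypothetical PG sizes in GiB per OSD
        0: [120, 110, 100],         # 330 GiB used
        1: [100,  95,  90],         # 285 GiB used
        2: [ 80,  75,  70],         # 225 GiB used
    }

    def used(osd):
        return sum(osd_pgs[osd])

    def spread():
        usage = [used(o) for o in osd_pgs]
        return max(usage) - min(usage)

    def try_one_swap():
        """Swap one PG between the fullest and the emptiest OSD if that
        narrows the spread; return True if a swap was applied."""
        full = max(osd_pgs, key=used)
        empty = min(osd_pgs, key=used)
        best = spread()
        for i, a in enumerate(osd_pgs[full]):
            for j, b in enumerate(osd_pgs[empty]):
                osd_pgs[full][i], osd_pgs[empty][j] = b, a
                if spread() < best:
                    return True
                osd_pgs[full][i], osd_pgs[empty][j] = a, b   # undo
        return False

    while try_one_swap():
        pass

    print(spread())   # 10 GiB instead of the initial 105, PG counts unchanged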

 Dan


On Wed, 20 Oct 2021, 20:52 Jonas Jelten, <jelten@xxxxxxxxx> wrote:
Hi Dan,

I'm not kidding, these were real-world observations, hence my motivation to create this balancer :)
First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach.

For us the main reason things got out of balance was this (from the README):
> To make things worse, if there's a huge server in the cluster which is so big that CRUSH can't place data on it often enough to fill it to the same level as any other server, the balancer will fail to move PGs across servers that actually would have space.
> This happens since it sees only this server's OSDs as "underfull", but each PG already has one shard on that server, so no data can be moved onto it.

But all the aspects in that section play together, and I don't think it's easily improvable in the mgr balancer while keeping the same base algorithm.
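
To make that failure mode concrete, here is a toy model (hypothetical host names and placements, not code from either balancer):

    # Toy model of the "huge server" trap (placements invented):
    # 3-way replication, failure domain = host, and the huge host "A" is so
    # big that CRUSH already put one shard of every PG on it.

    pgs = {
        "1.0": ["A", "B", "C"],
        "1.1": ["A", "C", "D"],
        "1.2": ["A", "D", "E"],
        "1.3": ["A", "B", "E"],
    }

    underfull_host = "A"    # its OSDs report the lowest utilization

    # A PG can only be moved onto A if it does not already hold a shard
    # there (at most one shard per failure domain).
    movable = [pgid for pgid, acting in pgs.items()
               if underfull_host not in acting]

    print(movable)   # [] -- no candidates, so A stays "underfull" forever
    # Moving PGs between B..E doesn't change A's utilization either, so the
    # PG-count view looks optimal while the bytes stay skewed.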

Cheers
  -- Jonas

On 20/10/2021 19.55, Dan van der Ster wrote:
> Hi Jonas,
>
> From your readme:
>
> "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
>
> That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice in our clusters because we do not have multiple large pools on the same OSDs.
>
> How large are the variances in your real clusters? I hope the example in your readme isn't from real life??
>
> Cheers, Dan
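
As a worked illustration of the "+1 for each pool" observation quoted from the README above (all pool and OSD counts invented for the example):

    # Worked example of the "+1 per pool" effect (all numbers invented):
    # 10 OSDs, 4 pools, each pool with 64 PGs and 3 replicas.

    osds = 10
    pools = 4
    pg_shards_per_pool = 64 * 3            # pg_num * replica count = 192

    base, remainder = divmod(pg_shards_per_pool, osds)
    print(base, remainder)                 # 19 shards per OSD, 2 OSDs get +1

    # Each pool is optimized on its own, so nothing prevents the *same* OSD
    # from being one of the "+1" OSDs in all 4 pools:
    worst_osd = pools * (base + 1)         # 80 shards
    best_osd  = pools * base               # 76 shards
    print(worst_osd - best_osd)            # 4 extra PG shards on one OSD

    # With large PGs those extra shards translate directly into extra bytes,
    # even though every individual pool looks perfectly balanced.
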
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
