Hi,

I don't quite understand your "huge server" scenario, other than a basic
understanding that the balancer cannot do magic in some impossible cases.
But anyway, I wonder if this sort of higher-order balancing could/should
be added as a "part two" to the mgr balancer. The existing code does quite
a good job in many (dare I say most?) cases. E.g. it even balances empty
clusters perfectly. But once it can no longer find a further optimization,
maybe a heuristic like yours can further refine the placement...

Dan

On Wed, 20 Oct 2021, 20:52 Jonas Jelten, <jelten@xxxxxxxxx> wrote:

> Hi Dan,
>
> I'm not kidding, these were real-world observations, hence my motivation
> to create this balancer :)
> First I tried "fixing" the mgr balancer, but after understanding the
> exact algorithm there I thought of a completely different approach.
>
> For us the main reason things got out of balance was this (from the
> README):
>
>     To make things worse, if there's a huge server in the cluster which
>     is so big that CRUSH can't place data on it often enough to fill it
>     to the same level as the other servers, the balancer will fail to
>     move PGs across servers that actually would have space.
>
> This happens since it sees only this server's OSDs as "underfull", but
> each PG already has one shard on that server, so no data can be moved
> onto it.
>
> But all the aspects in that section play together, and I don't think
> it's easily improvable in the mgr balancer while keeping the same base
> algorithm.
>
> Cheers
> -- Jonas
>
> On 20/10/2021 19.55, Dan van der Ster wrote:
> > Hi Jonas,
> >
> > From your readme:
> >
> > "the best possible solution is some OSDs having an offset of 1 PG from
> > the ideal count. As the PG-distribution optimization is done per pool,
> > without checking other pools' distributions at all, some devices will
> > be the +1 more often than others. At worst one OSD is the +1 for each
> > pool in the cluster."
> >
> > That's an interesting observation/flaw which hadn't occurred to me
> > before. I think we never see it in practice in our clusters because we
> > do not have multiple large pools on the same OSDs.
> >
> > How large are the variances in your real clusters? I hope the example
> > in your readme isn't from real life??
> >
> > Cheers, Dan
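
To make the "+1 per pool" effect quoted just above concrete, here is a
minimal Python sketch. It is not the mgr balancer's actual code, and the
pool names, PG counts, and OSD count are all made up; it only shows how a
per-pool even split with a deterministic tie-break stacks the rounding
remainder (+1) onto the same OSDs pool after pool:

    NUM_OSDS = 10

    def per_pool_counts(pg_shards, num_osds):
        # Spread pg_shards as evenly as possible: every OSD gets the floor
        # of the ideal count, and the remainder OSDs get the "+1". A
        # deterministic tie-break hands the +1 to the same low-numbered
        # OSDs for every pool.
        base, extra = divmod(pg_shards, num_osds)
        return [base + (1 if osd < extra else 0) for osd in range(num_osds)]

    # Hypothetical pools: pg_num * replica size, none divisible by 10.
    pools = {"rbd": 64 * 3, "cephfs_data": 128 * 3, "rgw": 32 * 3}

    totals = [0] * NUM_OSDS
    for name, shards in pools.items():
        counts = per_pool_counts(shards, NUM_OSDS)
        totals = [t + c for t, c in zip(totals, counts)]
        print(f"{name:12s} {counts}")
    print(f"{'total':12s} {totals}")
    # Each pool alone is within +/-1 PG of ideal, but osd.0 and osd.1
    # collect the +1 for all three pools, so the per-OSD totals drift
    # apart (69 vs. 66 here) as more pools share the same OSDs.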
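
And a similarly rough sketch of the "huge server" dead-end from earlier in
the thread, again with invented hosts, OSDs, and PG mappings rather than
anything from a real cluster: with one shard per host as the failure
domain, every candidate PG already touches the big host, so no PG can
legally move onto its underfull OSDs even though they have free space:

    # osd -> host; host "big" holds the underfull OSDs
    osd_host = {0: "big", 1: "big", 2: "a", 3: "b", 4: "c"}

    # pg -> OSDs holding its shards; every PG already touches "big"
    pgs = {
        "1.0": [0, 2, 3],
        "1.1": [1, 3, 4],
        "1.2": [0, 2, 4],
    }

    def can_move_shard_to(pg_osds, target_osd):
        # With hosts as the failure domain, a PG may gain a shard on an
        # OSD only if it has no shard on that OSD's host yet.
        target_host = osd_host[target_osd]
        return all(osd_host[osd] != target_host for osd in pg_osds)

    underfull = 1  # osd.1 on host "big" has plenty of free space
    movable = [pg for pg, osds in pgs.items()
               if can_move_shard_to(osds, underfull)]
    print(movable)  # [] -- every PG already has a shard on "big"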