Re: [ceph-users] jj's "improved" ceph balancer

Hi Dan,

basically it's this: when you have a server so big that CRUSH can't utilize it the same way as the other, smaller servers because of the placement constraints,
the balancer no longer balances data across the smaller servers, because it only "sees" the big one as too empty.

To my understanding, the mgr-balancer balances hierarchically, on each CRUSH level.
It moves PGs between buckets on the same level (e.g. from a too-full rack to a too-empty rack, from a too-full server to a too-empty server, and inside a server from one OSD to another),
so when there is, say, a permanently too-empty server, that defeats the algorithm and it doesn't migrate PGs even when the CRUSH constraints would allow it.
So it won't move PGs from small-server 1 (with OSDs at ~90% full) to small-server 2 (with OSDs at ~60%), because of server 3 with OSDs at ~30%.
We have servers with 12T drives and some with 1T drives, and various drive counts, so this situation emerged...
Since I saw how the cluster could be balanced, but wasn't, I wrote the tool.
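
Roughly, as a toy sketch in Python (this is not the real mgr-balancer code, just my understanding of its per-level heuristic; the server names and fill levels are made up from the example above):

servers = {
    "small-1": 0.90,  # OSDs ~90% full
    "small-2": 0.60,  # OSDs ~60% full
    "huge-3":  0.30,  # OSDs ~30% full, but every PG already has a shard here
}

def pick_move(fullness, pg_has_shard_on):
    # Pair the most-full bucket with the most-empty one on the same CRUSH level.
    src = max(fullness, key=fullness.get)
    dst = min(fullness, key=fullness.get)
    if pg_has_shard_on(dst):
        # CRUSH forbids a second shard of the same PG on the target,
        # so no move is generated at all -- even though
        # small-1 -> small-2 would be perfectly fine.
        return None
    return src, dst

print(pick_move(servers, lambda srv: srv == "huge-3"))  # -> None

As far as I understand, the real balancer works on PG counts and CRUSH weights per bucket rather than byte fullness, but the effect is the same: the most-empty candidate is always the huge server, every PG already has a shard there, and so a small-1 -> small-2 move is never even considered.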

I also think the mgr-balancer approach is good, but the hierarchical movements are hard to adjust.
But yes, I see my balancer as complementary to the mgr-balancer, and for some time I used both (the mgr-balancer is happy with my movements and just leaves them alone), which worked well.

-- Jonas

On 20/10/2021 21.41, Dan van der Ster wrote:
> Hi,
> 
> I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases.
> 
> But anyway, I wonder if this sort of higher-order balancing could/should be added as a "part two" to the mgr balancer. The existing code does quite a good job in many (dare I say most?) cases. E.g. it even balances empty clusters perfectly.
> But after it cannot find a further optimization, maybe a heuristic like yours can further refine the placement...
> 
>  Dan
> 
> 
> On Wed, 20 Oct 2021, 20:52 Jonas Jelten <jelten@xxxxxxxxx> wrote:
> 
>     Hi Dan,
> 
>     I'm not kidding, these were real-world observations, hence my motivation to create this balancer :)
>     First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach.
> 
>     For us the main reason things got out of balance was this (from the README):
>     > To make things worse, if there's a huge server in the cluster which is so big that CRUSH can't place data on it often enough to fill it to the same level as the other servers, the balancer will fail to move PGs across servers that actually would have space.
>     > This happens since it sees only this server's OSDs as "underfull", but each PG already has one shard on that server, so no data can be moved onto it.
> 
>     But all the aspects in that section play together, and I don't think it's easily improvable in mgr-balancer while keeping the same base algorithm.
> 
>     Cheers
>       -- Jonas
> 
>     On 20/10/2021 19.55, Dan van der Ster wrote:
>     > Hi Jonas,
>     >
>     > From your readme:
>     >
>     > "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
>     >
>     > That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice in our clusters because we do not have multiple large pools on the same osds.
>     >
>     > How large are the variances in your real clusters? I hope the example in your readme isn't from real life??
>     >
>     > Cheers, Dan
> 

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



