Hi,

I don't quite understand your "huge server" scenario, other than a basic
understanding that the balancer cannot do magic in some impossible cases.
But anyway, I wonder if this sort of higher-order balancing could/should
be added as a "part two" to the mgr balancer. The existing code does quite
a good job in many (dare I say most?) cases. E.g. it even balances empty
clusters perfectly. But once it can no longer find a further optimization,
maybe a heuristic like yours can further refine the placement...

Dan

On Wed, 20 Oct 2021, 20:52 Jonas Jelten, <jelten@xxxxxxxxx> wrote:

> Hi Dan,
>
> I'm not kidding, these were real-world observations, hence my motivation
> to create this balancer :)
> First I tried "fixing" the mgr balancer, but after understanding the
> exact algorithm there I thought of a completely different approach.
>
> For us the main reason things got out of balance was this (from the
> README):
>
>     To make things worse, if there's a huge server in the cluster which
>     is so big that CRUSH can't place data on it often enough to fill it
>     to the same level as the other servers, the balancer will fail to
>     move PGs across servers that actually would have space.
>
> This happens since it sees only this server's OSDs as "underfull", but
> each PG already has one shard on that server, so no data can be moved
> onto it.
>
> But all the aspects in that section play together, and I don't think
> it's easily improvable in the mgr balancer while keeping the same base
> algorithm.
>
> Cheers
> -- Jonas
>
> On 20/10/2021 19.55, Dan van der Ster wrote:
> > Hi Jonas,
> >
> > From your readme:
> >
> > "the best possible solution is some OSDs having an offset of 1 PG from
> > the ideal count. As the PG-distribution optimization is done per pool,
> > without checking other pools' distributions at all, some devices will
> > be the +1 more often than others. At worst one OSD is the +1 for each
> > pool in the cluster."
> >
> > That's an interesting observation/flaw which hadn't occurred to me
> > before. I think we never see it in practice in our clusters because we
> > do not have multiple large pools on the same OSDs.
> >
> > How large are the variances in your real clusters? I hope the example
> > in your readme isn't from real life??
> >
> > Cheers, Dan
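
To make the "+1 per pool" effect quoted just above concrete, here is a
minimal Python sketch. It is not the mgr balancer's actual code, and the
pool names, PG counts, and OSD count are all made up; it only shows how a
per-pool even split with a deterministic tie-break stacks the rounding
remainder (+1) onto the same OSDs pool after pool:

    NUM_OSDS = 10

    def per_pool_counts(pg_shards, num_osds):
        # Spread pg_shards as evenly as possible: every OSD gets the floor
        # of the ideal count, and the remainder OSDs get the "+1". A
        # deterministic tie-break hands the +1 to the same low-numbered
        # OSDs for every pool.
        base, extra = divmod(pg_shards, num_osds)
        return [base + (1 if osd < extra else 0) for osd in range(num_osds)]

    # Hypothetical pools: pg_num * replica size, none divisible by 10.
    pools = {"rbd": 64 * 3, "cephfs_data": 128 * 3, "rgw": 32 * 3}

    totals = [0] * NUM_OSDS
    for name, shards in pools.items():
        counts = per_pool_counts(shards, NUM_OSDS)
        totals = [t + c for t, c in zip(totals, counts)]
        print(f"{name:12s} {counts}")
    print(f"{'total':12s} {totals}")
    # Each pool alone is within +/-1 PG of ideal, but osd.0 and osd.1
    # collect the +1 for all three pools, so the per-OSD totals drift
    # apart (69 vs. 66 here) as more pools share the same OSDs.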
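
And a similarly rough sketch of the "huge server" dead-end from earlier in
the thread, again with invented hosts, OSDs, and PG mappings rather than
anything from a real cluster: with one shard per host as the failure
domain, every candidate PG already touches the big host, so no PG can
legally move onto its underfull OSDs even though they have free space:

    # osd -> host; host "big" holds the underfull OSDs
    osd_host = {0: "big", 1: "big", 2: "a", 3: "b", 4: "c"}

    # pg -> OSDs holding its shards; every PG already touches "big"
    pgs = {
        "1.0": [0, 2, 3],
        "1.1": [1, 3, 4],
        "1.2": [0, 2, 4],
    }

    def can_move_shard_to(pg_osds, target_osd):
        # With hosts as the failure domain, a PG may gain a shard on an
        # OSD only if it has no shard on that OSD's host yet.
        target_host = osd_host[target_osd]
        return all(osd_host[osd] != target_host for osd in pg_osds)

    underfull = 1  # osd.1 on host "big" has plenty of free space
    movable = [pg for pg, osds in pgs.items()
               if can_move_shard_to(osds, underfull)]
    print(movable)  # [] -- every PG already has a shard on "big"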