Hi Jonas,

I'm impressed, thanks! I have a question about the usage: do I have to turn off the automatic balancing feature (ceph balancer off)? Do the upmap balancer and your customizations get in each other's way, or can I run your script from time to time?

Thanks
Erich

On Mon, 25 Oct 2021 at 14:50, Jonas Jelten <jelten@xxxxxxxxx> wrote:
> Hi Dan,
>
> Basically it's this: when you have a server that is so big that CRUSH can't utilize it the same way as the other, smaller servers because of the placement constraints, the balancer doesn't balance data on the smaller servers any more, because it just "sees" the big one as too empty.
>
> To my understanding the mgr-balancer balances hierarchically, on each CRUSH level.
> It moves PGs between buckets on the same level (i.e. from a too-full rack to a too-empty rack, from a too-full server to a too-empty server, and inside a server from one OSD to another),
> so when there's e.g. an always-too-empty server, it kind of defeats the algorithm and doesn't migrate PGs even when the CRUSH constraints would allow it.
> So it won't move PGs from small-server 1 (with OSDs at ~90% full) to small-server 2 (with OSDs at ~60%), due to server 3 with OSDs at 30%.
> We have servers with 12T drives and some with 1T drives, and various drive counts, so this situation emerged...
> Since I saw how it could be balanced, but wasn't, I wrote the tool.
>
> I also think that the mgr-balancer approach is good, but the hierarchical movements are hard to adjust.
> But yes, I see my balancer as complementary to the mgr-balancer, and for some time I used both (since the mgr-balancer is happy with my movements and just leaves them alone) and it worked well.
>
> -- Jonas
>
> On 20/10/2021 21.41, Dan van der Ster wrote:
> > Hi,
> >
> > I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases.
> >
> > But anyway, I wonder if this sort of higher-order balancing could/should be added as a "part two" to the mgr balancer. The existing code does quite a good job in many (dare I say most?) cases. E.g. it even balances empty clusters perfectly.
> > But once it cannot find a further optimization, maybe a heuristic like yours can further refine the placement...
> >
> > Dan
> >
> > On Wed, 20 Oct 2021, 20:52 Jonas Jelten <jelten@xxxxxxxxx> wrote:
> >
> > Hi Dan,
> >
> > I'm not kidding, these were real-world observations, hence my motivation to create this balancer :)
> > First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach.
> >
> > For us the main reason things got out of balance was this (from the README):
> > > To make things worse, if there's a huge server in the cluster which is so big that CRUSH can't place data on it often enough to fill it to the same level as any other server, the balancer will fail to move PGs across servers that actually would have space.
> > > This happens since it sees only this server's OSDs as "underfull", but each PG has one shard on that server already, so no data can be moved onto it.
> >
> > But all the aspects in that section play together, and I don't think it's easily improvable in mgr-balancer while keeping the same base algorithm.
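
If I follow the hierarchy argument, a toy version of it looks something like this (just my own Python sketch with made-up hosts and capacities to check my understanding -- it's not the real mgr balancer code, which does a lot more than a single fullest-to-emptiest step):

# Toy model: why an always-underfull big host can stall host-level balancing.
# Hosts and capacities are invented; "big" already holds one shard of every PG
# (replica 3, failure domain = host).
capacity = {"small1": 10, "small2": 10, "small3": 10, "big": 120}

pg_hosts = {}  # pg id -> set of hosts holding one shard of it
skipped_small_host = ["small2"] * 4 + ["small3"] * 5 + ["small1"]
for i, skipped in enumerate(skipped_small_host):
    pg_hosts[f"pg{i}"] = {"big"} | ({"small1", "small2", "small3"} - {skipped})

def utilization(host):
    shards = sum(1 for hosts in pg_hosts.values() if host in hosts)
    return shards / capacity[host]

def legal_moves(src, dst):
    """PGs that could move a shard from src to dst without putting
    two shards of the same PG on one host."""
    return [pg for pg, hosts in pg_hosts.items() if src in hosts and dst not in hosts]

by_fill = sorted(capacity, key=utilization, reverse=True)
print({h: round(utilization(h), 2) for h in by_fill})
# {'small1': 0.9, 'small2': 0.6, 'small3': 0.5, 'big': 0.08}

# host-level step as I understand it: only fullest -> emptiest is considered
print(legal_moves(by_fill[0], by_fill[-1]))   # [] -- "big" already has every PG

# yet a move between two small hosts would be perfectly legal:
print(legal_moves("small1", "small3"))        # ['pg4', 'pg5', 'pg6', 'pg7', 'pg8']

So the fullest-to-emptiest step comes up empty even though small1 -> small3 moves are allowed. Is that roughly the situation you ran into?
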
> >
> > Cheers
> > -- Jonas
> >
> > On 20/10/2021 19.55, Dan van der Ster wrote:
> > > Hi Jonas,
> > >
> > > From your readme:
> > >
> > > "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As the PG-distribution optimization is done per pool, without checking any other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
> > >
> > > That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice in our clusters because we do not have multiple large pools on the same OSDs.
> > >
> > > How large are the variances in your real clusters? I hope the example in your readme isn't from real life??
> > >
> > > Cheers, Dan
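
And regarding the "+1 for each pool" passage from the readme quoted above, the way I picture it is something like the following (again only a toy sketch with invented pool sizes, not the actual optimization):

# Toy illustration of the "+1 per pool" effect: each pool is balanced on its
# own, and nothing stops the extra PG from landing on the same OSD every time.
n_osds = 3
pools = {"rbd": 32, "cephfs_data": 16, "rgw_data": 8}   # made-up pool -> pg_num

def best_per_pool_split(pg_num):
    base, extra = divmod(pg_num, n_osds)
    # best possible balance for one pool: 'extra' OSDs carry base+1 PGs
    return [base + 1] * extra + [base] * (n_osds - extra)

totals = [0] * n_osds
for pool, pg_num in pools.items():
    counts = best_per_pool_split(pg_num)    # e.g. [11, 11, 10] for 32 PGs
    print(pool, counts)
    totals = [t + c for t, c in zip(totals, counts)]

print("per-OSD totals:", totals)            # [20, 19, 17]

Each pool on its own is within 1 PG of its ideal, but in this worst case osd.0 ends up with 3 PGs more than osd.2 -- one "+1" per pool -- which I assume is exactly the imbalance the readme is talking about.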