Hi Jonas,

I'm impressed, thanks! I have a question about the usage: do I have to turn off the automatic balancing feature (ceph balancer off)? Do the upmap balancer and your customizations get in each other's way, or can I run your script from time to time?

Thanks
Erich

On Mon, 25 Oct 2021 at 14:50, Jonas Jelten <jelten@xxxxxxxxx> wrote:
> Hi Dan,
>
> Basically it's this: when you have a server that is so big that CRUSH can't utilize it the same way as the other, smaller servers because of the placement constraints, the balancer doesn't balance data on the smaller servers any more, because it just "sees" the big one as too empty.
>
> To my understanding the mgr-balancer balances hierarchically, on each CRUSH level.
> It moves PGs between buckets on the same level (i.e. from a too-full rack to a too-empty rack, from a too-full server to a too-empty server, and inside a server from one OSD to another),
> so when there's e.g. an always-too-empty server, it kind of defeats the algorithm and doesn't migrate PGs even when the CRUSH constraints would allow it.
> So it won't move PGs from small-server 1 (with OSDs at ~90% full) to small-server 2 (with OSDs at ~60%), due to server 3 with OSDs at 30%.
> We have servers with 12T drives and some with 1T drives, and various drive counts, so this situation emerged...
> Since I saw how it could be balanced, but wasn't, I wrote the tool.
>
> I also think that the mgr-balancer approach is good, but the hierarchical movements are hard to adjust.
> But yes, I see my balancer as complementary to the mgr-balancer, and for some time I used both (since the mgr-balancer is happy with my movements and just leaves them alone) and it worked well.
>
> -- Jonas
>
> On 20/10/2021 21.41, Dan van der Ster wrote:
> > Hi,
> >
> > I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases.
> >
> > But anyway, I wonder if this sort of higher-order balancing could/should be added as a "part two" to the mgr balancer. The existing code does quite a good job in many (dare I say most?) cases. E.g. it even balances empty clusters perfectly.
> > But once it cannot find a further optimization, maybe a heuristic like yours can further refine the placement...
> >
> > Dan
> >
> > On Wed, 20 Oct 2021, 20:52 Jonas Jelten <jelten@xxxxxxxxx> wrote:
> >
> > Hi Dan,
> >
> > I'm not kidding, these were real-world observations, hence my motivation to create this balancer :)
> > First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach.
> >
> > For us the main reason things got out of balance was this (from the README):
> > > To make things worse, if there's a huge server in the cluster which is so big that CRUSH can't place data on it often enough to fill it to the same level as any other server, the balancer will fail to move PGs across servers that actually would have space.
> > > This happens since it sees only this server's OSDs as "underfull", but each PG has one shard on that server already, so no data can be moved onto it.
> >
> > But all the aspects in that section play together, and I don't think it's easily improvable in mgr-balancer while keeping the same base algorithm.
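
If I follow the hierarchy argument, a toy version of it looks something like this (just my own Python sketch with made-up hosts and capacities to check my understanding -- it's not the real mgr balancer code, which does a lot more than a single fullest-to-emptiest step):

# Toy model: why an always-underfull big host can stall host-level balancing.
# Hosts and capacities are invented; "big" already holds one shard of every PG
# (replica 3, failure domain = host).
capacity = {"small1": 10, "small2": 10, "small3": 10, "big": 120}

pg_hosts = {}  # pg id -> set of hosts holding one shard of it
skipped_small_host = ["small2"] * 4 + ["small3"] * 5 + ["small1"]
for i, skipped in enumerate(skipped_small_host):
    pg_hosts[f"pg{i}"] = {"big"} | ({"small1", "small2", "small3"} - {skipped})

def utilization(host):
    shards = sum(1 for hosts in pg_hosts.values() if host in hosts)
    return shards / capacity[host]

def legal_moves(src, dst):
    """PGs that could move a shard from src to dst without putting
    two shards of the same PG on one host."""
    return [pg for pg, hosts in pg_hosts.items() if src in hosts and dst not in hosts]

by_fill = sorted(capacity, key=utilization, reverse=True)
print({h: round(utilization(h), 2) for h in by_fill})
# {'small1': 0.9, 'small2': 0.6, 'small3': 0.5, 'big': 0.08}

# host-level step as I understand it: only fullest -> emptiest is considered
print(legal_moves(by_fill[0], by_fill[-1]))   # [] -- "big" already has every PG

# yet a move between two small hosts would be perfectly legal:
print(legal_moves("small1", "small3"))        # ['pg4', 'pg5', 'pg6', 'pg7', 'pg8']

So the fullest-to-emptiest step comes up empty even though small1 -> small3 moves are allowed. Is that roughly the situation you ran into?
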
> >
> > Cheers
> > -- Jonas
> >
> > On 20/10/2021 19.55, Dan van der Ster wrote:
> > > Hi Jonas,
> > >
> > > From your readme:
> > >
> > > "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As the PG-distribution optimization is done per pool, without checking any other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
> > >
> > > That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice in our clusters because we do not have multiple large pools on the same OSDs.
> > >
> > > How large are the variances in your real clusters? I hope the example in your readme isn't from real life??
> > >
> > > Cheers, Dan
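
And regarding the "+1 for each pool" passage from the readme quoted above, the way I picture it is something like the following (again only a toy sketch with invented pool sizes, not the actual optimization):

# Toy illustration of the "+1 per pool" effect: each pool is balanced on its
# own, and nothing stops the extra PG from landing on the same OSD every time.
n_osds = 3
pools = {"rbd": 32, "cephfs_data": 16, "rgw_data": 8}   # made-up pool -> pg_num

def best_per_pool_split(pg_num):
    base, extra = divmod(pg_num, n_osds)
    # best possible balance for one pool: 'extra' OSDs carry base+1 PGs
    return [base + 1] * extra + [base] * (n_osds - extra)

totals = [0] * n_osds
for pool, pg_num in pools.items():
    counts = best_per_pool_split(pg_num)    # e.g. [11, 11, 10] for 32 PGs
    print(pool, counts)
    totals = [t + c for t, c in zip(totals, counts)]

print("per-OSD totals:", totals)            # [20, 19, 17]

Each pool on its own is within 1 PG of its ideal, but in this worst case osd.0 ends up with 3 PGs more than osd.2 -- one "+1" per pool -- which I assume is exactly the imbalance the readme is talking about.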