Re: [ceph-users] jj's "improved" ceph balancer

> On Oct 20, 2021, at 1:49 PM, Josh Salomon <jsalomon@xxxxxxxxxx> wrote:
> 
> but in the extreme case (some capacity on 1TB devices and some on 6TB devices) the workload can't be balanced. I

It’s also super easy in such a scenario to

a) Have the larger drives not uniformly spread across failure domains, which can lead to fractional capacity that is unusable because it can’t satisfy the replication policy (first sketch below).

b) Find the OSDs on the larger drives exceeding the configured max-PGs-per-OSD limit and refusing to activate, especially when maintenance, failures, or other topology changes precipitate recovery (second sketch below).  This has bitten me with a mix of 1.x and 3.84 TB drives; I ended up raising the limit to 1000 while I juggled drives, nodes, and clusters so that each cluster ended up with uniformly sized drives.  At smaller scales, of course, that often won’t be an option.
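To put numbers on (a): with R-way replication and one copy per failure domain, the data you can actually store is the largest D such that sum(min(c, D)) >= R * D over the per-domain capacities c.  A minimal sketch, with hypothetical capacities and replicated pools assumed:

    # One host full of 6 TB drives plus three hosts of 1 TB drives, size=3.
    def usable_capacity(domains, replicas):
        lo, hi = 0.0, sum(domains) / replicas
        for _ in range(60):        # binary search; the condition is monotone in D
            mid = (lo + hi) / 2
            if sum(min(c, mid) for c in domains) >= replicas * mid:
                lo = mid
            else:
                hi = mid
        return lo

    domains = [6.0, 1.0, 1.0, 1.0]                 # TB of raw space per failure domain
    print(sum(domains), "TB raw")                  # 9.0 TB raw
    print(round(usable_capacity(domains, 3), 2), "TB of data")   # 1.5, not 9/3 = 3

Half of what the raw/replica arithmetic promises is stranded on the big host.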
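And numbers on (b), with a hypothetical mixed cluster along the lines of the one that bit me:

    # PGs land roughly in proportion to CRUSH weight, so mixed sizes skew hard.
    sizes_tb = [1.1] * 20 + [3.84] * 4        # hypothetical 24-OSD cluster
    pg_replicas = 200 * len(sizes_tb)         # target ~200 PG replicas per OSD on average
    for size in sorted(set(sizes_tb)):
        print(f"{size} TB OSD: ~{pg_replicas * size / sum(sizes_tb):.0f} PG replicas")
    # 1.1 TB -> ~141, 3.84 TB -> ~493: well past the default mon_max_pg_per_osd
    # (250 on recent releases), hence OSDs refusing to activate during recovery.
    # Raising the limit, as I did, is something along the lines of:
    #   ceph config set mon mon_max_pg_per_osd 1000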


> primary affinity can help with a single pool - with multiple pools with different r/w ratio it becomes messy since pa is per device - it could help more if it was per device/pool pair. Also it could be more useful if the values were not 0-1 but 0-replica_count, but this is a usability issue, not functional, it just makes the use more cumbersome. It was designed for a different purpose though so this is not the "right" solution, the right solution is primary balancer.   


Absolutely.  I had the luxury of clusters containing a single pool.  In the above instance, before refactoring the nodes/drives, we achieved an easy 15-20% increase in aggregate read performance by applying a very rough guesstimate of affinities based on OSD size.  The straw-draw factor does complicate deriving the *optimal* mapping of values, especially when topology changes.
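For anyone who wants the lazy version, the guesstimate was in the spirit of the below (hypothetical OSD ids and sizes; it prints commands rather than running them):

    # Scale primary affinity inversely with OSD size so the small, numerous
    # drives take on more primary (i.e. read-serving) duty.
    osds = {0: 1.1, 1: 1.1, 2: 3.84, 3: 3.84}    # osd id -> size in TB
    smallest = min(osds.values())
    for osd_id, size in osds.items():
        affinity = round(smallest / size, 2)      # 1.0 for small, ~0.29 for big
        print(f"ceph osd primary-affinity osd.{osd_id} {affinity}")

As Josh says, the value is per device, so this only really behaves with a single pool.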

I’ve seen someone set the CRUSH weight of larger/outlier OSDs artificially low to balance workload.  All depends on the topology, future plans, and local priorities.
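For the record that trick is a one-liner per OSD, e.g. (hypothetical id and weight):

    ceph osd crush reweight osd.23 2.5    # down from ~3.49, the TiB-based weight of a 3.84 TB drive

It deliberately strands raw capacity on the big drive in exchange for a lighter PG and primary load.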

> I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases.

I read it as describing a cluster where nodes / failure domains have significantly non-uniform CRUSH weights.  That’s suboptimal, but sometimes folks don’t have a choice, or they’re mid-migration between chassis generations.  Back around … Firefly, I think it was, there were a couple of bugs that resulted in undesirable behavior in those scenarios.

— aad




