Re: [ceph-users] jj's "improved" ceph balancer

Josh Salomon <jsalomon@xxxxxxxxxx> · Wed, 20 Oct 2021 23:49:39 +0300

Replies inline.
Regards,
Josh

On Wed, Oct 20, 2021 at 11:25 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Josh,
Okay, but do you agree that for any given pool, the load is uniform across its PGs, right?
Of course, but you still need to work pool by pool to balance the workload. 

Doesn't the existing mgr balancer already balance the PGs for each pool individually? So in your example, the PGs from the loaded pool will be balanced across all osds, as will the idle pool's PGs. So the net load is uniform, right?
Short answer: This is correct, but this is not how JJ's balancer works (IIUC from the readme).
Long answer: The balancer API can work on multiple pools, but the clients (the mgr balancer and osdmaptool) call it pool by pool. IMHO the signature of the method (OSDMap::calc_pg_upmaps) should be changed to reflect this.

OTOH I could see a workload/capacity imbalance if there are mixed capacity but equal performance devices (e.g. a cluster with 50% 6TB HDDs and 50% 12TB HDDs). 
You are absolutely correct, we plan on improving this in future versions (after Quincy). It can be improved to some degree, depending on the r/w ratio of the workloads, but in the extreme case (some capacity on 1TB devices and some on 6TB devices) the workload can't be balanced. In these cases, if you are sensitive to performance, it is recommended not to blend such different devices in the same pool. I talked about such situations in my last SDC presentation (but most of the presentation is on another use case: blending EBS and local storage in the same pool in AWS).
In that case we're probably better off treating the disks as uniform in size until the smaller osds fill up.

As I said, in some cases we can balance the workloads by putting more primaries on the smaller devices, but this works only up to a certain level (it depends on the replica count and on the r/w ratio of the pools; I am not sure yet whether this applies only to EC pools - it can somewhat apply to EC when not in "fast read" mode).
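
To illustrate why moving primaries helps only up to a point, here is a minimal sketch under a deliberately simplified model. The model is an assumption, not the planned balancer: a replicated pool, every PG seeing the same IO, reads served only by the primary, and each write costing one IO on every replica. The function name `primaries_for_balance` and the PG counts are hypothetical.

```python
def primaries_for_balance(pgs_small, pgs_large, replicas, read_fraction):
    """Return (primaries_small, primaries_large) that equalize the IO load of
    one small and one large OSD, or None if no split of primaries can do it.

    Assumed model: load = read_fraction * primaries + write_fraction * pgs,
    where pgs is the number of PG replicas the OSD holds.
    """
    if read_fraction <= 0:
        # With no reads, moving primaries changes nothing in this model.
        return None
    total_primaries = (pgs_small + pgs_large) / replicas
    write_fraction = 1.0 - read_fraction
    # Setting both loads equal gives the required primary surplus on the small OSD.
    surplus = write_fraction * (pgs_large - pgs_small) / read_fraction
    p_small = (total_primaries + surplus) / 2
    p_large = total_primaries - p_small
    # An OSD cannot hold a negative number of primaries, nor more primaries than PGs.
    if not (0 <= p_small <= pgs_small and 0 <= p_large <= pgs_large):
        return None
    return p_small, p_large


# A 6TB OSD holding 100 PG replicas next to a 12TB OSD holding 200, size=3.
read_heavy = primaries_for_balance(100, 200, replicas=3, read_fraction=0.8)
write_heavy = primaries_for_balance(100, 200, replicas=3, read_fraction=0.4)
```

In this model the read-heavy case can be balanced by placing roughly 62.5 of the 100 primaries on the small OSD, while the write-heavy case returns None: the extra write load on the large OSD exceeds what shifting every primary could offset.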

.. Dan

On Wed, 20 Oct 2021, 22:09 Josh Salomon, <jsalomon@xxxxxxxxxx> wrote:
Hi Dan,

Assume you have 2 pools with the same used capacity and the same number of PGs, and one gets 10x the IOs of the other. From a capacity-balancing perspective all the PGs look identical, but devices holding PGs only from the quieter pool will get 10% of the IOs of devices holding PGs only from the busier pool. Under load almost all the load will go to the latter devices while the former will be almost idle, which makes very bad use of the cluster bandwidth.
This is an extreme case, but even when the PGs are blended, just not ideally (even if a single device has more PGs from the loaded pool instead of a 50-50 split), we get a weakest-link-in-the-chain effect on that pool, and under load it will get less than the optimal bandwidth from the cluster.
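
The two-pool scenario above can be sketched numerically. This is an illustration only, assuming load is uniform across a pool's PGs and that every OSD saturates at the same IOPS limit; the pool names, PG counts, IOPS figures and the helpers `osd_loads` and `max_cluster_iops` are all hypothetical.

```python
def osd_loads(placement, pool_iops):
    """placement: {osd: {pool: pg_count}}, pool_iops: {pool: total IOPS}.
    Returns the IOPS landing on each OSD, assuming uniform load per PG."""
    pgs_per_pool = {}
    for counts in placement.values():
        for pool, n in counts.items():
            pgs_per_pool[pool] = pgs_per_pool.get(pool, 0) + n
    return {
        osd: sum(pool_iops[pool] * n / pgs_per_pool[pool] for pool, n in counts.items())
        for osd, counts in placement.items()
    }


def max_cluster_iops(placement, pool_iops, osd_limit):
    """Scale the workload uniformly until the busiest OSD reaches osd_limit."""
    loads = osd_loads(placement, pool_iops)
    return osd_limit / max(loads.values()) * sum(pool_iops.values())


pool_iops = {"hot": 1000, "cold": 100}  # one pool gets 10x the IOs of the other
# Every OSD holds 4 PGs in all three layouts, so capacity looks balanced.
segregated = {0: {"hot": 4}, 1: {"hot": 4}, 2: {"cold": 4}, 3: {"cold": 4}}
blended = {osd: {"hot": 2, "cold": 2} for osd in range(4)}
uneven = {0: {"hot": 3, "cold": 1}, 1: {"hot": 1, "cold": 3},
          2: {"hot": 2, "cold": 2}, 3: {"hot": 2, "cold": 2}}

for name, layout in [("segregated", segregated), ("blended", blended), ("uneven", uneven)]:
    print(name, osd_loads(layout, pool_iops), max_cluster_iops(layout, pool_iops, osd_limit=550))
```

With these made-up numbers the segregated layout puts 500 IOPS on two OSDs and 50 on the other two, the blended layout puts 275 on each, and the uneven layout is capped by the single OSD that holds three PGs of the hot pool.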

IMHO the balancing should also be correct when the cluster is almost full, not only for half-full clusters.

I do agree with the observation of bad +1 PG splits among the OSDs and I believe this should be fixed. I am not sure I fully understood the huge-node use case: if every PG has an OSD in a single node and that node is still underutilized, I don't see how we can improve on this without sacrificing reliability (by putting 2 copies on the same node).

Josh

On Wed, Oct 20, 2021 at 10:56 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Josh,
That's another interesting dimension...
Indeed, a cluster that has plenty of free capacity could be balanced by workload/iops, but once it reaches maybe 60 or 70% full, then I think capacity would need to take priority.

But to be honest I don't really understand the workload/iops balancing use-case. Can you describe some of the scenarios you have in mind?

.. Dan

On Wed, 20 Oct 2021, 20:45 Josh Salomon, <jsalomon@xxxxxxxxxx> wrote:
Just another point of view: The current balancer balances the capacity but this is not enough. The balancer should also balance the workload, and we plan on adding primary balancing for Quincy. In order to balance the workload you should work pool by pool because pools have different workloads. So while the observation about the +1 PGs is correct, I believe the correct solution should be taking this into consideration while still balancing capacity pool by pool.
Capacity balancing is a functional requirement, while workload balancing is a performance requirement, so it is important only for very loaded systems (loaded in terms of high IOPS, not nearly full systems).

I would appreciate comments on this thought. 

On Wed, 20 Oct 2021, 20:57 Dan van der Ster, <dan@xxxxxxxxxxxxxx> wrote:
Hi Jonas,
From your readme:

"the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
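
The arithmetic behind that worst case can be sketched as follows. This only illustrates how per-pool "+1" offsets can stack on one OSD; it is not how either balancer actually selects OSDs, and the helpers `distribute` and `totals` as well as the pool sizes are hypothetical.

```python
def distribute(pg_count, osds, start=0):
    """Give every OSD floor(pg_count / len(osds)) PGs and hand out the
    remainder one by one, beginning at index `start`."""
    base, extra = divmod(pg_count, len(osds))
    counts = {osd: base for osd in osds}
    for i in range(extra):
        counts[osds[(start + i) % len(osds)]] += 1
    return counts


def totals(per_pool_counts):
    """Sum the per-pool PG counts into one total per OSD."""
    out = {}
    for counts in per_pool_counts:
        for osd, n in counts.items():
            out[osd] = out.get(osd, 0) + n
    return out


osds = [0, 1, 2, 3]
pool_pgs = [9, 9, 9, 9]  # four pools, each leaving exactly one "+1" PG

# Worst case: each pool is optimal on its own, but the same OSD is always the +1.
isolated = totals([distribute(n, osds, start=0) for n in pool_pgs])

# Cross-pool aware: each pool starts its remainder where the previous one stopped.
per_pool, start = [], 0
for n in pool_pgs:
    per_pool.append(distribute(n, osds, start))
    start = (start + n % len(osds)) % len(osds)
aware = totals(per_pool)

print(isolated)  # {0: 12, 1: 8, 2: 8, 3: 8}
print(aware)     # {0: 9, 1: 9, 2: 9, 3: 9}
```

Every individual pool is within one PG of ideal in both cases; only the totals across pools differ.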

That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice in our clusters because we do not have multiple large pools on the same osds.

How large are the variances in your real clusters? I hope the example in your readme isn't from real life??

Cheers, Dan

On Wed, 20 Oct 2021, 15:11 Jonas Jelten, <jelten@xxxxxxxxx> wrote:
Hi!

I've been working on this for quite some time now and I think it's ready for some broader testing and feedback.

https://github.com/TheJJ/ceph-balancer

It's an alternative standalone balancer implementation, optimizing for equal OSD storage utilization and PG placement across all pools.

It doesn't change your cluster in any way, it just prints the commands you can run to apply the PG movements.

Please play around with it :)

Quickstart example: generate 10 PG movements on hdd to stdout

    ./placementoptimizer.py -v balance --max-pg-moves 10 --only-crushclass hdd | tee /tmp/balance-upmaps

When there's remapped pgs (e.g. by applying the above upmaps), you can inspect progress with:

    ./placementoptimizer.py showremapped

    ./placementoptimizer.py showremapped --by-osd

And you can get a nice Pool and OSD usage overview:

    ./placementoptimizer.py show --osds --per-pool-count --sort-utilization

Of course there are many more features and optimizations to be added, but it has already served us very well in reclaiming terabytes of previously unavailable storage where the `mgr balancer` could no longer optimize.

What do you think?

Cheers

  -- Jonas

_______________________________________________

ceph-users mailing list -- ceph-users@xxxxxxx

To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________

Dev mailing list -- dev@xxxxxxx

To unsubscribe send an email to dev-leave@xxxxxxx
