Re: [ceph-users] jj's "improved" ceph balancer

Hi Jonas,

I want to clarify my thoughts regarding balancing in general (this may get long).

1 - Balancing capacity correctly is the top priority, because the system is effectively as full as its fullest device, and a storage system cannot afford large amounts of capacity that are wasted and unusable. This is a top functional requirement.
2 - Workload balancing is a performance requirement, and an important one, but we should not optimize workload at the expense of capacity, so the challenge is how to do both simultaneously. (Hint: it is not always possible, and when it is not possible the system performs below the aggregated performance of its devices.)

Assumption 1: Within a pool, the workload on a PG is linear with its capacity, which means either all PGs have the same workload (when #PGs is a power of 2) or some PGs have exactly twice the load of the others. From now on I will assume the number of PGs is a power of 2, since the adjustments for the other case are pretty simple.
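
To illustrate the non-power-of-2 case, here is a minimal sketch (my own illustration, not Ceph code) that assumes the usual power-of-two folding of the object hash space, where the overflowing hash slots are folded back onto existing PG ids:

    # Illustrative sketch: how many PGs carry double the load when
    # pg_num is not a power of two (assumes power-of-two folding of the
    # object hash space; not actual Ceph code).
    def pg_load_shares(pg_num: int) -> dict:
        """Return the relative load share (1 or 2) per PG id."""
        upper = 1
        while upper < pg_num:
            upper *= 2                      # next power of two >= pg_num
        lower = upper // 2
        shares = {pg: 1 for pg in range(pg_num)}
        # hash slots in [pg_num, upper) fold back onto lower PG ids
        for slot in range(pg_num, upper):
            shares[slot - lower] += 1
        return shares

    doubles = [pg for pg, s in pg_load_shares(12).items() if s == 2]
    print("pg_num=12, PGs with double share:", doubles)    # -> [4, 5, 6, 7]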

Conclusion 1: Balancing capacity based on all the PGs in the system may cause workload imbalance - balancing capacity should be done on a pool-by-pool basis. (Assume 2 pools, H(ot) and C(old), with exactly the same settings (#PGs, capacity and protection scheme). If you balance on PG capacity only, you can end up with one device holding all its PGs from pool C and another holding all its PGs from pool H - the second device will be fully loaded while the first device sits idle.)
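
A toy calculation of that hot/cold example (purely illustrative, with made-up per-PG workloads):

    # Toy model of Conclusion 1: two pools with equal capacity but very
    # different workloads, placed on two devices. Numbers are made up.
    IOPS_PER_PG = {"hot": 100, "cold": 1}

    def device_load(placement):
        """placement: device -> {pool: pg_count}; returns IOPS per device."""
        return {dev: sum(n * IOPS_PER_PG[pool] for pool, n in pools.items())
                for dev, pools in placement.items()}

    # Capacity-only balancing: each device holds 8 PGs, but pools are segregated.
    capacity_only = {"osd.0": {"hot": 8, "cold": 0}, "osd.1": {"hot": 0, "cold": 8}}
    # Pool-by-pool balancing: each device holds half of each pool.
    per_pool      = {"osd.0": {"hot": 4, "cold": 4}, "osd.1": {"hot": 4, "cold": 4}}

    print(device_load(capacity_only))   # {'osd.0': 800, 'osd.1': 8}   -> one device idle
    print(device_load(per_pool))        # {'osd.0': 404, 'osd.1': 404} -> even workload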

On the other hand, your point about the +-1 PGs when working on a pool-by-pool basis is correct, and that should be fixed.

When all the devices are identical, the other thing we need to do for balancing the workload is balancing the primaries (on a pool-by-pool basis) - this means that once the capacity is balanced (every OSD has the same number of PGs per pool), every OSD should also have the same number of primaries (+-1) per pool. This is mainly important for replicated pools; for EC pools it matters (but is less critical) when working without "fast read" mode, and it has no effect on EC pools with "fast read" enabled. (For EC pools we need to balance the N data OSDs out of N+K, not only the primaries - think of replica-3 as a special case of EC with 1+2.)
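
A quick way to see how the primaries are currently spread is counting them per pool from `ceph pg dump`; a minimal sketch (the JSON field names such as `acting_primary` are from memory and may differ between Ceph releases):

    # Sketch: count acting primaries per OSD, per pool, from `ceph pg dump`.
    # Field names (pg_map/pg_stats, pgid, acting_primary) are assumptions
    # and may vary between Ceph versions.
    import json, subprocess
    from collections import defaultdict

    out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    data = json.loads(out)
    pg_stats = data.get("pg_map", data).get("pg_stats", [])

    counts = defaultdict(lambda: defaultdict(int))    # pool -> osd -> primary count
    for pg in pg_stats:
        pool = pg["pgid"].split(".")[0]
        counts[pool][pg["acting_primary"]] += 1

    for pool, per_osd in sorted(counts.items()):
        print("pool", pool, dict(sorted(per_osd.items())))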

Now, what happens when the devices are not identical?
In the case of mixed technologies (SSD and HDD) - this is not recommended, but you can find some use cases for it in my SDC presentation - without going into deep detail, the easiest solution is to make all the much faster devices (much faster as in HDD vs. SSD, or SSD vs. PM) always primaries and the slow devices never primaries (assuming you always keep at least one copy on a fast device). More on this in the presentation.
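
One way to implement "fast devices are always primaries" is through primary affinity; a minimal sketch in the spirit of the balancer tool (it only prints the commands, and the OSD id lists are placeholders):

    # Sketch: prefer fast OSDs as primaries via `ceph osd primary-affinity`.
    # The id lists are placeholders; as noted above, this assumes every PG
    # keeps at least one replica on a fast device.
    fast_osds = [0, 1, 2]     # e.g. SSD-backed OSDs
    slow_osds = [3, 4, 5]     # e.g. HDD-backed OSDs

    for osd in fast_osds:
        print(f"ceph osd primary-affinity osd.{osd} 1.0")   # prefer as primary
    for osd in slow_osds:
        print(f"ceph osd primary-affinity osd.{osd} 0.0")   # avoid as primary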

The last case is when there are relatively minor performance differences between the devices (HDDs with different RPM rates, or devices of the same technology but different sizes, without a huge difference - I believe that when one device has X times the capacity of the others and X > replica-count, we can't balance any more, but I still need to complete my calculations). In these cases, assuming we know something about the workload (the R/W ratio), we can balance the workload by giving more primaries to the faster or smaller devices relative to the slower or larger ones. This may not be optimal, but it can improve performance; obviously it will not help write-only workloads, but the improvement grows as the ratio of reads gets higher.
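
A back-of-the-envelope model of that idea (made-up numbers, and "primaries proportional to speed" is just one possible heuristic, not the planned algorithm): reads are served by the primary only while writes hit every replica, so shifting primaries toward faster devices shifts only the read portion of the load:

    # Toy model: give each device a primary share proportional to its
    # relative speed and look at the resulting load vs. capability.
    # All numbers are made up; replica-3 pool spanning all three devices.
    read_fraction = 0.7                                    # 70% reads, 30% writes
    speed = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 0.5}     # relative capability

    total_speed = sum(speed.values())
    primary_share = {d: s / total_speed for d, s in speed.items()}

    for dev, cap in speed.items():
        # reads land on the primary only; every write lands on every replica
        load = read_fraction * primary_share[dev] + (1 - read_fraction)
        print(f"{dev}: relative load {load:.2f} vs capability {cap}")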

So to summarize - we first need to balance capacity as perfectly as possible, but if we care about performance we should make sure that the capacity of each pool is balanced almost perfectly on its own. Then we adjust the primaries, based on the devices we have and the workload per pool, in order to split the workload evenly among the devices. When there is large variance among the devices serving the same pool, perfect workload balancing may not be achievable, but we can try to find an optimal one for the configuration and workload we have.

Having said all that - I really appreciate your work, and I went over it briefly. My only comment on what you did is that it should somehow work pool by pool and manage the +-1 globally.

Regards,

Josh


On Fri, Oct 22, 2021 at 1:32 PM Jonas Jelten <jelten@xxxxxxxxx> wrote:
Hi!
How would you balance the workload? We could distribute PGs independently of the OSD sizes, assuming that an HDD can
handle a low and constant number of iops, say 250, no matter how big it is. If we distributed PGs by predicted
device iops instead, we would optimize better for workload.

My balancer (and the mgr balancer) calculates the ideal PG-per-size rate for a pool and multiplies it by the device size.
This pgs-per-size value is the main driver for movements in the mgr balancer; for the jj-balancer it's just another constraint
that has to be fulfilled.

pgs-per-iops and pgs-per-size are probably more or less the same for NVMe devices, but definitely not for HDDs, so
different optimization strategies would be needed.
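
To make the contrast concrete, a small sketch of the two per-OSD targets (an illustration only, not code from either balancer) - one proportional to device size, one proportional to an assumed per-device iops budget:

    # Two possible per-OSD targets for one pool (illustration only):
    # PG shards placed proportional to capacity vs. proportional to iops.
    def target_by_weight(total_shards, weight, weights):
        return total_shards * weight / sum(weights)

    shards = 1024 * 3                   # hypothetical 1024-PG replica-3 pool
    sizes = [8, 8, 16, 16]              # TB per OSD (made up)
    iops  = [250, 250, 250, 250]        # assume ~250 iops per HDD, size-independent

    for i in range(len(sizes)):
        print(f"osd.{i}: by-size {target_by_weight(shards, sizes[i], sizes):.0f} shards,"
              f" by-iops {target_by_weight(shards, iops[i], iops):.0f} shards")
    # the 16 TB OSDs get twice the shards by size, but the same number by iops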

-- Jonas


On 20/10/2021 20.44, Josh Salomon wrote:
> Just another point of view:
> The current balancer balances the capacity but this is not enough. The balancer should also balance the workload and we
> plan on adding primary balancing for Quincy. In order to balance the workload you should work pool by pool because pools
> have different workloads. So while the observation about the +1 PGs is correct, I believe the correct solution should be
> taking this into consideration while still balancing capacity pool by pool.
> Capacity balancing is a functional requirement, while workload balancing is a performance requirement so it is important
> only for very loaded systems (loaded in terms of high IOPS not nearly full systems)
>
> I would appreciate comments on this thought.
>
> On Wed, 20 Oct 2021, 20:57 Dan van der Ster, <dan@xxxxxxxxxxxxxx <mailto:dan@xxxxxxxxxxxxxx>> wrote:
>
>     Hi Jonas,
>
>      From your readme:
>
>     "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a
>     PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will
>     be the +1 more often than others. At worst one OSD is the +1 for each pool in the cluster."
>
>     That's an interesting observation/flaw which hadn't occurred to me before. I think we don't ever see it in practice
>     in our clusters because we do not have multiple large pools on the same osds.
>
>     How large are the variances in your real clusters? I hope the example in your readme isn't from real life??
>
>     Cheers, Dan
>
>     On Wed, 20 Oct 2021, 15:11 Jonas Jelten, <jelten@xxxxxxxxx <mailto:jelten@xxxxxxxxx>> wrote:
>
>         Hi!
>
>         I've been working on this for quite some time now and I think it's ready for some broader testing and feedback.
>
>         https://github.com/TheJJ/ceph-balancer <https://github.com/TheJJ/ceph-balancer>
>
>         It's an alternative standalone balancer implementation, optimizing for equal OSD storage utilization and PG
>         placement across all pools.
>
>         It doesn't change your cluster in any way, it just prints the commands you can run to apply the PG movements.
>         Please play around with it :)
>
>         Quickstart example: generate 10 PG movements on hdd to stdout
>
>              ./placementoptimizer.py -v balance --max-pg-moves 10 --only-crushclass hdd | tee /tmp/balance-upmaps
>
>         When there are remapped PGs (e.g. after applying the above upmaps), you can inspect the progress with:
>
>              ./placementoptimizer.py showremapped
>              ./placementoptimizer.py showremapped --by-osd
>
>         And you can get a nice Pool and OSD usage overview:
>
>              ./placementoptimizer.py show --osds --per-pool-count --sort-utilization
>
>
>         Of course there are many more features and optimizations to be added,
>         but it has already served us very well in reclaiming terabytes of previously unavailable storage where the `mgr
>         balancer` could no longer optimize.
>
>         What do you think?
>
>         Cheers
>            -- Jonas
>         _______________________________________________
>         ceph-users mailing list -- ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>
>         To unsubscribe send an email to ceph-users-leave@xxxxxxx <mailto:ceph-users-leave@xxxxxxx>
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
