Re: [ceph-users] jj's "improved" ceph balancer

Hi Jonas,

I have some comments -
IMHO you should swap 3 & 4 - if the PGs are not split optimally between the OSDs per pool, the primary balancing will not help, so I believe 4 is more important than 3.
There is also a practical reason for this: after constraints 1, 2 and 3 are fulfilled, we can implement 4 just by changing the order of the OSDs inside the PGs (at least for replicated pools), which is a cheap operation since it is only an upmap change and does not require any data movement.
Regards,

Josh


On Mon, Oct 25, 2021 at 9:01 PM Jonas Jelten <jelten@xxxxxxxxx> wrote:
Hi Josh,

yes, there are many factors to optimize... which makes it rather hard to achieve an optimal solution.

I think we have to consider all these things, in ascending priority:

* 1: Minimize distance to CRUSH (prefer fewest upmaps, and remove upmap items if balance is better)
* 2: Relocation of PGs in remapped state (since they are not fully moved yet, hence 'easier' to relocate)
* 3: Per-Pool PG distribution, respecting OSD device size -> ideal_pg_count = osd_size * (pg_num / sum(possible_osd_sizes)) (sketched below)
* 4: Primary/EC-N distribution (all osds have equal primary/EC-N counts, for workload balancing, not respecting device size (for hdd at least?), else this is just 3)
* 5: Capacity balancing (all osds equally full)
* 6: And of course CRUSH constraints

Beautiful optimization problem, which could be fed into a solver :)
My approach currently optimizes for 3, 5, 6, iteratively...
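
To make the per-pool target from 3 concrete, here is a minimal sketch (my own illustration, not the balancer's actual code; the names osd_crush_sizes and pool_size are made up):

    # Ideal number of PG shards of one pool per OSD, proportional to device size.
    # Purely illustrative; a real tool would take the sizes from the CRUSH map.
    def ideal_pg_count(osd_crush_sizes, pg_num, pool_size=1):
        # osd_crush_sizes: {osd_id: device size / CRUSH weight} for the OSDs the
        # pool can use; set pool_size to the pool's replica count (or k+m for EC)
        # to count PG shards instead of PGs.
        total = sum(osd_crush_sizes.values())
        return {osd: size * (pg_num * pool_size / total)
                for osd, size in osd_crush_sizes.items()}

    # 4 OSDs, two of them twice as big, one pool with pg_num=32 and size=3:
    print(ideal_pg_count({0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0}, pg_num=32, pool_size=3))
    # -> {0: 16.0, 1: 16.0, 2: 32.0, 3: 32.0}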

> My only comment about what you did is that it should somehow work pool by pool and manage the +-1 globally.

I think this is already implemented!
Since in each iteration I pick the "fullest" device first, it has to have more PGs of some pool (or more data) than the other OSDs (e.g. through a +1), and we try to migrate a PG off it.
And we only migrate a particular PG of a pool from such a source OSD if that OSD holds more than ideal_amount_for_pool PGs of the pool (a float, hence we allow moving +1s or worse).
Same for a destination OSD: it is only selected if it holds fewer PGs of that pool than ideal_amount_for_pool (again a float, hence allowing it to become a +1 but not more).
So we eliminate global imbalance, and respect equal PG distribution per pool.
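
In (simplified, illustrative) code, the source/destination check looks roughly like this - the names and data layout are made up, and the real code of course also has to respect the CRUSH constraints (6):

    # Pick a (source, destination) OSD pair for moving one PG of `pool`:
    # fullest OSD above its per-pool ideal -> emptiest OSD below its ideal.
    def pick_move(pg_counts, ideals, utilization, pool):
        sources = sorted((o for o in pg_counts
                          if pg_counts[o].get(pool, 0) > ideals[o][pool]),
                         key=lambda o: utilization[o], reverse=True)
        dests = sorted((o for o in pg_counts
                        if pg_counts[o].get(pool, 0) < ideals[o][pool]),
                       key=lambda o: utilization[o])
        if sources and dests:
            return sources[0], dests[0]
        return None

    # toy example: osd.0 has one PG too many of pool 1, osd.2 one too few
    pg_counts   = {0: {1: 5}, 1: {1: 4}, 2: {1: 3}}
    ideals      = {0: {1: 4.0}, 1: {1: 4.0}, 2: {1: 4.0}}
    utilization = {0: 0.71, 1: 0.55, 2: 0.42}
    print(pick_move(pg_counts, ideals, utilization, pool=1))   # -> (0, 2)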

I can try to hack in (optional) constraints so it also supports optimization 4, but this works very much against the CRUSH placement (because we'd have to ignore OSD size).
But since this basically bypasses CRUSH weights, it could also be done by placing all the desired devices in a custom CRUSH hierarchy with identically weighted buckets (even though that "wastes" storage).
Then we don't have to fight CRUSH, and it becomes a 'simple' optimization 3 again.

To achieve 2 and 1, it's just a re-ordering of the candidate PGs.
So in theory it should be doable™.

-- Jonas


On 25/10/2021 11.12, Josh Salomon wrote:
> Hi Jonas,
>
> I want to clarify a bit my thoughts (it may be long) regarding balancing in general.
>
> 1 - Balancing the capacity correctly is the top priority; this is because we all know the system is only as full as its fullest device, and as a storage system we can't allow a large amount of capacity to be wasted and unusable. This is a top functional requirement.
> 2 - Workload balancing is a performance requirement, and an important one, but we should not optimize workload at the expense of capacity, so the challenge is how to do both simultaneously. (Hint: it is not always possible, and when it is not possible the system performs below the aggregated performance of its devices.)
>
> Assumption 1: Per pool, the workload on a PG is linear with its capacity, which means either all PGs have the same workload (#PGs is a power of 2) or some PGs have exactly twice the load of the others. From now on I will assume the number of PGs is a power of 2, since the adjustments to the other case are pretty simple.
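>
> As a quick illustration of the non-power-of-2 case (a throwaway sketch, pg_shares is just a made-up helper): with pg_num = 12, 8 PGs each own 1/16 of the hash space and the other 4 own 1/8, i.e. twice as much:
>
>     def pg_shares(pg_num):
>         n = 1 << (pg_num.bit_length() - 1)   # largest power of two <= pg_num
>         split = 2 * (pg_num - n)             # PGs that have already been split once
>         return [1.0 / (2 * n)] * split + [1.0 / n] * (2 * n - pg_num)
>
>     shares = pg_shares(12)
>     print(shares.count(1 / 16), shares.count(1 / 8), sum(shares))   # -> 8 4 1.0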
>
> Conclusion 1: Balancing capacity based on all the PGs in the system may cause workload imbalance - balancing capacity should be done on a pool-by-pool basis. (Assume 2 pools, H(ot) and C(old), with exactly the same settings (#PGs, capacity and protection scheme). If you balance per-PG capacity only, you can end up with one device holding all the PGs from the C pool and another holding all the PGs from the H pool -
> this will leave the second device fully loaded while the first device is idle.)
>
> On the other hand, your point about the +-1 PGs when working on a pool-by-pool basis is correct and should be fixed.
>
> When all the devices are identical, the other thing we need to do to balance the workload is balance the primaries (on a pool-by-pool basis) - this means that when the capacity is balanced (every OSD has the same number of PGs per pool), every OSD also has the same number of primaries (+-1) per pool. This is mainly important for replicated pools; for EC pools it is important (but less
> critical) when working without "fast read" mode, and has no effect on EC pools with "fast read" mode enabled. (For EC pools we need to balance the N OSDs out of N+K and not only the primaries - think of replica-3 as a special case of EC with 1+2.)
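>
> For the replicated case the target is easy to state (a rough sketch with made-up names; for EC you would count the first N OSDs of each acting set instead of just the first one):
>
>     from collections import Counter
>
>     def primary_counts(acting_sets):
>         # acting_sets: {pg_id: ordered list of OSDs}; index 0 is the primary
>         return Counter(osds[0] for osds in acting_sets.values())
>
>     # 3 PGs of one pool, capacity already balanced, but osd.0 is always primary:
>     acting = {0: [0, 1, 2], 1: [0, 2, 1], 2: [0, 1, 2]}
>     print(primary_counts(acting))   # -> Counter({0: 3})
>     # reordering PG 1 to [2, 0, 1] and PG 2 to [1, 0, 2] gives every OSD exactly
>     # one primary without moving any data - it is only a reordering of the acting set.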
>
> Now what happens when the devices are not identical - 
> In the case of mixing technologies (SSD and HDD) - (this is not recommended, but you can see some use cases for it in my SDC presentation <https://www.youtube.com/watch?v=dz53aH2XggE&feature=emb_imp_woyt>) - without going into deep detail, the easiest solution is to make all the faster (I mean much faster, such as HDD/SSD or SSD/PM) devices always primaries and all the slow devices never primaries
> (assuming you always keep at least one copy on a fast device). More on this in the presentation.
>
> The last case is when there are relatively minor performance differences between the devices (HDDs with different RPM rates, or devices of the same technology but not the same size, though not a huge difference - I believe that when one device has X times the capacity of the others and X > replica-count, we can't balance any more, but I need to complete my calculations). In these cases, assuming we know
> something about the workload (R/W ratio), we can balance the workload by giving more primaries to the faster or smaller devices relative to the slower or larger devices. This may not be optimal, but it can improve performance; obviously it will not work for write-only workloads, but the higher the ratio of reads, the more it helps.
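>
> As a very rough sketch of that idea (made-up names, and ignoring the R/W ratio itself): give each OSD a number of primaries proportional to some per-device speed score instead of an equal share:
>
>     def primary_targets(pg_num, speed):
>         total = sum(speed.values())
>         return {osd: pg_num * s / total for osd, s in speed.items()}
>
>     # osd.2 sits on a faster (or smaller, hence less loaded) device:
>     print(primary_targets(pg_num=32, speed={0: 1.0, 1: 1.0, 2: 2.0}))
>     # -> {0: 8.0, 1: 8.0, 2: 16.0}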
>
> So to summarize - we first need to balance capacity as perfectly as possible, but if we care about performance we should make sure that the capacity of each pool is balanced almost perfectly. Then we change the primaries based on the devices we have and on the per-pool workloads in order to split the workload evenly among the devices. When there is large variance among the devices in the same pool,
> perfect workload balancing may not be achievable, but we can try to find an optimal one for the configuration and workload we have.
>
> Having said all that - I really appreciate your work, and I went briefly over it. My only comment about what you did is that it should somehow work pool by pool and manage the +-1 globally. 
>
> Regards,
>
> Josh

