Re: Adding / removing OSDs with weight set


On 05/26/2017 07:12 PM, Sage Weil wrote:
> On Fri, 26 May 2017, Loic Dachary wrote:
>> On 05/26/2017 04:41 PM, Sage Weil wrote:
>>> On Fri, 26 May 2017, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> If weight sets are created and updated, either offline via "crush 
>>>> optimize" (possible now) or via a ceph-mgr task (hopefully in the 
>>>> future), adding and removing OSDs won't work via the ceph cli 
>>>> (CrushWrapper errors out in create_or_move_item, which is what osd crush 
>>>> create-or-move needs, for instance).
>>>>
>>>> Requiring a workflow where OSDs must be added by editing the crushmap 
>>>> directly, instead of via the usual ceph osd crush create-or-move, is 
>>>> impractical. Instead, create_or_move_item should be modified to update 
>>>> the weight sets, if any. What we could not figure out a few weeks ago is 
>>>> which values make sense for a newly added OSD.
>>>>
>>>> Assuming the weight sets are updated via an incremental rebalancing 
>>>> process, I think the weight set of a new OSD should simply be zero for 
>>>> all positions, and the target weight should be set as usual. The next 
>>>> time the rebalancing process runs, it will set the weight set to the 
>>>> right value and backfilling will start. Or, if it proceeds incrementally, 
>>>> it will gradually increase the weight set until it reaches the optimal 
>>>> value. From the user perspective, the only difference is that backfilling 
>>>> does not happen right away; it has to wait for the next rebalancing 
>>>> update.
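
As a side note, here is a toy sketch of what I mean, using made-up Python
dicts rather than the real CrushWrapper structures (add_new_osd and the dict
layout are hypothetical): the target weight is set as usual and every weight
set entry starts at zero.

    # Toy model: bucket_weights maps item id -> canonical (target) weight,
    # weight_sets maps choose_args id -> {item id -> per-position weights}.
    def add_new_osd(bucket_weights, weight_sets, osd_id, target_weight, positions):
        bucket_weights[osd_id] = target_weight   # target weight as usual
        for ws in weight_sets.values():
            ws[osd_id] = [0.0] * positions       # no PGs land here until the
                                                 # rebalancing process raises it
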
>>>
>>> This is interesting!  Gradually weighting the OSD in is often a 
>>> good/desired thing anyway, so this is pretty appealing.  But,
>>>
>>>> Preparing an OSD to be decommissioned can be done by setting the target 
>>>> weight to zero. The rebalancing process will (gradually or not) set the 
>>>> weight set to zero and all PGs will move out of the OSD.
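
And decommissioning would be the mirror image in the same toy model (again a
hypothetical helper, not actual CrushWrapper code): only the target weight
changes, and the weight sets are left for the rebalancing process to drain.

    def start_decommission(bucket_weights, osd_id):
        # Target weight goes to zero; existing weight-set entries are left
        # untouched so the rebalancing process can lower them afterwards.
        bucket_weights[osd_id] = 0.0
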
>>>
>>> more importantly, we need to make things like 'move' and 'reweight' work, 
>>> too.
>>>
>>> I think we should assume that the choose_args weights are going 
>>> to be incrementally different than the canonical weights, and make all of 
>>> these crush modifications make a best-effort attempt to preserve them.  
>>>
>>> - Remove is simple--it can just remove the entry for the removed item.
>>>
>>> - Add can either do zeros (as you suggest) or just use the canonical 
>>> weight (and let subsequent optimization optimize).
>>>
>>> - Move is trickier, but I think the simplest is just to treat it as an add 
>>> and remove.
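
In the same toy dict model as the sketches above (hypothetical helpers, a flat
item -> weights mapping per weight set, nothing like the real CrushWrapper
API), those three cases could look like:

    def remove_item(bucket_weights, weight_sets, item_id):
        # Remove: drop the item's entry from every weight set along with the item.
        bucket_weights.pop(item_id, None)
        for ws in weight_sets.values():
            ws.pop(item_id, None)

    def add_item(bucket_weights, weight_sets, item_id, target_weight,
                 positions, start_at_zero=False):
        # Add: either zeros or the canonical weight, repeated for each position.
        bucket_weights[item_id] = target_weight
        start = 0.0 if start_at_zero else target_weight
        for ws in weight_sets.values():
            ws[item_id] = [start] * positions

    def move_item(bucket_weights, weight_sets, item_id, target_weight, positions):
        # Move: simplest to treat as a remove followed by an add.
        remove_item(bucket_weights, weight_sets, item_id)
        add_item(bucket_weights, weight_sets, item_id, target_weight, positions)
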
>>
>> Yes. If adding an OSD always sets the weight set to zero, it means that 
>> moving an OSD from one bucket to another will
>>
>> a) move all the PGs it contains to other OSDs because its weight is zero, 
>> and
>>
>> b) receive other PGs at its new location when the weight is set to a 
>> non-zero value
>>
>> If it turns out that the PGs in the new location are mostly the same as 
>> the PGs in the old location, it's useless data movement. I can't think of 
>> a case where that would happen though. Am I missing something?
>>
>>> - Similarly, when you add or move an item, the parent buckets' weights 
>>> increase or decrease.  For those adjustments, I think the choose_args 
>>> weights should be scaled proportionally.  (This is likely to be the 
>>> "right" thing both for optimizations of the specific pgid inputs, and 
>>> probably pretty close for the multipick-anomaly optimization too.)
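
If I read this right, the scaling could be sketched like this (hypothetical
helper, none of the real CrushWrapper plumbing):

    def scale_bucket_weight_set(ws_value, old_canonical, new_canonical):
        # Scale the bucket's choose_args weight by the same ratio as the
        # change in its canonical weight; fall back to the canonical weight
        # if the bucket previously had zero weight (nothing to preserve).
        if old_canonical == 0:
            return float(new_canonical)
        return ws_value * (float(new_canonical) / old_canonical)
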
>>>
>>> That leaves the 'add' behavior as the big question mark (should it start 
>>> at 0 or at the canonical weight).  My inclination is to either go with the 
>>> canonical weight or have an option to choose which you want (and have it 
>>> default to the canonical weight).  It's not going to make sense to start 
>>> at 0 until we have an automated mgr thing that does the optimization and 
>>> throttles itself to move slowly.  Once that is in place, having things 
>>> weight up from 0 makes a lot of sense (even as the default), but until 
>>> then I don't think we can have the default behavior rely on an external 
>>> optimization process being in place...
>>>
>>> What do you think?
>>
>> I can't think of a scenario where someone would add weight sets (via 
>> crush optimize or otherwise) and want a default weight set for a newly 
>> added OSD other than zero.
> 
> The problem I still see is that if you have something like
> 
> 4  root
> 2   rack1
> 1     host1
> 1     host2
> 2   rack2
> 1     host3
> 1     host4
> 
> and then move host3 into rack1, zeroing means you get
> 
> 3  root
> 2   rack1
> 1     host1
> 1     host2
> 0     host3
> 1   rack2
> 1     host4
> 
> ...which is going to put 33% more data on host{1,2,4} until the optimizer 
> runs.  In that case, I think preserving host3's weight won't be perfect 
> (different pgs)... but it will be much closer than 0!  In both cases, the 
> actual data movement is going to be slow, so the optimizer is looking at 
> the 'up' set it can sort out the adjustments before much data actually 
> moves.  But it will be a simpler optimization problem (only a few steps, 
> probably) to go from 1.03 -> .98 (for example) than from 0 -> .98.
> 
> In other words, I think your argument that 'the optimizer will fix it' 
> cuts both ways: it can fix 0 or the canonical weight.  And I think that in 
> almost all cases the canonical weight will be closer than 0?
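
To make the 33% figure concrete, flattening the racks for simplicity:

    # The share of data each host holds is proportional to its effective weight.
    def shares(weights):
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items()}

    # host3 moved into rack1 with its weight set zeroed:
    print(shares({'host1': 1, 'host2': 1, 'host3': 0, 'host4': 1}))
    # -> host1/2/4 each hold ~0.33 instead of the 0.25 they should,
    #    i.e. about 33% more data, until the optimizer raises host3 again.
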
> 
>> If using crush optimize, rebalancing will start at the root of the rule 
>> and work down to the bucket containing the new OSD. Since adding the OSD 
>> modifies the weights all the way up to the top of the hierarchy, the whole 
>> tree is likely to need rebalancing. The most common case is an uneven 
>> distribution due to a low number of PGs, and we cannot predict which 
>> bucket will over- or under-fill. There is no predictable ratio at any 
>> level of the hierarchy because the distribution is uneven in a random way. 
>> In general the variance is lower than 50%, but in some cases (the CERN 
>> cluster for instance) it is higher.
> 
> The most predictable thing here is the canonical/target weight.  It won't 
> be perfect, but it will be close... that's why it has worked up until now.
> 
>> In the case of the multipick anomaly we know that OSDs with the lowest 
>> weights will overfill and those with higher weights will underfill. We 
>> could try to figure out a non-zero default weight that makes sense, but 
>> that would be tricky.
> 
> True.  In the case of move, preserving the old weights across the move 
> will preserve some of this adjustment.  How close it is depends on whether 
> the source and destination buckets are similarly structured.  I think the 
> main argument for forcing to 0 in this case is if it is a Very Bad Thing 
> to overshoot.  And I don't think that's the case: backfill will stop 
> moving data to a device before it fills up, and if we're talking about 
> filling a device at all that is going to take a long time--1/2 a day at 
> least, probably more--which should be plenty of time for the optimizer to 
> come in and correct it.
>>
>> If the weight sets are modified with another logic in mind, a zero weight 
>> is not going to disturb it. Conversely, a default weight set based on 
>> the target weight, or on the ratio between the target weight and the 
>> weight sets of other OSDs, is unlikely to match whatever rationale was 
>> used to set the weight set. And PGs will likely move back and forth 
>> uselessly because of that.
> 
> Either way PGs will move--they have to go somewhere:
> 
> 1- If the moved item's weight is 0, then PGs will move to compensate for 
> the reduced weight at the old location, and go to random other places in 
> the hierarchy.  Probably *all* of those PGs will have to move back (or 
> elsewhere) once the item's weight is optimized to its final value.
> 
> 2- If the moved item's (canonical) weight is too high (higher than the 
> optimized weight we don't know yet), then a few more PGs will move than 
> should have, and will have to move back.  How many depends on how much 
> higher we were; if it was 5% too high, then ~5% will move back.
> 
> 3- If the moved item's (canonical) weight is too low, then you'll have a 
> blend of the above.  But mostly case #2.
> 
> So I think actually the useless PG movement will be lower with the 
> canonical weight than with 0.
> 
> (For a *new* OSD, starting at 0 makes sense, though!)
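
A back-of-envelope way to see it, with made-up numbers:

    # Fraction of the moved item's PGs that end up moving twice.
    w_opt = 1.00                    # optimized weight we don't know yet
    w_can = 1.05                    # canonical weight, e.g. ~5% too high
    useless_if_zeroed = 1.0         # case 1: ~all the PGs leave, then come back
    useless_if_canonical = (w_can - w_opt) / w_can   # case 2: ~5% move back
    print(useless_if_zeroed, round(useless_if_canonical, 3))
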
> 
>> In other words, I came to realize that it is highly unlikely that 
>> someone will try to manually set the weight set, and that whatever tool 
>> they are using, a zero weight will always be a sane default. Note that 
>> this *only* applies to weight sets, i.e. when there are choose_args; it 
>> does not otherwise change anything if choose_args are not set.
> 
> Oh, right... what I was missing was that there is not really a case where 
> users are manually setting these choose_args; if they are using them at 
> all then we can assume they can deal with the fallout from a move.  Even 
> so, I think the fallout is smaller with an approximation than with 0.
> 
>> The big question mark for me is your intuition that choose_args weights 
>> should be scaled proportionally. It would be great if you could expand on 
>> that so I better understand it.
> 
> Here I'm talking about the rack{1,2} weights.  For this example, say they 
> are 1.96 and 2.04, respectively, before the move.  Afterwards, some but 
> not all of the inputs have changed because rack1 still has host{1,2} 
> and rack2 still has host4.  So if rack1's canonical weight goes 2 -> 3, 
> that's a 50% increase, and we can scale 1.96 -> 2.94.  That's likely to be 
> close to the real optimal value (say, 2.97).
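
Plugging the numbers in:

    rack1_before = 1.96             # rack1's choose_args weight before the move
    canonical_ratio = 3.0 / 2.0     # rack1's canonical weight goes 2 -> 3
    rack1_after = rack1_before * canonical_ratio
    print(round(rack1_after, 2))    # 2.94, close to the presumed optimum of ~2.97
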
> 
> ?
> s

Thanks for taking the time to explain. The main source of confusion was that I thought you were suggesting estimating weights for new OSDs. I had not thought carefully about the case where OSDs move around, and what you're describing makes perfect sense. Cool :-)

I'll amend https://github.com/ceph/ceph/pull/15311 accordingly.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre