On 05/26/2017 07:12 PM, Sage Weil wrote:
> On Fri, 26 May 2017, Loic Dachary wrote:
>> On 05/26/2017 04:41 PM, Sage Weil wrote:
>>> On Fri, 26 May 2017, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> If weight sets are created and updated, either offline via "crush
>>>> optimize" (possible now) or via a ceph-mgr task (hopefully in the
>>>> future), adding and removing OSDs won't work via the ceph cli
>>>> (CrushWrapper errors out on create_or_move_item, which is what osd
>>>> crush create-or-move needs, for instance).
>>>>
>>>> Requiring a workflow where OSDs must be added to the crushmap instead
>>>> of the usual ceph osd crush create-or-move is impractical. Instead,
>>>> create_or_move_item should be modified to update the weight sets, if
>>>> any. What we could not figure out a few weeks ago is which values
>>>> make sense for a newly added OSD.
>>>>
>>>> Assuming the weight sets are updated via an incremental rebalancing
>>>> process, I think the weight set of a new OSD should simply be zero
>>>> for all positions and the target weight is set as usual. The next
>>>> time the rebalancing process runs, it will set the weight set to the
>>>> right value and backfilling will start. Or, if it proceeds
>>>> incrementally, it will gradually increase the weight set until it
>>>> reaches the optimal value. From the user perspective, the only
>>>> difference is that backfilling does not happen right away, it has to
>>>> wait for the next rebalancing update.
>>>
>>> This is interesting!  Gradually weighting the OSD in is often a
>>> good/desired thing anyway, so this is pretty appealing.  But,
>>>
>>>> Preparing an OSD to be decommissioned can be done by setting the
>>>> target weight to zero. The rebalancing process will (gradually or
>>>> not) set the weight set to zero and all PGs will move out of the OSD.
>>>
>>> more importantly, we need to make things like 'move' and 'reweight'
>>> work, too.
>>>
>>> I think we should assume that the choose_args weights are going to be
>>> incrementally different than the canonical weights, and have all of
>>> these crush modifications make a best-effort attempt to preserve them.
>>>
>>> - Remove is simple--it can just remove the entry for the removed item.
>>>
>>> - Add can either do zeros (as you suggest) or just use the canonical
>>> weight (and let subsequent optimization optimize).
>>>
>>> - Move is trickier, but I think the simplest is just to treat it as an
>>> add and remove.
>>
>> Yes. If adding an OSD always sets the weight set to zero it means that
>> moving an OSD from a bucket to another will
>>
>> a) move all the PGs it contains to other OSDs because its weight is
>> zero and
>>
>> b) receive other PGs at its new location when the weight is set to a
>> non-zero value
>>
>> If it turns out that the PGs in the new location are mostly the same
>> as the PGs in the old location, it's useless data movement. I can't
>> think of a case where that would happen though. Am I missing
>> something ?
>>
>>> - Similarly, when you add or move an item, the parent buckets' weights
>>> increase or decrease.  For those adjustments, I think the choose_args
>>> weights should be scaled proportionally.  (This is likely to be the
>>> "right" thing both for optimizations of the specific pgid inputs, and
>>> probably pretty close for the multipick-anomaly optimization too.)
>>>
>>> That leaves the 'add' behavior as the big question mark (should it
>>> start at 0 or at the canonical weight).  My inclination is to go with
>>> either the canonical weight or have an option to choose which you want
>>> (and have that start at the canonical weight).  It's not going to make
>>> sense to start at 0 until we have an automated mgr thing that does the
>>> optimization and throttles itself to move slowly.  Once that is in
>>> place then having things weight up from 0 makes a lot of sense (even
>>> as the default) but until then I don't think we can have a default
>>> behavior rely on an external optimization process being in place...
>>>
>>> What do you think?
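(For illustration only, a minimal Python sketch of the two 'add' policies
under discussion, on a simplified weight-set layout: one list of per-item
weights per replica position. The function name and data layout are
assumptions for the example, not the actual CrushWrapper API.)

# Minimal sketch, not the actual CrushWrapper code: append the weight-set
# entries for a newly added item under one of the two candidate policies.

def add_item_weight(weight_set, canonical_weight, start_at_zero):
    """weight_set: list of positions, each a list of per-item weights.
    start_at_zero: True  -> item starts at 0, the optimizer ramps it up;
                   False -> item starts at its canonical (target) weight,
                            the optimizer only applies small corrections."""
    initial = 0.0 if start_at_zero else canonical_weight
    for position in weight_set:
        position.append(initial)
    return weight_set

# A bucket with two items and two replica positions, already optimized.
bucket_weight_set = [[1.03, 0.97], [0.98, 1.02]]
add_item_weight(bucket_weight_set, canonical_weight=1.0, start_at_zero=False)
print(bucket_weight_set)   # [[1.03, 0.97, 1.0], [0.98, 1.02, 1.0]]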
>> I can't think of a scenario where someone would add weight_set (via
>> crush optimize or otherwise) and want a default weight set for a newly
>> added OSD other than zero.
>
> The problem I still see is that if you have something like
>
>   4 root
>    2 rack1
>     1 host1
>     1 host2
>    2 rack2
>     1 host3
>     1 host4
>
> and then move host3 into rack1, zeroing means you get
>
>   3 root
>    2 rack1
>     1 host1
>     1 host2
>     0 host3
>    1 rack2
>     1 host4
>
> ...which is going to put 33% more data on host{1,2,4} until the
> optimizer runs.  In that case, I think preserving host3's weight won't
> be perfect (different pgs)... but it will be much closer than 0!  In
> both cases, the actual data movement is going to be slow, so if the
> optimizer is looking at the 'up' set it can sort out the adjustments
> before much data actually moves.  But it will be a simpler optimization
> problem (only a few steps, probably) to go from 1.03 -> .98 (for
> example) than from 0 -> .98.
>
> In other words, I think your argument that 'the optimizer will fix it'
> cuts both ways: it can fix 0 or the canonical weight.  And I think that
> in almost all cases the canonical weight will be closer than 0?
>
>> If using crush optimize, rebalancing will start at the root of the
>> rule down to the bucket containing the new OSD. Since adding the OSD
>> modifies the weights up to the top of the hierarchy, it is likely to
>> need rebalancing. The most common case is an uneven distribution due
>> to a low number of PGs and we cannot predict which bucket will over-
>> or underfill. There is no predictable ratio at any level of the
>> hierarchy because the distribution is uneven in a random way. In
>> general the variance is lower than 50% but in some cases (CERN cluster
>> for instance) it is higher.
>
> The most predictable thing here is the canonical/target weight.  It
> won't be perfect, but it will be close.. that's why it's worked up
> until now.
>
>> In the case of the multipick anomaly we know that OSDs with the lowest
>> weights will overfill and those with higher weights will underfill,
>> and we could try to figure out a non-zero default weight that makes
>> sense, but that would be tricky.
>
> True.  In the case of move, preserving the old weights across the move
> will preserve some of this adjustment.  How close it is depends on
> whether the source and destination buckets are similarly structured.
> I think the main argument for forcing to 0 in this case is if it is a
> Very Bad Thing to overshoot.  And I don't think that's the case:
> backfill will stop moving data to a device before it fills up, and if
> we're talking about filling a device at all that is going to take a
> long time--1/2 a day at least, probably more--which should be plenty of
> time for the optimizer to come in and correct it.
>
>> If the weight sets are modified with another logic in mind, a zero
>> weight is not going to disturb it. A contrario, a default weight set
>> based on the target weight or the ratio between the target weight and
>> the weight set of other OSDs is unlikely to match whatever rationale
>> was used to set the weight set. And PGs will likely move back and
>> forth uselessly because of that.
>
> Either way PGs will move--they have to go somewhere:
>
> 1- If the moved item's weight is 0, then PGs will move to compensate
> for the reduced weight at the old location, and go to random other
> places in the hierarchy.  Probably *all* of those PGs will have to move
> back (or elsewhere) once the item's weight is optimized to its final
> value.
>
> 2- If the moved item's (canonical) weight is too high (higher than the
> optimized weight we don't know yet), then a few more PGs will move than
> should have, and will have to move back.  How many depends on how much
> higher we were; if it was 5% too high, then ~5% will move back.
>
> 3- If the moved item's (canonical) weight is too low, then you'll have
> a blend of the above.  But mostly case #2.
>
> So I think actually the useless PG movement will be lower with the
> canonical weight than with 0.
>
> (For a *new* OSD, starting at 0 makes sense, though!)
>
>> In other words, I came to realize that it is highly unlikely that
>> someone will try to manually set the weight set and that whatever tool
>> they are using, a zero weight will always be a sane default. Note that
>> this *only* applies to weight sets, if there are choose_args, and does
>> not otherwise change anything if choose_args are not set.
>
> Oh, right... what I was missing was that there is not really a case
> where users are manually setting these choose_args; if they are using
> them at all then we can assume they can deal with the fallout from a
> move.  Even so, I think the fallout is smaller with an approximation
> than with 0.
>
>> The big question mark for me is your intuition that choose_args
>> weights should be scaled proportionally. It would be great if you
>> could expand on that so I better understand it.
>
> Here I'm talking about the rack{1,2} weights.  For this example, say
> they are 1.96 and 2.04, respectively, before the move.  Afterwards,
> some but not all of the inputs have changed because rack1 still has
> host{1,2} and rack2 still has host4.  So if rack1's canonical weight
> goes 2 -> 3, that's a 50% increase, and we can scale 1.96 -> 2.94.
> That's likely to be close to the real optimal value (say, 2.97).
>
> ?
> s
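(For illustration only, the proportional scaling above written out as a
minimal Python sketch. The helper name and the plain-number representation
of a bucket's weight-set value are assumptions for the example, not the
actual crushmap structures.)

# Minimal sketch, not the actual CrushWrapper code: scale a parent bucket's
# choose_args weight by the same ratio as its canonical weight change, so
# the optimization already applied to it is roughly preserved.

def rescale_bucket_weight(weight_set_value, old_canonical, new_canonical):
    if old_canonical == 0:
        # nothing to preserve, fall back to the canonical weight
        return float(new_canonical)
    return weight_set_value * (new_canonical / old_canonical)

# The example above: rack1 had weight-set value 1.96 with canonical weight
# 2; moving host3 (weight 1) into it raises the canonical weight to 3.
print(rescale_bucket_weight(1.96, old_canonical=2.0, new_canonical=3.0))
# -> 2.94, close to the value the optimizer would converge to (~2.97)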
Thanks for taking the time to explain. The main source of confusion was
that I thought you suggested estimating weights for new OSDs. I did not
think carefully about the case where OSDs move around, and what you're
describing makes perfect sense. Cool :-)

I'll amend https://github.com/ceph/ceph/pull/15311 accordingly.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre