On Fri, 26 May 2017, Loic Dachary wrote:
> On 05/26/2017 04:41 PM, Sage Weil wrote:
> > On Fri, 26 May 2017, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> If weight sets are created and updated, either offline via "crush
> >> optimize" (possible now) or via a ceph-mgr task (hopefully in the
> >> future), adding and removing OSDs won't work via the ceph cli
> >> (CrushWrapper errors out on create_or_move_item which is what osd crush
> >> create-or-move needs, for instance).
> >>
> >> Requiring a workflow where OSDs must be added to the crushmap instead of
> >> the usual ceph osd crush create-or-move is impractical. Instead,
> >> create_or_move_item should be modified to update the weight sets, if
> >> any. What we could not figure out a few weeks ago is which values make
> >> sense for a newly added OSD.
> >>
> >> Assuming the weight sets are updated via an incremental rebalancing
> >> process, I think the weight set of a new OSD should simply be zero for
> >> all positions and the target weight is set as usual. The next time the
> >> rebalancing process runs, it will set the weight set to the right value
> >> and backfilling will start. Or, if it proceeds incrementally, it will
> >> gradually increase the weight set until it reaches the optimal value.
> >> From the user perspective, the only difference is that backfilling does
> >> not happen right away, it has to wait for the next rebalancing update.
> >
> > This is interesting! Gradually weighting the OSD in is often a
> > good/desired thing anyway, so this is pretty appealing. But,
> >
> >> Preparing an OSD to be decommissioned can be done by setting the target
> >> weight to zero. The rebalancing process will (gradually or not) set the
> >> weight set to zero and all PGs will move out of the OSD.
> >
> > more importantly, we need to make things like 'move' and 'reweight' work,
> > too.
> >
> > I think we should assume that the choose_args weights are going
> > to be incrementally different than the canonical weights, and make all of
> > these crush modifications make a best-effort attempt to preserve them.
> >
> > - Remove is simple--it can just remove the entry for the removed item.
> >
> > - Add can either do zeros (as you suggest) or just use the canonical
> > weight (and let subsequent optimization optimize).
> >
> > - Move is trickier, but I think the simplest is just to treat it as an add
> > and remove.
>
> Yes. If adding an OSD always sets the weight set to zero it means that
> moving an OSD from a bucket to another will
>
> a) move all the PGs it contains to other OSDs because its weight is zero
> and
> b) receive other PGs at its new location when the weight is set to a non
> zero value
>
> If it turns out that the PGs in the new location are mostly the same
> as the PGs in the old location, it's useless data movement. I can't
> think of a case where that would happen though. Am I missing something ?
>
> > - Similarly, when you add or move an item, the parent buckets' weights
> > increase or decrease. For those adjustments, I think the choose_args
> > weights should be scaled proportionally. (This is likely to be the
> > "right" thing both for optimizations of the specific pgid inputs, and
> > probably pretty close for the multipick-anomaly optimization too.)
> >
> > That leaves the 'add' behavior as the big question mark (should it start
> > at 0 or at the canonical weight). My inclination is to go with either the
> > canonical weight or have an option to choose which you want (and have that
> > start at the canonical weight). It's not going to make sense to start at
> > 0 until we have an automated mgr thing that does the optimization and
> > throttles itself to move slowly.
> > Once that is in place then having things
> > weight up from 0 makes a lot of sense (even as the default) but until then
> > I don't think we can have a default behavior rely on an external
> > optimization process being in place...
> >
> > What do you think?
>
> I can't think of a scenario where someone would add weight_set (via
> crush optimize or otherwise) and want a default weight set for a newly
> added OSD other than zero.

The problem I still see is that if you have something like

 4 root
  2 rack1
   1 host1
   1 host2
  2 rack2
   1 host3
   1 host4

and then move host3 into rack1, zeroing means you get

 3 root
  2 rack1
   1 host1
   1 host2
   0 host3
  1 rack2
   1 host4

...which is going to put 33% more data on host{1,2,4} until the optimizer
runs. In that case, I think preserving host3's weight won't be perfect
(different pgs)... but it will be much closer than 0!

In both cases, the actual data movement is going to be slow, so if the
optimizer is looking at the 'up' set it can sort out the adjustments before
much data actually moves. But it will be a simpler optimization problem
(only a few steps, probably) to go from 1.03 -> .98 (for example) than from
0 -> .98.

In other words, I think your argument that 'the optimizer will fix it' cuts
both ways: it can fix 0 or the canonical weight. And I think that in almost
all cases the canonical weight will be closer than 0?

> If using crush optimize, rebalancing will start at the root of the rule
> down to the bucket containing the new OSD. Since adding the OSD modifies
> the weight to the top of the hierarchy, it is likely to need
> rebalancing. The most common case is an uneven distribution due to a low
> number of PGs and we cannot predict which bucket will over or under
> fill. There is no predictable ratio at any level of the hierarchy
> because the distribution is uneven in a random way. In general the
> variance is lower than 50% but in some cases (CERN cluster for instance)
> it is higher.

The most predictable thing here is the canonical/target weight. It won't be
perfect, but it will be close... that's why it's worked up until now.

> In the case of the multipick anomaly we know that OSDs
> with the lowest weights will overfill and those with higher weights will
> underfill and we could try to figure out a non zero default weight that
> makes sense but that would be tricky.

True. In the case of move, preserving the old weights across the move will
preserve some of this adjustment. How close it is depends on whether the
source and destination buckets are similarly structured.

I think the main argument for forcing to 0 in this case is if it is a Very
Bad Thing to overshoot. And I don't think that's the case: backfill will
stop moving data to a device before it fills up, and if we're talking about
filling a device at all that is going to take a long time--1/2 a day at
least, probably more--which should be plenty of time for the optimizer to
come in and correct it.

> If the weight sets are modified with another logic in mind, a zero weight
> is not going to disturb it. Conversely, a default weight set based on
> the target weight or the ratio between the target weight and the weight
> set of other OSDs is unlikely to match whatever rationale was used to
> set the weight set. And PGs will likely move back and forth uselessly
> because of that.

Either way PGs will move--they have to go somewhere:

1- If the moved item's weight is 0, then PGs will move to compensate for
the reduced weight at the old location, and go to random other places in
the hierarchy. Probably *all* of those PGs will have to move back (or
elsewhere) once the item's weight is optimized to its final value.

2- If the moved item's (canonical) weight is too high (higher than the
optimized weight we don't know yet), then a few more PGs will move than
should have, and will have to move back. How many depends on how much
higher we were; if it was 5% too high, then ~5% will move back.

3- If the moved item's (canonical) weight is too low, then you'll have a
blend of the above. But mostly case #2.

So I think actually the useless PG movement will be lower with the
canonical weight than with 0. (For a *new* OSD, starting at 0 makes sense,
though!)

> In other words, I came to realize that it is highly unlikely that
> someone will try to manually set the weight set and that whatever tool
> they are using, a zero weight will always be a sane default. Note that
> this *only* applies to weight set, if there are choose args and does not
> otherwise change anything if choose args are not set.

Oh, right... What I was missing was that there is not really a case where
users are manually setting these choose_args; if they are using them at all
then we can assume they can deal with the fallout from a move. Even so, I
think the fallout is smaller with an approximation than with 0.

> The big question mark for me is your intuition that choose_args weights
> should be scaled proportionally. It would be great if you could expand on
> that so I better understand it.

Here I'm talking about the rack{1,2} weights. For this example, say they
are 1.96 and 2.04, respectively, before the move. Afterwards, some but not
all of the inputs have changed because rack1 still has host{1,2} and rack2
still has host4. So if rack1's canonical weight goes 2 -> 3, that's a 50%
increase, and we can scale 1.96 -> 2.94. That's likely to be close to the
real optimal value (say, 2.97).

?

s
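
The arithmetic in the host3 example and in the proportional-scaling
suggestion can be checked with a small sketch. This is not CrushWrapper code:
the helper names (expected_shares, scale_weight_set) are hypothetical, and
the model is idealized, assuming each host receives data in exact proportion
to its weight and ignoring the uneven-distribution and multipick effects
discussed in the thread.

```python
def expected_shares(buckets):
    """Idealized data share per leaf item: its weight over the total weight.

    `buckets` maps a bucket name to {item name: weight}. Real CRUSH
    placement deviates from this (few PGs, multipick anomaly).
    """
    total = sum(w for items in buckets.values() for w in items.values())
    return {
        item: w / total
        for items in buckets.values()
        for item, w in items.items()
    }


def scale_weight_set(ws_weight, old_canonical, new_canonical):
    """Scale a bucket's weight-set value by its canonical weight ratio."""
    if old_canonical == 0:
        # Assumption: with nothing to scale from, use the canonical weight.
        return new_canonical
    return ws_weight * new_canonical / old_canonical


# Layout before the move, then host3 moved into rack1 with weight 0,
# then host3 moved into rack1 keeping its weight of 1.
before = expected_shares({"rack1": {"host1": 1, "host2": 1},
                          "rack2": {"host3": 1, "host4": 1}})
zeroed = expected_shares({"rack1": {"host1": 1, "host2": 1, "host3": 0},
                          "rack2": {"host4": 1}})
kept = expected_shares({"rack1": {"host1": 1, "host2": 1, "host3": 1},
                        "rack2": {"host4": 1}})

increase = zeroed["host1"] / before["host1"] - 1
unchanged = kept["host1"] / before["host1"] - 1
print(f"host1 with host3 zeroed: {increase:.0%} more data")
print(f"host1 with host3 kept:   {unchanged:.0%} more data")

# rack1's canonical weight goes 2 -> 3, rack2's goes 2 -> 1.
print(round(scale_weight_set(1.96, 2, 3), 2))
print(round(scale_weight_set(2.04, 2, 1), 2))
```

Under this model, zeroing host3 shifts about a third more data onto each
remaining host, while keeping its weight leaves host1's share unchanged, and
scaling 1.96 by the 2 -> 3 change gives 2.94 as in the example above.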