On 22/11/2016, Sage Weil wrote:
> Hmm. What if the weights are alternatives based on num_rep, but
> rather based on r (or rather which replica we are choosing). So the
> first choice is based on the actual weights. The second choice is
> based on the r=2 weights. The third choice is based on r=3 weights.
>
> In a perfect world, those weights are actually a function of our
> previous selection(s); in this case, since that is not
> computationally feasible (right?), we can precalculate values that
> are "average" for possible prior choices. It'd be a bit more
> complicated, but assume for a moment that we could do this...

So! We /can/ do something like this. A 'cumulative' reweighting, where
we calibrate r=n as a correction on r=n-1.

The downsides would be, first, that there are fewer 'possible'
distributions if we skew per-round. (That is, there are distributions
we can achieve adjusting everything up-front that we can't achieve
going per-round.) I strongly suspect that as our maximum 'r' goes up,
the difference between possible all-at-once distributions and possible
per-round distributions gets bigger.

Second, currently every OSD has an equal chance of being the primary,
secondary, tertiary... Under this scheme the lowest-weight OSDs would
be the primary for a disproportionately large share of the objects
they hold. If we assume that low-weight OSDs are smaller machines and
primaries do more work than secondaries, this might be bad.

Third, it would be more computationally intensive, especially if we
want to support a large bucket where each OSD has a unique weight
shared by no other OSD. I don't know HOW intensive (and there's room
to try and make things smarter), but it might be unavoidably quite
long for large buckets.

If those don't seem too bad as tradeoffs, we can do something along
these lines and I'll get to hammering at it.
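
To make the per-round idea concrete, here is a toy sketch in Python. It
is not CRUSH: it assumes an idealised weighted draw without replacement
(each round picks among the not-yet-chosen OSDs in proportion to that
round's weights), and it enumerates every ordered choice, so it is only
usable for tiny buckets. The function names, the four-OSD bucket, and
the hand-picked r=2 weights are all made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def per_round_marginals(round_weights):
    """round_weights[k][i] is the weight of OSD i in round k (all > 0).

    Returns marg where marg[k][i] is the exact probability that OSD i
    is the k-th choice, under an idealised weighted draw without
    replacement (an assumption; real CRUSH retries on collisions).
    """
    rounds = len(round_weights)
    n = len(round_weights[0])
    marg = [[Fraction(0)] * n for _ in range(rounds)]
    # Enumerate every ordered selection of `rounds` distinct OSDs.
    for order in permutations(range(n), rounds):
        p = Fraction(1)
        chosen = []
        for k, osd in enumerate(order):
            # Weight still available in round k, excluding earlier picks.
            remaining = sum(round_weights[k][j] for j in range(n)
                            if j not in chosen)
            p *= Fraction(round_weights[k][osd], remaining)
            chosen.append(osd)
        for k, osd in enumerate(order):
            marg[k][osd] += p
    return marg


def totals(marg):
    """Expected number of replicas per OSD, summed over all rounds."""
    return [sum(col) for col in zip(*marg)]


base = [3, 2, 2, 1]                      # hypothetical OSD weights
target = [Fraction(2 * w, sum(base)) for w in base]   # ideal for 2 replicas

# Same weights in both rounds: the heaviest OSD ends up under-filled.
same = totals(per_round_marginals([base, base]))

# Hand-picked (not computed) r=2 weights that lean toward the heavy OSD.
skewed = totals(per_round_marginals([base, [4, 2, 2, 1]]))

for name, t in (("target", target), ("same", same), ("skewed", skewed)):
    print(name, [float(x) for x in t])
```

With the same weights in both rounds, the weight-3 OSD gets 19/28 of a
replica per object against a target of 3/4, and the weight-1 OSD gets
17/60 against a target of 1/4. The skewed second round moves both
closer to target. Finding good r=2 weights automatically, and doing so
without enumerating every ordering, is exactly the expensive part.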

It seems like a fairly large amount of complexity for the problem
being solved, but that all depends on how much the Distribution
Surprise is biting people, I guess.

I /suspect/ we will want to support the old, unmodified mode of
operation on a scale finer than a CRUSH tunable.

-- 
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9