"... or simply have a single global set" No. Proof by example: I once attempted to perfectly balance cluster X by modifying crush weights. Pool A spanned over 352 OSDs (set A) Pool B spanned over 176 OSDs (set B, half of A) The result (simulated perfect balance) was that obtained weights had - small variance for B (5%), - small variance for A-B (5%). - huge variance for A (800%) This was of course because crush had to be strongly discouraged to pick from B, when performing placement for A. "...crush users can choose..." For each pool there is only one vector of weights that will provide perfect balance. (math note: actually multiple of them, but different by scale) I cannot at the moment imagine any other practical metrics other then balancing. But maybe it is just failure of imagination. On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > On Mon, 27 Mar 2017, Adam Kupczyk wrote: >> Hi, >> >> My understanding is that optimal tweaked weights will depend on: >> 1) pool_id, because of rjenkins(pool_id) in crush >> 2) number of placement groups and replication factor, as it determines >> amount of samples >> >> Therefore tweaked weights should rather be property of instantialized pool, >> not crush placement definition. >> >> If tweaked weights are to be part of crush definition, than for each >> created pool we need to have separate list of weights. >> Is it possible to provide clients with different weights depending on on >> which pool they want to operate? > > As Loic suggested, you can create as many derivative hierarchies in the > crush map as you like, potentially one per pool. Or you could treat the > sum total of all pgs as the interesting set, balance those, and get some > OSDs doing a bit more of one pool than another. The new post-CRUSH OSD > remap capability can always clean this up (and turn a "good" crush > distribution into a perfect distribution). > > I guess the question is: when we add the explicit adjusted weight matrix > to crush should we have multiple sets of weights (perhaps one for each > pool), or simply have a single global set. It might make sense to allow N > sets of adjusted weights so that the crush users can choose a particular > set of them for different pools (or whatever it is they're calculating the > mapping for).. > > sage > > >> >> Best regards, >> Adam >> >> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@xxxxxxxxxxxx> wrote: >> > Hi, >> > >> > My understanding is that optimal tweaked weights will depend on: >> > 1) pool_id, because of rjenkins(pool_id) in crush >> > 2) number of placement groups and replication factor, as it determines >> > amount of samples >> > >> > Therefore tweaked weights should rather be property of instantialized pool, >> > not crush placement definition. >> > >> > If tweaked weights are to be part of crush definition, than for each created >> > pool we need to have separate list of weights. >> > Is it possible to provide clients with different weights depending on on >> > which pool they want to operate? >> > >> > Best regards, >> > Adam >> > >> > >> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote: >> >> >> >> >> >> >> >> On 03/27/2017 04:33 AM, Sage Weil wrote: >> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote: >> >> >> Hello Sage, Loic, Pedro, >> >> >> >> >> >> >> >> >> I am certain that almost perfect mapping can be achieved by >> >> >> substituting weights from crush map with slightly modified weights. >> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly >> >> >> proportional to weights specified in crush map. >> >> >> >> >> >> 1. Example >> >> >> Lets think of PGs of single object pool. >> >> >> We have OSDs with following weights: >> >> >> [10, 10, 10, 5, 5] >> >> >> >> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600 >> >> >> PGcopies : >> >> >> [150, 150, 150, 75, 75] >> >> >> >> >> >> However, because crush simulates random process we have: >> >> >> [143, 152, 158, 71, 76] >> >> >> >> >> >> We could have obtained perfect distribution had we used weights like >> >> >> this: >> >> >> [10.2, 9.9, 9.6, 5.2, 4.9] >> >> >> >> >> >> >> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights >> >> >> >> >> >> When we apply crush for the first time, distribution of PGs comes as >> >> >> random. >> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76] >> >> >> >> >> >> But CRUSH is not random proces at all, it behaves in numerically stable >> >> >> way. >> >> >> Specifically, if we increase weight on one node, we will get more PGs >> >> >> on >> >> >> this node and less on every other node: >> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76] >> >> >> >> >> >> Now, finding ideal weights can be done by any numerical minimization >> >> >> method, >> >> >> for example NLMS. >> >> >> >> >> >> >> >> >> 3. The proposal >> >> >> For each pool, from initial weights given in crush map perfect weights >> >> >> will >> >> >> be derived. >> >> >> This weights will be used to calculate PG distribution. This of course >> >> >> will >> >> >> be close to perfect. >> >> >> >> >> >> 3a: Downside when OSD is out >> >> >> When an OSD is out, missing PG copies will be replicated elsewhere. >> >> >> Because now weights deviate from OSD capacity, some OSDs will >> >> >> statistically >> >> >> get more copies then they should. >> >> >> This unevenness in distribution is proportional to scale of deviation >> >> >> of >> >> >> calculated weights to capacity weights. >> >> >> >> >> >> 3b: Upside >> >> >> This all can be achieved without changes to crush. >> >> > >> >> > Yes! >> >> > >> >> > And no. You're totally right--we should use an offline optimization to >> >> > tweak the crush input weights to get a better balance. It won't be >> >> > robust >> >> > to changes to the cluster, but we can incrementally optimize after that >> >> > happens to converge on something better. >> >> > >> >> > The problem with doing this with current versions of Ceph is that we >> >> > lose >> >> > the original "input" or "target" weights (i.e., the actual size of >> >> > the OSD) that we want to converge on. This is one reason why we haven't >> >> > done something like this before. >> >> > >> >> > In luminous we *could* work around this by storing those canonical >> >> > weights outside of crush using something (probably?) ugly and >> >> > maintain backward compatibility with older clients using existing >> >> > CRUSH behavior. >> >> >> >> These canonical weights could be stored in crush by creating dedicated >> >> buckets. For instance the root-canonical bucket could be created to store >> >> the canonical weights of the root bucket. The sysadmin needs to be aware of >> >> the difference and know to add a new device in the host01-canonical bucket >> >> instead of the host01 bucket. And to run an offline tool to keep the two >> >> buckets in sync and compute the weight to use for placement derived from the >> >> weights representing the device capacity. >> >> >> >> It is a little bit ugly ;-) >> >> >> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach >> >> > that Pedro is working on works out, we'll want to extend the CRUSH map >> >> > to >> >> > include a set of derivative weights used for actual placement >> >> > calculations >> >> > instead of the canonical target weights, and we can do what you're >> >> > proposing *and* solve the multipick problem with one change in the crush >> >> > map and algorithm. (Actually choosing those derivative weights will >> >> > be an offline process that can both improve the balance for the inputs >> >> > we >> >> > care about *and* adjust them based on the position to fix the skew issue >> >> > for replicas.) This doesn't help pre-luminous clients, but I think the >> >> > end solution will be simpler and more elegant... >> >> > >> >> > What do you think? >> >> > >> >> > sage >> >> > >> >> > >> >> >> 4. Extra >> >> >> Some time ago I made such change to perfectly balance Thomson-Reuters >> >> >> cluster. >> >> >> It succeeded. >> >> >> A solution was not accepted, because modification of OSD weights were >> >> >> higher >> >> >> then 50%, which was caused by fact that different placement rules >> >> >> operated >> >> >> on different sets of OSDs, and those sets were not disjointed. >> >> > >> >> > >> >> >> >> >> >> Best regards, >> >> >> Adam >> >> >> >> >> >> >> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: >> >> >> Hi Pedro, Loic, >> >> >> >> >> >> For what it's worth, my intuition here (which has had a mixed >> >> >> record as >> >> >> far as CRUSH goes) is that this is the most promising path >> >> >> forward. >> >> >> >> >> >> Thinking ahead a few steps, and confirming that I'm following >> >> >> the >> >> >> discussion so far, if you're able to do get black (or white) box >> >> >> gradient >> >> >> descent to work, then this will give us a set of weights for >> >> >> each item in >> >> >> the tree for each selection round, derived from the tree >> >> >> structure and >> >> >> original (target) weights. That would basically give us a map >> >> >> of item id >> >> >> (bucket id or leave item id) to weight for each round. i.e., >> >> >> >> >> >> map<int, map<int, float>> weight_by_position; // position -> >> >> >> item -> weight >> >> >> >> >> >> where the 0 round would (I think?) match the target weights, and >> >> >> each >> >> >> round after that would skew low-weighted items lower to some >> >> >> degree. >> >> >> Right? >> >> >> >> >> >> The next question I have is: does this generalize from the >> >> >> single-bucket >> >> >> case to the hierarchy? I.e., if I have a "tree" (single bucket) >> >> >> like >> >> >> >> >> >> 3.1 >> >> >> |_____________ >> >> >> | \ \ \ >> >> >> 1.0 1.0 1.0 .1 >> >> >> >> >> >> it clearly works, but when we have a multi-level tree like >> >> >> >> >> >> >> >> >> 8.4 >> >> >> |____________________________________ >> >> >> | \ \ >> >> >> 3.1 3.1 2.2 >> >> >> |_____________ |_____________ |_____________ >> >> >> | \ \ \ | \ \ \ | \ \ \ >> >> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1 >> >> >> >> >> >> and the second round weights skew the small .1 leaves lower, can >> >> >> we >> >> >> continue to build the summed-weight hierarchy, such that the >> >> >> adjusted >> >> >> weights at the higher level are appropriately adjusted to give >> >> >> us the >> >> >> right probabilities of descending into those trees? I'm not >> >> >> sure if that >> >> >> logically follows from the above or if my intuition is >> >> >> oversimplifying >> >> >> things. >> >> >> >> >> >> If this *is* how we think this will shake out, then I'm >> >> >> wondering if we >> >> >> should go ahead and build this weigh matrix into CRUSH sooner >> >> >> rather >> >> >> than later (i.e., for luminous). As with the explicit >> >> >> remappings, the >> >> >> hard part is all done offline, and the adjustments to the CRUSH >> >> >> mapping >> >> >> calculation itself (storing and making use of the adjusted >> >> >> weights for >> >> >> each round of placement) are relatively straightforward. And >> >> >> the sooner >> >> >> this is incorporated into a release the sooner real users will >> >> >> be able to >> >> >> roll out code to all clients and start making us of it. >> >> >> >> >> >> Thanks again for looking at this problem! I'm excited that we >> >> >> may be >> >> >> closing in on a real solution! >> >> >> >> >> >> sage >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote: >> >> >> >> >> >> > There are lot of gradient-free methods. I will try first to >> >> >> run the >> >> >> > ones available using just scipy >> >> >> > >> >> >> >> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html). >> >> >> > Some of them don't require the gradient and some of them can >> >> >> estimate >> >> >> > it. The reason to go without the gradient is to run the CRUSH >> >> >> > algorithm as a black box. In that case this would be the >> >> >> pseudo-code: >> >> >> > >> >> >> > - BEGIN CODE - >> >> >> > def build_target(desired_freqs): >> >> >> > def target(weights): >> >> >> > # run a simulation of CRUSH for a number of objects >> >> >> > sim_freqs = run_crush(weights) >> >> >> > # Kullback-Leibler divergence between desired >> >> >> frequencies and >> >> >> > current ones >> >> >> > return loss(sim_freqs, desired_freqs) >> >> >> > return target >> >> >> > >> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs)) >> >> >> > - END CODE - >> >> >> > >> >> >> > The tricky thing here is that this procedure can be slow if >> >> >> the >> >> >> > simulation (run_crush) needs to place a lot of objects to get >> >> >> accurate >> >> >> > simulated frequencies. This is true specially if the minimize >> >> >> method >> >> >> > attempts to approximate the gradient using finite differences >> >> >> since it >> >> >> > will evaluate the target function a number of times >> >> >> proportional to >> >> >> > the number of weights). Apart from the ones in scipy I would >> >> >> try also >> >> >> > optimization methods that try to perform as few evaluations as >> >> >> > possible like for example HyperOpt >> >> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes >> >> >> into >> >> >> > account that the target function can be noisy. >> >> >> > >> >> >> > This black box approximation is simple to implement and makes >> >> >> the >> >> >> > computer do all the work instead of us. >> >> >> > I think that this black box approximation is worthy to try >> >> >> even if >> >> >> > it's not the final one because if this approximation works >> >> >> then we >> >> >> > know that a more elaborate one that computes the gradient of >> >> >> the CRUSH >> >> >> > algorithm will work for sure. >> >> >> > >> >> >> > I can try this black box approximation this weekend not on the >> >> >> real >> >> >> > CRUSH algorithm but with the simple implementation I did in >> >> >> python. If >> >> >> > it works it's just a matter of substituting one simulation >> >> >> with >> >> >> > another and see what happens. >> >> >> > >> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>: >> >> >> > > Hi Pedro, >> >> >> > > >> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote: >> >> >> > >> Hi Loic, >> >> >> > >> >> >> >> > >>>From what I see everything seems OK. >> >> >> > > >> >> >> > > Cool. I'll keep going in this direction then ! >> >> >> > > >> >> >> > >> The interesting thing would be to >> >> >> > >> test on some complex mapping. The reason is that >> >> >> "CrushPolicyFamily" >> >> >> > >> is right now modeling just a single straw bucket not the >> >> >> full CRUSH >> >> >> > >> algorithm. >> >> >> > > >> >> >> > > A number of use cases use a single straw bucket, maybe the >> >> >> majority of them. Even though it does not reflect the full range >> >> >> of what crush can offer, it could be useful. To be more >> >> >> specific, a crush map that states "place objects so that there >> >> >> is at most one replica per host" or "one replica per rack" is >> >> >> common. Such a crushmap can be reduced to a single straw bucket >> >> >> that contains all the hosts and by using the CrushPolicyFamily, >> >> >> we can change the weights of each host to fix the probabilities. >> >> >> The hosts themselves contain disks with varying weights but I >> >> >> think we can ignore that because crush will only recurse to >> >> >> place one object within a given host. >> >> >> > > >> >> >> > >> That's the work that remains to be done. The only way that >> >> >> > >> would avoid reimplementing the CRUSH algorithm and >> >> >> computing the >> >> >> > >> gradient would be treating CRUSH as a black box and >> >> >> eliminating the >> >> >> > >> necessity of computing the gradient either by using a >> >> >> gradient-free >> >> >> > >> optimization method or making an estimation of the >> >> >> gradient. >> >> >> > > >> >> >> > > By gradient-free optimization you mean simulated annealing >> >> >> or Monte Carlo ? >> >> >> > > >> >> >> > > Cheers >> >> >> > > >> >> >> > >> >> >> >> > >> >> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>: >> >> >> > >>> Hi, >> >> >> > >>> >> >> >> > >>> I modified the crush library to accept two weights (one >> >> >> for the first disk, the other for the remaining disks)[1]. This >> >> >> really is a hack for experimentation purposes only ;-) I was >> >> >> able to run a variation of your code[2] and got the following >> >> >> results which are encouraging. Do you think what I did is >> >> >> sensible ? Or is there a problem I don't see ? >> >> >> > >>> >> >> >> > >>> Thanks ! >> >> >> > >>> >> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 >> >> >> 6] >> >> >> > >>> >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> >> > >>> Before: All replicas on each hard drive >> >> >> > >>> Expected vs actual use (20000 samples) >> >> >> > >>> disk 0: 1.39e-01 1.12e-01 >> >> >> > >>> disk 1: 1.11e-01 1.10e-01 >> >> >> > >>> disk 2: 8.33e-02 1.13e-01 >> >> >> > >>> disk 3: 1.39e-01 1.11e-01 >> >> >> > >>> disk 4: 1.11e-01 1.11e-01 >> >> >> > >>> disk 5: 8.33e-02 1.11e-01 >> >> >> > >>> disk 6: 1.39e-01 1.12e-01 >> >> >> > >>> disk 7: 1.11e-01 1.12e-01 >> >> >> > >>> disk 8: 8.33e-02 1.10e-01 >> >> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03 >> >> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03 >> >> >> > >>> ... >> >> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07 >> >> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07 >> >> >> > >>> Converged to desired accuracy :) >> >> >> > >>> After: All replicas on each hard drive >> >> >> > >>> Expected vs actual use (20000 samples) >> >> >> > >>> disk 0: 1.39e-01 1.42e-01 >> >> >> > >>> disk 1: 1.11e-01 1.09e-01 >> >> >> > >>> disk 2: 8.33e-02 8.37e-02 >> >> >> > >>> disk 3: 1.39e-01 1.40e-01 >> >> >> > >>> disk 4: 1.11e-01 1.13e-01 >> >> >> > >>> disk 5: 8.33e-02 8.08e-02 >> >> >> > >>> disk 6: 1.39e-01 1.38e-01 >> >> >> > >>> disk 7: 1.11e-01 1.09e-01 >> >> >> > >>> disk 8: 8.33e-02 8.48e-02 >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1] >> >> >> > >>> >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> >> > >>> Before: All replicas on each hard drive >> >> >> > >>> Expected vs actual use (20000 samples) >> >> >> > >>> disk 0: 2.44e-01 2.36e-01 >> >> >> > >>> disk 1: 2.44e-01 2.38e-01 >> >> >> > >>> disk 2: 2.44e-01 2.34e-01 >> >> >> > >>> disk 3: 2.44e-01 2.38e-01 >> >> >> > >>> disk 4: 2.44e-02 5.37e-02 >> >> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03 >> >> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03 >> >> >> > >>> ... >> >> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08 >> >> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08 >> >> >> > >>> Converged to desired accuracy :) >> >> >> > >>> After: All replicas on each hard drive >> >> >> > >>> Expected vs actual use (20000 samples) >> >> >> > >>> disk 0: 2.44e-01 2.46e-01 >> >> >> > >>> disk 1: 2.44e-01 2.44e-01 >> >> >> > >>> disk 2: 2.44e-01 2.41e-01 >> >> >> > >>> disk 3: 2.44e-01 2.45e-01 >> >> >> > >>> disk 4: 2.44e-02 2.33e-02 >> >> >> > >>> >> >> >> > >>> >> >> >> > >>> [1] crush >> >> >> hackhttp://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd >> >> >> 56fee8 >> >> >> > >>> [2] python-crush >> >> >> hackhttp://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1 >> >> >> bd25f8f2c4b68 >> >> >> > >>> >> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote: >> >> >> > >>>> Hi Pedro, >> >> >> > >>>> >> >> >> > >>>> It looks like trying to experiment with crush won't work >> >> >> as expected because crush does not distinguish the probability >> >> >> of selecting the first device from the probability of selecting >> >> >> the second or third device. Am I mistaken ? >> >> >> > >>>> >> >> >> > >>>> Cheers >> >> >> > >>>> >> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote: >> >> >> > >>>>> Hi Pedro, >> >> >> > >>>>> >> >> >> > >>>>> I'm going to experiment with what you did at >> >> >> > >>>>> >> >> >> > >>>>> >> >> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb >> >> >> > >>>>> >> >> >> > >>>>> and the latest python-crush published today. A >> >> >> comparison function was added that will help measure the data >> >> >> movement. I'm hoping we can release an offline tool based on >> >> >> your solution. Please let me know if I should wait before diving >> >> >> into this, in case you have unpublished drafts or new ideas. >> >> >> > >>>>> >> >> >> > >>>>> Cheers >> >> >> > >>>>> >> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote: >> >> >> > >>>>>> Great, thanks for the clarifications. >> >> >> > >>>>>> I also think that the most natural way is to keep just >> >> >> a set of >> >> >> > >>>>>> weights in the CRUSH map and update them inside the >> >> >> algorithm. >> >> >> > >>>>>> >> >> >> > >>>>>> I keep working on it. >> >> >> > >>>>>> >> >> >> > >>>>>> >> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil >> >> >> <sage@xxxxxxxxxxxx>: >> >> >> > >>>>>>> Hi Pedro, >> >> >> > >>>>>>> >> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating >> >> >> problem and we >> >> >> > >>>>>>> haven't made much headway. >> >> >> > >>>>>>> >> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote: >> >> >> > >>>>>>>> Hi, >> >> >> > >>>>>>>> >> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that >> >> >> much but I have >> >> >> > >>>>>>>> been thinking about it. In order to adapt the >> >> >> previous algorithm in >> >> >> > >>>>>>>> the python notebook I need to substitute the >> >> >> iteration over all >> >> >> > >>>>>>>> possible devices permutations to iteration over all >> >> >> the possible >> >> >> > >>>>>>>> selections that crush would make. That is the main >> >> >> thing I need to >> >> >> > >>>>>>>> work on. >> >> >> > >>>>>>>> >> >> >> > >>>>>>>> The other thing is of course that weights change for >> >> >> each replica. >> >> >> > >>>>>>>> That is, they cannot be really fixed in the crush >> >> >> map. So the >> >> >> > >>>>>>>> algorithm inside libcrush, not only the weights in >> >> >> the map, need to be >> >> >> > >>>>>>>> changed. The weights in the crush map should reflect >> >> >> then, maybe, the >> >> >> > >>>>>>>> desired usage frequencies. Or maybe each replica >> >> >> should have their own >> >> >> > >>>>>>>> crush map, but then the information about the >> >> >> previous selection >> >> >> > >>>>>>>> should be passed to the next replica placement run so >> >> >> it avoids >> >> >> > >>>>>>>> selecting the same one again. >> >> >> > >>>>>>> >> >> >> > >>>>>>> My suspicion is that the best solution here (whatever >> >> >> that means!) >> >> >> > >>>>>>> leaves the CRUSH weights intact with the desired >> >> >> distribution, and >> >> >> > >>>>>>> then generates a set of derivative weights--probably >> >> >> one set for each >> >> >> > >>>>>>> round/replica/rank. >> >> >> > >>>>>>> >> >> >> > >>>>>>> One nice property of this is that once the support is >> >> >> added to encode >> >> >> > >>>>>>> multiple sets of weights, the algorithm used to >> >> >> generate them is free to >> >> >> > >>>>>>> change and evolve independently. (In most cases any >> >> >> change is >> >> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out >> >> >> because all >> >> >> > >>>>>>> parties participating in the cluster have to support >> >> >> any new behavior >> >> >> > >>>>>>> before it is enabled or used.) >> >> >> > >>>>>>> >> >> >> > >>>>>>>> I have a question also. Is there any significant >> >> >> difference between >> >> >> > >>>>>>>> the device selection algorithm description in the >> >> >> paper and its final >> >> >> > >>>>>>>> implementation? >> >> >> > >>>>>>> >> >> >> > >>>>>>> The main difference is the "retry_bucket" behavior was >> >> >> found to be a bad >> >> >> > >>>>>>> idea; any collision or failed()/overload() case >> >> >> triggers the >> >> >> > >>>>>>> retry_descent. >> >> >> > >>>>>>> >> >> >> > >>>>>>> There are other changes, of course, but I don't think >> >> >> they'll impact any >> >> >> > >>>>>>> solution we come with here (or at least any solution >> >> >> can be suitably >> >> >> > >>>>>>> adapted)! >> >> >> > >>>>>>> >> >> >> > >>>>>>> sage >> >> >> > >>>>>> -- >> >> >> > >>>>>> To unsubscribe from this list: send the line >> >> >> "unsubscribe ceph-devel" in >> >> >> > >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >> >> >> > >>>>>> More majordomo info at >> >> >> http://vger.kernel.org/majordomo-info.html >> >> >> > >>>>>> >> >> >> > >>>>> >> >> >> > >>>> >> >> >> > >>> >> >> >> > >>> -- >> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre >> >> >> > >> -- >> >> >> > >> To unsubscribe from this list: send the line "unsubscribe >> >> >> ceph-devel" in >> >> >> > >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> >> >> > >> More majordomo info at >> >> >> http://vger.kernel.org/majordomo-info.html >> >> >> > >> >> >> >> > > >> >> >> > > -- >> >> >> > > Loïc Dachary, Artisan Logiciel Libre >> >> >> > -- >> >> >> > To unsubscribe from this list: send the line "unsubscribe >> >> >> ceph-devel" in >> >> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx >> >> >> > More majordomo info at >> >> >> http://vger.kernel.org/majordomo-info.html >> >> >> > >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> Loïc Dachary, Artisan Logiciel Libre >> > >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html