Hi, I have a somewhat different reweighting algorithm for multi-pick. It was a bit long, so I decided not to post it on the mail thread and instead uploaded it to a GitHub repo [1]. I have explained it with examples, along with the reasons I think it can solve the problem. I would really appreciate it if someone could go through it and say whether or not it is viable.

[1] https://github.com/SpandanKumarSahu/Ceph_Proposal

On Tue, Mar 28, 2017 at 12:22 PM, Adam Kupczyk <akupczyk@xxxxxxxxxxxx> wrote:
> "... or simply have a single global set"
>
> No. Proof by example:
> I once attempted to perfectly balance cluster X by modifying crush weights.
> Pool A spanned 352 OSDs (set A).
> Pool B spanned 176 OSDs (set B, half of A).
> The result (simulated perfect balance) was that the obtained weights had:
> - small variance for B (5%),
> - small variance for A-B (5%),
> - huge variance for A (800%).
> This was of course because crush had to be strongly discouraged from picking from B when performing placement for A.
>
> "...crush users can choose..."
> For each pool there is only one vector of weights that will provide perfect balance. (Math note: actually multiple of them, but differing only by scale.)
> I cannot at the moment imagine any practical metric other than balancing. But maybe that is just a failure of imagination.
>
> On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Mon, 27 Mar 2017, Adam Kupczyk wrote:
>>> Hi,
>>>
>>> My understanding is that optimal tweaked weights will depend on:
>>> 1) pool_id, because of rjenkins(pool_id) in crush
>>> 2) the number of placement groups and the replication factor, as they determine the number of samples
>>>
>>> Therefore tweaked weights should rather be a property of the instantiated pool, not of the crush placement definition.
>>>
>>> If tweaked weights are to be part of the crush definition, then for each created pool we need a separate list of weights.
>>> Is it possible to provide clients with different weights depending on which pool they want to operate on?
>>
>> As Loic suggested, you can create as many derivative hierarchies in the crush map as you like, potentially one per pool. Or you could treat the sum total of all pgs as the interesting set, balance those, and get some OSDs doing a bit more of one pool than another. The new post-CRUSH OSD remap capability can always clean this up (and turn a "good" crush distribution into a perfect distribution).
>>
>> I guess the question is: when we add the explicit adjusted weight matrix to crush, should we have multiple sets of weights (perhaps one for each pool), or simply a single global set? It might make sense to allow N sets of adjusted weights so that crush users can choose a particular set of them for different pools (or whatever it is they're calculating the mapping for).
>>
>> sage
>>
>>> Best regards,
>>> Adam
>>>
>>> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@xxxxxxxxxxxx> wrote:
>>> > Hi,
>>> >
>>> > My understanding is that optimal tweaked weights will depend on:
>>> > 1) pool_id, because of rjenkins(pool_id) in crush
>>> > 2) the number of placement groups and the replication factor, as they determine the number of samples
>>> >
>>> > Therefore tweaked weights should rather be a property of the instantiated pool, not of the crush placement definition.
>>> >
>>> > If tweaked weights are to be part of the crush definition, then for each created pool we need a separate list of weights.
>>> > Is it possible to provide clients with different weights depending on which pool they want to operate on?
>>> >
>>> > Best regards,
>>> > Adam
>>> >
>>> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>> >>
>>> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> >> Hello Sage, Loic, Pedro,
>>> >> >>
>>> >> >> I am certain that an almost perfect mapping can be achieved by substituting the weights from the crush map with slightly modified weights. By perfect mapping I mean that each OSD gets a number of PGs exactly proportional to the weights specified in the crush map.
>>> >> >>
>>> >> >> 1. Example
>>> >> >> Let's think of the PGs of a single object pool.
>>> >> >> We have OSDs with the following weights:
>>> >> >> [10, 10, 10, 5, 5]
>>> >> >>
>>> >> >> Ideally, we would like the following distribution of 200 PGs x 3 copies = 600 PG copies:
>>> >> >> [150, 150, 150, 75, 75]
>>> >> >>
>>> >> >> However, because crush simulates a random process, we get:
>>> >> >> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> We could have obtained a perfect distribution had we used weights like this:
>>> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >> >>
>>> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >> >>
>>> >> >> When we apply crush for the first time, the distribution of PGs comes out as random:
>>> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> But CRUSH is not a random process at all; it behaves in a numerically stable way. Specifically, if we increase the weight on one node, we will get more PGs on this node and fewer on every other node:
>>> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >> >>
>>> >> >> Now, finding the ideal weights can be done by any numerical minimization method, for example NLMS.
>>> >> >>
>>> >> >> 3. The proposal
>>> >> >> For each pool, perfect weights will be derived from the initial weights given in the crush map. These weights will be used to calculate the PG distribution, which of course will be close to perfect.
>>> >> >>
>>> >> >> 3a: Downside when an OSD is out
>>> >> >> When an OSD is out, missing PG copies will be replicated elsewhere. Because the weights now deviate from OSD capacity, some OSDs will statistically get more copies than they should. This unevenness in the distribution is proportional to how far the calculated weights deviate from the capacity weights.
>>> >> >>
>>> >> >> 3b: Upside
>>> >> >> This all can be achieved without changes to crush.
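
(To make section 2 above concrete: the offline search for tweaked weights can be as simple as the fixed-point loop sketched below. simulate_pg_counts is only a hypothetical stand-in for mapping a pool's PGs through CRUSH and counting PG copies per OSD -- a real version would call libcrush or python-crush -- and the damped multiplicative update is just one easy choice, not necessarily NLMS.)

- BEGIN CODE -
import random

def simulate_pg_counts(weights, num_pgs=200, replicas=3, seed=42):
    # Toy stand-in for CRUSH: weighted sampling without replacement per PG.
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for pg in range(num_pgs):
        remaining = list(range(len(weights)))
        for r in range(replicas):
            pick = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
            counts[pick] += 1
            remaining.remove(pick)
    return counts

def tweak_weights(capacity_weights, num_pgs=200, replicas=3, rounds=50):
    total = num_pgs * replicas
    cap_sum = float(sum(capacity_weights))
    target = [total * w / cap_sum for w in capacity_weights]   # e.g. [150, 150, 150, 75, 75]
    weights = list(capacity_weights)
    for _ in range(rounds):
        actual = simulate_pg_counts(weights, num_pgs, replicas)
        # Nudge weights up where the OSD got too few PG copies, down where too many.
        weights = [w * (t / max(a, 1)) ** 0.5
                   for w, t, a in zip(weights, target, actual)]
    return weights

print(tweak_weights([10, 10, 10, 5, 5]))
- END CODE -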
>>> >> >
>>> >> > Yes!
>>> >> >
>>> >> > And no. You're totally right--we should use an offline optimization to tweak the crush input weights to get a better balance. It won't be robust to changes to the cluster, but we can incrementally optimize after that happens to converge on something better.
>>> >> >
>>> >> > The problem with doing this with current versions of Ceph is that we lose the original "input" or "target" weights (i.e., the actual size of the OSD) that we want to converge on. This is one reason why we haven't done something like this before.
>>> >> >
>>> >> > In luminous we *could* work around this by storing those canonical weights outside of crush using something (probably?) ugly and maintain backward compatibility with older clients using existing CRUSH behavior.
>>> >>
>>> >> These canonical weights could be stored in crush by creating dedicated buckets. For instance, a root-canonical bucket could be created to store the canonical weights of the root bucket. The sysadmin needs to be aware of the difference and know to add a new device to the host01-canonical bucket instead of the host01 bucket, and to run an offline tool that keeps the two buckets in sync and computes the weights used for placement from the weights representing the device capacity.
>>> >>
>>> >> It is a little bit ugly ;-)
>>> >>
>>> >> > OR (and this is my preferred route), if the multi-pick anomaly approach that Pedro is working on works out, we'll want to extend the CRUSH map to include a set of derivative weights used for actual placement calculations instead of the canonical target weights, and we can do what you're proposing *and* solve the multipick problem with one change in the crush map and algorithm. (Actually choosing those derivative weights will be an offline process that can both improve the balance for the inputs we care about *and* adjust them based on the position to fix the skew issue for replicas.) This doesn't help pre-luminous clients, but I think the end solution will be simpler and more elegant...
>>> >> >
>>> >> > What do you think?
>>> >> >
>>> >> > sage
>>> >> >
>>> >> >> 4. Extra
>>> >> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters cluster. It succeeded.
>>> >> >> The solution was not accepted, because the modifications of OSD weights were higher than 50%, which was caused by the fact that different placement rules operated on different sets of OSDs, and those sets were not disjoint.
>>> >> >
>>> >> >>
>>> >> >> Best regards,
>>> >> >> Adam
>>> >> >>
>>> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> >> >> Hi Pedro, Loic,
>>> >> >>
>>> >> >> For what it's worth, my intuition here (which has had a mixed record as far as CRUSH goes) is that this is the most promising path forward.
>>> >> >>
>>> >> >> Thinking ahead a few steps, and confirming that I'm following the discussion so far: if you're able to get black (or white) box gradient descent to work, then this will give us a set of weights for each item in the tree for each selection round, derived from the tree structure and original (target) weights. That would basically give us a map of item id (bucket id or leaf item id) to weight for each round, i.e.,
>>> >> >>
>>> >> >> map<int, map<int, float>> weight_by_position;  // position -> item -> weight
>>> >> >>
>>> >> >> where the 0 round would (I think?) match the target weights, and each round after that would skew low-weighted items lower to some degree. Right?
>>> >> >>
>>> >> >> The next question I have is: does this generalize from the single-bucket case to the hierarchy?
>>> >> >> I.e., if I have a "tree" (single bucket) like
>>> >> >>
>>> >> >>    3.1
>>> >> >>     |_____________
>>> >> >>     |    \    \    \
>>> >> >>    1.0  1.0  1.0  .1
>>> >> >>
>>> >> >> it clearly works, but when we have a multi-level tree like
>>> >> >>
>>> >> >>                      8.4
>>> >> >>     |____________________________________
>>> >> >>     |                  \                  \
>>> >> >>    3.1                3.1                2.2
>>> >> >>     |_____________     |_____________     |_____________
>>> >> >>     |    \    \    \   |    \    \    \   |    \    \   \
>>> >> >>    1.0  1.0  1.0  .1  1.0  1.0  1.0  .1  1.0  1.0  .1  .1
>>> >> >>
>>> >> >> and the second round weights skew the small .1 leaves lower, can we continue to build the summed-weight hierarchy, such that the adjusted weights at the higher level are appropriately adjusted to give us the right probabilities of descending into those trees? I'm not sure if that logically follows from the above or if my intuition is oversimplifying things.
>>> >> >>
>>> >> >> If this *is* how we think this will shake out, then I'm wondering if we should go ahead and build this weight matrix into CRUSH sooner rather than later (i.e., for luminous). As with the explicit remappings, the hard part is all done offline, and the adjustments to the CRUSH mapping calculation itself (storing and making use of the adjusted weights for each round of placement) are relatively straightforward. And the sooner this is incorporated into a release, the sooner real users will be able to roll out code to all clients and start making use of it.
>>> >> >>
>>> >> >> Thanks again for looking at this problem! I'm excited that we may be closing in on a real solution!
>>> >> >>
>>> >> >> sage
>>> >> >>
>>> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >> >>
>>> >> >> > There are a lot of gradient-free methods. I will try first to run the ones available using just scipy (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html). Some of them don't require the gradient and some of them can estimate it. The reason to go without the gradient is to run the CRUSH algorithm as a black box. In that case this would be the pseudo-code:
>>> >> >> >
>>> >> >> > - BEGIN CODE -
>>> >> >> > def build_target(desired_freqs):
>>> >> >> >     def target(weights):
>>> >> >> >         # run a simulation of CRUSH for a number of objects
>>> >> >> >         sim_freqs = run_crush(weights)
>>> >> >> >         # Kullback-Leibler divergence between desired frequencies and current ones
>>> >> >> >         return loss(sim_freqs, desired_freqs)
>>> >> >> >     return target
>>> >> >> >
>>> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> >> >> > - END CODE -
>>> >> >> >
>>> >> >> > The tricky thing here is that this procedure can be slow if the simulation (run_crush) needs to place a lot of objects to get accurate simulated frequencies. This is especially true if the minimize method attempts to approximate the gradient using finite differences, since it will evaluate the target function a number of times proportional to the number of weights.
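
(For reference, a self-contained version of the pseudo-code above that actually runs, with a toy sampler standing in for run_crush -- the real thing would drive libcrush or python-crush -- might look like the following. The loss is the KL divergence mentioned above, and Nelder-Mead is just one of scipy's gradient-free methods.)

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

def run_crush(weights, n_samples=5000, replicas=2, seed=0):
    # Toy stand-in for CRUSH: weighted choice without replacement per object.
    rng = np.random.default_rng(seed)
    w = np.maximum(np.asarray(weights, dtype=float), 1e-9)
    counts = np.zeros(len(w))
    for _ in range(n_samples):
        picks = rng.choice(len(w), size=replicas, replace=False, p=w / w.sum())
        counts[picks] += 1
    return counts / counts.sum()

def loss(sim_freqs, desired_freqs):
    # Kullback-Leibler divergence D(desired || simulated).
    return float(np.sum(desired_freqs * np.log(desired_freqs / np.maximum(sim_freqs, 1e-9))))

def build_target(desired_freqs):
    def target(weights):
        return loss(run_crush(weights), desired_freqs)
    return target

capacity = np.array([10.0, 10.0, 10.0, 10.0, 1.0])
desired_freqs = capacity / capacity.sum()
res = minimize(build_target(desired_freqs), x0=capacity,
               method='Nelder-Mead', options={'maxiter': 200})
print('tweaked weights:', res.x)
- END CODE -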
>>> >> >> > Apart from the ones in scipy I would also try optimization methods that try to perform as few evaluations as possible, like for example HyperOpt (http://hyperopt.github.io/hyperopt/), which by the way takes into account that the target function can be noisy.
>>> >> >> >
>>> >> >> > This black box approximation is simple to implement and makes the computer do all the work instead of us. I think this black box approximation is worth trying even if it's not the final one, because if this approximation works then we know that a more elaborate one that computes the gradient of the CRUSH algorithm will work for sure.
>>> >> >> >
>>> >> >> > I can try this black box approximation this weekend, not on the real CRUSH algorithm but with the simple implementation I did in python. If it works it's just a matter of substituting one simulation with another and seeing what happens.
>>> >> >> >
>>> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>:
>>> >> >> > > Hi Pedro,
>>> >> >> > >
>>> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >> >> > >> Hi Loic,
>>> >> >> > >>
>>> >> >> > >> From what I see everything seems OK.
>>> >> >> > >
>>> >> >> > > Cool. I'll keep going in this direction then!
>>> >> >> > >
>>> >> >> > >> The interesting thing would be to test on some complex mapping. The reason is that "CrushPolicyFamily" is right now modeling just a single straw bucket, not the full CRUSH algorithm.
>>> >> >> > >
>>> >> >> > > A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and, by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights, but I think we can ignore that because crush will only recurse to place one object within a given host.
>>> >> >> > >
>>> >> >> > >> That's the work that remains to be done. The only way that would avoid reimplementing the CRUSH algorithm and computing the gradient would be treating CRUSH as a black box and eliminating the necessity of computing the gradient, either by using a gradient-free optimization method or by making an estimation of the gradient.
>>> >> >> > >
>>> >> >> > > By gradient-free optimization you mean simulated annealing or Monte Carlo?
>>> >> >> > >
>>> >> >> > > Cheers
>>> >> >> > >
>>> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>:
>>> >> >> > >>> Hi,
>>> >> >> > >>>
>>> >> >> > >>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results, which are encouraging. Do you think what I did is sensible? Or is there a problem I don't see?
>>> >> >> > >>>
>>> >> >> > >>> Thanks!
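
(As a companion to [1]: the sketch below is a small, self-contained Python toy -- not the libcrush patch itself -- of what "one weight vector for the first disk, another for the remaining disks" means for a straw2-style draw. The _hash01 helper is only a stand-in for CRUSH's per-(object, item, attempt) hash.)

- BEGIN CODE -
import hashlib
import math

def _hash01(obj, disk, attempt):
    # Deterministic pseudo-random value in (0, 1).
    h = hashlib.sha256(f"{obj}-{disk}-{attempt}".encode()).hexdigest()
    return (int(h[:12], 16) + 1) / float(16**12 + 2)

def straw2_pick(obj, weights, exclude, attempt):
    # Straw2-style draw: every candidate draws log(u)/w and the longest straw wins.
    best, best_straw = None, None
    for disk, w in enumerate(weights):
        if disk in exclude or w <= 0:
            continue
        straw = math.log(_hash01(obj, disk, attempt)) / w
        if best_straw is None or straw > best_straw:
            best, best_straw = disk, straw
    return best

def place(obj, weights_first, weights_rest, replicas=2):
    chosen = []
    for r in range(replicas):
        weights = weights_first if r == 0 else weights_rest
        chosen.append(straw2_pick(obj, weights, set(chosen), attempt=r))
    return chosen

# Before optimization both vectors equal the capacities; the optimizer would
# then tune weights_rest (and possibly weights_first) until the per-disk usage
# matches the capacity shares.
weights_first = [10, 10, 10, 10, 1]
weights_rest = [10, 10, 10, 10, 1]
counts = [0] * len(weights_first)
for obj in range(20000):
    for d in place(obj, weights_first, weights_rest):
        counts[d] += 1
print([round(c / float(sum(counts)), 4) for c in counts])
- END CODE -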
>>> >> >> > >>>
>>> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>> >> >> > >>> ------------------------------------------------------------------------
>>> >> >> > >>> Before: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>>  disk 0: 1.39e-01 1.12e-01
>>> >> >> > >>>  disk 1: 1.11e-01 1.10e-01
>>> >> >> > >>>  disk 2: 8.33e-02 1.13e-01
>>> >> >> > >>>  disk 3: 1.39e-01 1.11e-01
>>> >> >> > >>>  disk 4: 1.11e-01 1.11e-01
>>> >> >> > >>>  disk 5: 8.33e-02 1.11e-01
>>> >> >> > >>>  disk 6: 1.39e-01 1.12e-01
>>> >> >> > >>>  disk 7: 1.11e-01 1.12e-01
>>> >> >> > >>>  disk 8: 8.33e-02 1.10e-01
>>> >> >> > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>> >> >> > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>> >> >> > >>> ...
>>> >> >> > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>> >> >> > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>> >> >> > >>> Converged to desired accuracy :)
>>> >> >> > >>> After: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>>  disk 0: 1.39e-01 1.42e-01
>>> >> >> > >>>  disk 1: 1.11e-01 1.09e-01
>>> >> >> > >>>  disk 2: 8.33e-02 8.37e-02
>>> >> >> > >>>  disk 3: 1.39e-01 1.40e-01
>>> >> >> > >>>  disk 4: 1.11e-01 1.13e-01
>>> >> >> > >>>  disk 5: 8.33e-02 8.08e-02
>>> >> >> > >>>  disk 6: 1.39e-01 1.38e-01
>>> >> >> > >>>  disk 7: 1.11e-01 1.09e-01
>>> >> >> > >>>  disk 8: 8.33e-02 8.48e-02
>>> >> >> > >>>
>>> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>> >> >> > >>> ------------------------------------------------------------------------
>>> >> >> > >>> Before: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>>  disk 0: 2.44e-01 2.36e-01
>>> >> >> > >>>  disk 1: 2.44e-01 2.38e-01
>>> >> >> > >>>  disk 2: 2.44e-01 2.34e-01
>>> >> >> > >>>  disk 3: 2.44e-01 2.38e-01
>>> >> >> > >>>  disk 4: 2.44e-02 5.37e-02
>>> >> >> > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>> >> >> > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>> >> >> > >>> ...
>>> >> >> > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>> >> >> > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>> >> >> > >>> Converged to desired accuracy :)
>>> >> >> > >>> After: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>>  disk 0: 2.44e-01 2.46e-01
>>> >> >> > >>>  disk 1: 2.44e-01 2.44e-01
>>> >> >> > >>>  disk 2: 2.44e-01 2.41e-01
>>> >> >> > >>>  disk 3: 2.44e-01 2.45e-01
>>> >> >> > >>>  disk 4: 2.44e-02 2.33e-02
>>> >> >> > >>>
>>> >> >> > >>> [1] crush hack: http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> >> >> > >>> [2] python-crush hack: http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>> >> >> > >>>
>>> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >> >> > >>>> Hi Pedro,
>>> >> >> > >>>>
>>> >> >> > >>>> It looks like trying to experiment with crush won't work as expected, because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken?
>>> >> >> > >>>>
>>> >> >> > >>>> Cheers
>>> >> >> > >>>>
>>> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >> >> > >>>>> Hi Pedro,
>>> >> >> > >>>>>
>>> >> >> > >>>>> I'm going to experiment with what you did at
>>> >> >> > >>>>>
>>> >> >> > >>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >> >> > >>>>>
>>> >> >> > >>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>> >> >> > >>>>>
>>> >> >> > >>>>> Cheers
>>> >> >> > >>>>>
>>> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >> >> > >>>>>> Great, thanks for the clarifications.
>>> >> >> > >>>>>> I also think that the most natural way is to keep just a set of weights in the CRUSH map and update them inside the algorithm.
>>> >> >> > >>>>>>
>>> >> >> > >>>>>> I keep working on it.
>>> >> >> > >>>>>>
>>> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@xxxxxxxxxxxx>:
>>> >> >> > >>>>>>> Hi Pedro,
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating problem and we haven't made much headway.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >> >> > >>>>>>>> Hi,
>>> >> >> > >>>>>>>>
>>> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that much but I have been thinking about it. In order to adapt the previous algorithm in the python notebook I need to substitute the iteration over all possible device permutations with iteration over all the possible selections that crush would make. That is the main thing I need to work on.
>>> >> >> > >>>>>>>>
>>> >> >> > >>>>>>>> The other thing is of course that weights change for each replica. That is, they cannot really be fixed in the crush map.
>>> >> >> > >>>>>>>> So the algorithm inside libcrush, not only the weights in the map, needs to be changed. The weights in the crush map should then reflect, maybe, the desired usage frequencies. Or maybe each replica should have its own crush map, but then the information about the previous selection should be passed to the next replica placement run so it avoids selecting the same one again.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> My suspicion is that the best solution here (whatever that means!) leaves the CRUSH weights intact with the desired distribution, and then generates a set of derivative weights--probably one set for each round/replica/rank.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> One nice property of this is that once the support is added to encode multiple sets of weights, the algorithm used to generate them is free to change and evolve independently. (In most cases any change in CRUSH's mapping behavior is difficult to roll out because all parties participating in the cluster have to support any new behavior before it is enabled or used.)
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>>> I have a question also. Is there any significant difference between the device selection algorithm description in the paper and its final implementation?
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> The main difference is that the "retry_bucket" behavior was found to be a bad idea; any collision or failed()/overload() case triggers the retry_descent.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> There are other changes, of course, but I don't think they'll impact any solution we come up with here (or at least any solution can be suitably adapted)!
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> sage
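
(A minimal sketch, in Python rather than the eventual crush encoding and with made-up numbers, of the "canonical weights plus one derivative set per round/replica/rank" layout described above; lookups fall back to the canonical weights when no per-rank entry exists, so rank 0 behaves exactly as today.)

- BEGIN CODE -
canonical_weights = {                 # item id -> target weight (device size)
    0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 1.0,
}

weight_by_position = {                # rank -> item id -> derivative weight
    1: {0: 10.3, 1: 10.3, 2: 10.3, 3: 10.3, 4: 0.4},   # made-up numbers
}

def weight_for(item, rank):
    """Weight used when choosing the replica at position `rank`."""
    per_rank = weight_by_position.get(rank, canonical_weights)
    return per_rank.get(item, canonical_weights[item])

# The offline optimizer is free to change how weight_by_position is computed;
# clients only need to understand the encoding above.
assert weight_for(4, 0) == 1.0        # rank 0: canonical weight
assert weight_for(4, 1) == 0.4        # rank 1: skewed lower to counter over-selection
- END CODE -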
>>> >> >> > >>> --
>>> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >> >> > > --
>>> >> >> > > Loïc Dachary, Artisan Logiciel Libre
>>> >> --
>>> >> Loïc Dachary, Artisan Logiciel Libre

--
Spandan Kumar Sahu
IIT Kharagpur