On Sun, 26 Mar 2017, Adam Kupczyk wrote: > Hello Sage, Loic, Pedro, > > > I am certain that almost perfect mapping can be achieved by > substituting weights from crush map with slightly modified weights. > By perfect mapping I mean we get on each OSD number of PGs exactly > proportional to weights specified in crush map. > > 1. Example > Lets think of PGs of single object pool. > We have OSDs with following weights: > [10, 10, 10, 5, 5] > > Ideally, we would like following distribution of 200PG x 3 copies = 600 > PGcopies : > [150, 150, 150, 75, 75] > > However, because crush simulates random process we have: > [143, 152, 158, 71, 76] > > We could have obtained perfect distribution had we used weights like this: > [10.2, 9.9, 9.6, 5.2, 4.9] > > > 2. Obtaining perfect mapping weights from OSD capacity weights > > When we apply crush for the first time, distribution of PGs comes as random. > CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76] > > But CRUSH is not random proces at all, it behaves in numerically stable way. > Specifically, if we increase weight on one node, we will get more PGs on > this node and less on every other node: > CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76] > > Now, finding ideal weights can be done by any numerical minimization method, > for example NLMS. > > > 3. The proposal > For each pool, from initial weights given in crush map perfect weights will > be derived. > This weights will be used to calculate PG distribution. This of course will > be close to perfect. > > 3a: Downside when OSD is out > When an OSD is out, missing PG copies will be replicated elsewhere. > Because now weights deviate from OSD capacity, some OSDs will statistically > get more copies then they should. > This unevenness in distribution is proportional to scale of deviation of > calculated weights to capacity weights. > > 3b: Upside > This all can be achieved without changes to crush. Yes! And no. You're totally right--we should use an offline optimization to tweak the crush input weights to get a better balance. It won't be robust to changes to the cluster, but we can incrementally optimize after that happens to converge on something better. The problem with doing this with current versions of Ceph is that we lose the original "input" or "target" weights (i.e., the actual size of the OSD) that we want to converge on. This is one reason why we haven't done something like this before. In luminous we *could* work around this by storing those canonical weights outside of crush using something (probably?) ugly and maintain backward compatibility with older clients using existing CRUSH behavior. OR, (and this is my preferred route), if the multi-pick anomaly approach that Pedro is working on works out, we'll want to extend the CRUSH map to include a set of derivative weights used for actual placement calculations instead of the canonical target weights, and we can do what you're proposing *and* solve the multipick problem with one change in the crush map and algorithm. (Actually choosing those derivative weights will be an offline process that can both improve the balance for the inputs we care about *and* adjust them based on the position to fix the skew issue for replicas.) This doesn't help pre-luminous clients, but I think the end solution will be simpler and more elegant... What do you think? sage > 4. Extra > Some time ago I made such change to perfectly balance Thomson-Reuters > cluster. > It succeeded. > A solution was not accepted, because modification of OSD weights were higher > then 50%, which was caused by fact that different placement rules operated > on different sets of OSDs, and those sets were not disjointed. > > Best regards, > Adam > > > On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > Hi Pedro, Loic, > > For what it's worth, my intuition here (which has had a mixed > record as > far as CRUSH goes) is that this is the most promising path > forward. > > Thinking ahead a few steps, and confirming that I'm following > the > discussion so far, if you're able to do get black (or white) box > gradient > descent to work, then this will give us a set of weights for > each item in > the tree for each selection round, derived from the tree > structure and > original (target) weights. That would basically give us a map > of item id > (bucket id or leave item id) to weight for each round. i.e., > > map<int, map<int, float>> weight_by_position; // position -> > item -> weight > > where the 0 round would (I think?) match the target weights, and > each > round after that would skew low-weighted items lower to some > degree. > Right? > > The next question I have is: does this generalize from the > single-bucket > case to the hierarchy? I.e., if I have a "tree" (single bucket) > like > > 3.1 > |_____________ > | \ \ \ > 1.0 1.0 1.0 .1 > > it clearly works, but when we have a multi-level tree like > > > 8.4 > |____________________________________ > | \ \ > 3.1 3.1 2.2 > |_____________ |_____________ |_____________ > | \ \ \ | \ \ \ | \ \ \ > 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1 > > and the second round weights skew the small .1 leaves lower, can > we > continue to build the summed-weight hierarchy, such that the > adjusted > weights at the higher level are appropriately adjusted to give > us the > right probabilities of descending into those trees? I'm not > sure if that > logically follows from the above or if my intuition is > oversimplifying > things. > > If this *is* how we think this will shake out, then I'm > wondering if we > should go ahead and build this weigh matrix into CRUSH sooner > rather > than later (i.e., for luminous). As with the explicit > remappings, the > hard part is all done offline, and the adjustments to the CRUSH > mapping > calculation itself (storing and making use of the adjusted > weights for > each round of placement) are relatively straightforward. And > the sooner > this is incorporated into a release the sooner real users will > be able to > roll out code to all clients and start making us of it. > > Thanks again for looking at this problem! I'm excited that we > may be > closing in on a real solution! > > sage > > > > > > On Thu, 23 Mar 2017, Pedro López-Adeva wrote: > > > There are lot of gradient-free methods. I will try first to > run the > > ones available using just scipy > > > (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html). > > Some of them don't require the gradient and some of them can > estimate > > it. The reason to go without the gradient is to run the CRUSH > > algorithm as a black box. In that case this would be the > pseudo-code: > > > > - BEGIN CODE - > > def build_target(desired_freqs): > > def target(weights): > > # run a simulation of CRUSH for a number of objects > > sim_freqs = run_crush(weights) > > # Kullback-Leibler divergence between desired > frequencies and > > current ones > > return loss(sim_freqs, desired_freqs) > > return target > > > > weights = scipy.optimize.minimize(build_target(desired_freqs)) > > - END CODE - > > > > The tricky thing here is that this procedure can be slow if > the > > simulation (run_crush) needs to place a lot of objects to get > accurate > > simulated frequencies. This is true specially if the minimize > method > > attempts to approximate the gradient using finite differences > since it > > will evaluate the target function a number of times > proportional to > > the number of weights). Apart from the ones in scipy I would > try also > > optimization methods that try to perform as few evaluations as > > possible like for example HyperOpt > > (http://hyperopt.github.io/hyperopt/), which by the way takes > into > > account that the target function can be noisy. > > > > This black box approximation is simple to implement and makes > the > > computer do all the work instead of us. > > I think that this black box approximation is worthy to try > even if > > it's not the final one because if this approximation works > then we > > know that a more elaborate one that computes the gradient of > the CRUSH > > algorithm will work for sure. > > > > I can try this black box approximation this weekend not on the > real > > CRUSH algorithm but with the simple implementation I did in > python. If > > it works it's just a matter of substituting one simulation > with > > another and see what happens. > > > > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>: > > > Hi Pedro, > > > > > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote: > > >> Hi Loic, > > >> > > >>>From what I see everything seems OK. > > > > > > Cool. I'll keep going in this direction then ! > > > > > >> The interesting thing would be to > > >> test on some complex mapping. The reason is that > "CrushPolicyFamily" > > >> is right now modeling just a single straw bucket not the > full CRUSH > > >> algorithm. > > > > > > A number of use cases use a single straw bucket, maybe the > majority of them. Even though it does not reflect the full range > of what crush can offer, it could be useful. To be more > specific, a crush map that states "place objects so that there > is at most one replica per host" or "one replica per rack" is > common. Such a crushmap can be reduced to a single straw bucket > that contains all the hosts and by using the CrushPolicyFamily, > we can change the weights of each host to fix the probabilities. > The hosts themselves contain disks with varying weights but I > think we can ignore that because crush will only recurse to > place one object within a given host. > > > > > >> That's the work that remains to be done. The only way that > > >> would avoid reimplementing the CRUSH algorithm and > computing the > > >> gradient would be treating CRUSH as a black box and > eliminating the > > >> necessity of computing the gradient either by using a > gradient-free > > >> optimization method or making an estimation of the > gradient. > > > > > > By gradient-free optimization you mean simulated annealing > or Monte Carlo ? > > > > > > Cheers > > > > > >> > > >> > > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>: > > >>> Hi, > > >>> > > >>> I modified the crush library to accept two weights (one > for the first disk, the other for the remaining disks)[1]. This > really is a hack for experimentation purposes only ;-) I was > able to run a variation of your code[2] and got the following > results which are encouraging. Do you think what I did is > sensible ? Or is there a problem I don't see ? > > >>> > > >>> Thanks ! > > >>> > > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 > 6] > > >>> > ------------------------------------------------------------------------ > > >>> Before: All replicas on each hard drive > > >>> Expected vs actual use (20000 samples) > > >>> disk 0: 1.39e-01 1.12e-01 > > >>> disk 1: 1.11e-01 1.10e-01 > > >>> disk 2: 8.33e-02 1.13e-01 > > >>> disk 3: 1.39e-01 1.11e-01 > > >>> disk 4: 1.11e-01 1.11e-01 > > >>> disk 5: 8.33e-02 1.11e-01 > > >>> disk 6: 1.39e-01 1.12e-01 > > >>> disk 7: 1.11e-01 1.12e-01 > > >>> disk 8: 8.33e-02 1.10e-01 > > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03 > > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03 > > >>> ... > > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07 > > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07 > > >>> Converged to desired accuracy :) > > >>> After: All replicas on each hard drive > > >>> Expected vs actual use (20000 samples) > > >>> disk 0: 1.39e-01 1.42e-01 > > >>> disk 1: 1.11e-01 1.09e-01 > > >>> disk 2: 8.33e-02 8.37e-02 > > >>> disk 3: 1.39e-01 1.40e-01 > > >>> disk 4: 1.11e-01 1.13e-01 > > >>> disk 5: 8.33e-02 8.08e-02 > > >>> disk 6: 1.39e-01 1.38e-01 > > >>> disk 7: 1.11e-01 1.09e-01 > > >>> disk 8: 8.33e-02 8.48e-02 > > >>> > > >>> > > >>> Simulation: R=2 devices capacity [10 10 10 10 1] > > >>> > ------------------------------------------------------------------------ > > >>> Before: All replicas on each hard drive > > >>> Expected vs actual use (20000 samples) > > >>> disk 0: 2.44e-01 2.36e-01 > > >>> disk 1: 2.44e-01 2.38e-01 > > >>> disk 2: 2.44e-01 2.34e-01 > > >>> disk 3: 2.44e-01 2.38e-01 > > >>> disk 4: 2.44e-02 5.37e-02 > > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03 > > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03 > > >>> ... > > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08 > > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08 > > >>> Converged to desired accuracy :) > > >>> After: All replicas on each hard drive > > >>> Expected vs actual use (20000 samples) > > >>> disk 0: 2.44e-01 2.46e-01 > > >>> disk 1: 2.44e-01 2.44e-01 > > >>> disk 2: 2.44e-01 2.41e-01 > > >>> disk 3: 2.44e-01 2.45e-01 > > >>> disk 4: 2.44e-02 2.33e-02 > > >>> > > >>> > > >>> [1] crush hackhttp://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd > 56fee8 > > >>> [2] python-crush hackhttp://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1 > bd25f8f2c4b68 > > >>> > > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote: > > >>>> Hi Pedro, > > >>>> > > >>>> It looks like trying to experiment with crush won't work > as expected because crush does not distinguish the probability > of selecting the first device from the probability of selecting > the second or third device. Am I mistaken ? > > >>>> > > >>>> Cheers > > >>>> > > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote: > > >>>>> Hi Pedro, > > >>>>> > > >>>>> I'm going to experiment with what you did at > > >>>>> > > >>>>> > https://github.com/plafl/notebooks/blob/master/replication.ipynb > > >>>>> > > >>>>> and the latest python-crush published today. A > comparison function was added that will help measure the data > movement. I'm hoping we can release an offline tool based on > your solution. Please let me know if I should wait before diving > into this, in case you have unpublished drafts or new ideas. > > >>>>> > > >>>>> Cheers > > >>>>> > > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote: > > >>>>>> Great, thanks for the clarifications. > > >>>>>> I also think that the most natural way is to keep just > a set of > > >>>>>> weights in the CRUSH map and update them inside the > algorithm. > > >>>>>> > > >>>>>> I keep working on it. > > >>>>>> > > >>>>>> > > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil > <sage@xxxxxxxxxxxx>: > > >>>>>>> Hi Pedro, > > >>>>>>> > > >>>>>>> Thanks for taking a look at this! It's a frustrating > problem and we > > >>>>>>> haven't made much headway. > > >>>>>>> > > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote: > > >>>>>>>> Hi, > > >>>>>>>> > > >>>>>>>> I will have a look. BTW, I have not progressed that > much but I have > > >>>>>>>> been thinking about it. In order to adapt the > previous algorithm in > > >>>>>>>> the python notebook I need to substitute the > iteration over all > > >>>>>>>> possible devices permutations to iteration over all > the possible > > >>>>>>>> selections that crush would make. That is the main > thing I need to > > >>>>>>>> work on. > > >>>>>>>> > > >>>>>>>> The other thing is of course that weights change for > each replica. > > >>>>>>>> That is, they cannot be really fixed in the crush > map. So the > > >>>>>>>> algorithm inside libcrush, not only the weights in > the map, need to be > > >>>>>>>> changed. The weights in the crush map should reflect > then, maybe, the > > >>>>>>>> desired usage frequencies. Or maybe each replica > should have their own > > >>>>>>>> crush map, but then the information about the > previous selection > > >>>>>>>> should be passed to the next replica placement run so > it avoids > > >>>>>>>> selecting the same one again. > > >>>>>>> > > >>>>>>> My suspicion is that the best solution here (whatever > that means!) > > >>>>>>> leaves the CRUSH weights intact with the desired > distribution, and > > >>>>>>> then generates a set of derivative weights--probably > one set for each > > >>>>>>> round/replica/rank. > > >>>>>>> > > >>>>>>> One nice property of this is that once the support is > added to encode > > >>>>>>> multiple sets of weights, the algorithm used to > generate them is free to > > >>>>>>> change and evolve independently. (In most cases any > change is > > >>>>>>> CRUSH's mapping behavior is difficult to roll out > because all > > >>>>>>> parties participating in the cluster have to support > any new behavior > > >>>>>>> before it is enabled or used.) > > >>>>>>> > > >>>>>>>> I have a question also. Is there any significant > difference between > > >>>>>>>> the device selection algorithm description in the > paper and its final > > >>>>>>>> implementation? > > >>>>>>> > > >>>>>>> The main difference is the "retry_bucket" behavior was > found to be a bad > > >>>>>>> idea; any collision or failed()/overload() case > triggers the > > >>>>>>> retry_descent. > > >>>>>>> > > >>>>>>> There are other changes, of course, but I don't think > they'll impact any > > >>>>>>> solution we come with here (or at least any solution > can be suitably > > >>>>>>> adapted)! > > >>>>>>> > > >>>>>>> sage > > >>>>>> -- > > >>>>>> To unsubscribe from this list: send the line > "unsubscribe ceph-devel" in > > >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx > > >>>>>> More majordomo info at > http://vger.kernel.org/majordomo-info.html > > >>>>>> > > >>>>> > > >>>> > > >>> > > >>> -- > > >>> Loïc Dachary, Artisan Logiciel Libre > > >> -- > > >> To unsubscribe from this list: send the line "unsubscribe > ceph-devel" in > > >> the body of a message to majordomo@xxxxxxxxxxxxxxx > > >> More majordomo info at > http://vger.kernel.org/majordomo-info.html > > >> > > > > > > -- > > > Loïc Dachary, Artisan Logiciel Libre > > -- > > To unsubscribe from this list: send the line "unsubscribe > ceph-devel" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at > http://vger.kernel.org/majordomo-info.html > > > > > > > >