Re: crush multipick anomaly

I have performed some tests, as I said, using the black box method.
Remember that I have not used the real CRUSH algorithm; the idea was
to compare the results against the white box method that uses
gradient information.

I have used two replicas and 9 disks with the following capacities:
10, 8, 6, 10, 8, 6, 10, 8, 6
Contrary to what I expected, scipy.optimize did not give good results,
I think because the target function is noisy. At the very least the
noise made the method I tried (SLSQP) report a non-success status, so
I switched to HyperOpt.
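
For reference, the failed scipy attempt looked roughly like the sketch
below. It is only an illustration: noisy_loss is a stand-in for the
simulation-based KL loss, not the actual code in my repository.

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

def noisy_loss(weights):
    # Stand-in for the simulation-based loss: place a finite number of
    # objects with the candidate weights and return the KL divergence
    # against the desired frequencies. The sampling noise breaks the
    # finite-difference gradient that SLSQP relies on.
    raise NotImplementedError

x0 = np.full(9, 1.0 / 9)                 # start from uniform weights
res = minimize(noisy_loss, x0, method='SLSQP',
               bounds=[(0.01, 1.0)] * 9)
print(res.success, res.message)          # came back as non-success on the noisy target
- END CODE -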

The first result is that, as expected, the method is much slower:

Time using jacobian:  0.36s
Time using simulation: 69.19s

But I think the results are OK. The following columns show the
parameters (weights for the second replica placement) estimated using
the Jacobian (white box) versus the simulation (black box) method.

jac  sim
--------
0.17 0.16
0.11 0.12
0.06 0.06
0.17 0.15
0.11 0.09
0.06 0.08
0.17 0.16
0.11 0.13
0.06 0.06

As you can see, the agreement is reasonable.
The changes I made are here:
https://github.com/plafl/snippets/commit/ea701d2cffbf3884eab866ce6e2388879e040894
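
In case it helps when reading the commit, the black box loop is
essentially the following sketch. Assumptions: run_crush_simulation is
a placeholder for my simplified placement simulation (not the real
CRUSH code), and the names are not the ones used in the commit.

- BEGIN CODE -
import numpy as np
from hyperopt import fmin, tpe, hp

CAPACITIES = np.array([10, 8, 6, 10, 8, 6, 10, 8, 6], dtype=float)
TARGET = CAPACITIES / CAPACITIES.sum()   # desired usage frequencies

def run_crush_simulation(weights, n_objects=20000):
    # Placeholder: place n_objects using the candidate second-replica
    # weights and return the observed per-disk frequencies.
    raise NotImplementedError

def kl_divergence(p, q):
    # Kullback-Leibler divergence between desired and simulated frequencies
    return float(np.sum(p * np.log(p / q)))

def objective(weights):
    w = np.asarray(weights)
    w = w / w.sum()                      # normalize the candidate weights
    sim_freqs = run_crush_simulation(w)
    return kl_divergence(TARGET, sim_freqs)

# TPE tolerates the noisy objective better than finite-difference methods
space = [hp.uniform('w%d' % i, 0.01, 1.0) for i in range(len(CAPACITIES))]
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=500)
- END CODE -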

Where to go from here (I think):
1. Perform this same test using the real CRUSH algorithm
2. Improve the method to run faster


2017-03-27 11:27 GMT+02:00 Adam Kupczyk <akupczyk@xxxxxxxxxxxx>:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?
>
> Best regards,
> Adam
>
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@xxxxxxxxxxxx> wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines
>> amount of samples
>>
>> Therefore tweaked weights should rather be a property of the instantiated
>> pool, not of the crush placement definition.
>>
>> If tweaked weights are to be part of the crush definition, then for each
>> created pool we need to have a separate list of weights.
>> Is it possible to provide clients with different weights depending on
>> which pool they want to operate on?
>>
>> Best regards,
>> Adam
>>
>>
>> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> Hello Sage, Loic, Pedro,
>>> >>
>>> >>
>>> >> I am certain that an almost perfect mapping can be achieved by
>>> >> substituting the weights from the crush map with slightly modified
>>> >> weights.
>>> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
>>> >> proportional to the weights specified in the crush map.
>>> >>
>>> >> 1. Example
>>> >> Let's think of the PGs of a single object pool.
>>> >> We have OSDs with following weights:
>>> >> [10, 10, 10, 5, 5]
>>> >>
>>> >> Ideally, we would like the following distribution of 200 PGs x 3 copies
>>> >> = 600 PG copies:
>>> >> [150, 150, 150, 75, 75]
>>> >>
>>> >> However, because crush simulates a random process, we have:
>>> >> [143, 152, 158, 71, 76]
>>> >>
>>> >> We could have obtained perfect distribution had we used weights like
>>> >> this:
>>> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >>
>>> >>
>>> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >>
>>> >> When we apply crush for the first time, the distribution of PGs comes
>>> >> out as random.
>>> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >>
>>> >> But CRUSH is not a random process at all; it behaves in a numerically
>>> >> stable way.
>>> >> Specifically, if we increase the weight of one node, we will get more
>>> >> PGs on this node and fewer on every other node:
>>> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >>
>>> >> Now, finding ideal weights can be done by any numerical minimization
>>> >> method,
>>> >> for example NLMS.
>>> >>
>>> >>
>>> >> 3. The proposal
>>> >> For each pool, perfect weights will be derived from the initial weights
>>> >> given in the crush map.
>>> >> These weights will be used to calculate the PG distribution, which of
>>> >> course will be close to perfect.
>>> >>
>>> >> 3a: Downside when OSD is out
>>> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>>> >> Because the weights now deviate from OSD capacity, some OSDs will
>>> >> statistically get more copies than they should.
>>> >> This unevenness in the distribution is proportional to how much the
>>> >> calculated weights deviate from the capacity weights.
>>> >>
>>> >> 3b: Upside
>>> >> This all can be achieved without changes to crush.
>>> >
>>> > Yes!
>>> >
>>> > And no.  You're totally right--we should use an offline optimization to
>>> > tweak the crush input weights to get a better balance.  It won't be
>>> > robust
>>> > to changes to the cluster, but we can incrementally optimize after that
>>> > happens to converge on something better.
>>> >
>>> > The problem with doing this with current versions of Ceph is that we
>>> > lose
>>> > the original "input" or "target" weights (i.e., the actual size of
>>> > the OSD) that we want to converge on.  This is one reason why we haven't
>>> > done something like this before.
>>> >
>>> > In luminous we *could* work around this by storing those canonical
>>> > weights outside of crush using something (probably?) ugly and
>>> > maintain backward compatibility with older clients using existing
>>> > CRUSH behavior.
>>>
>>> These canonical weights could be stored in crush by creating dedicated
>>> buckets. For instance the root-canonical bucket could be created to store
>>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> the difference and know to add a new device in the host01-canonical bucket
>>> instead of the host01 bucket. And to run an offline tool to keep the two
>>> buckets in sync and compute the weight to use for placement derived from the
>>> weights representing the device capacity.
>>>
>>> It is a little bit ugly ;-)
>>>
>>> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>>> > that Pedro is working on works out, we'll want to extend the CRUSH map
>>> > to
>>> > include a set of derivative weights used for actual placement
>>> > calculations
>>> > instead of the canonical target weights, and we can do what you're
>>> > proposing *and* solve the multipick problem with one change in the crush
>>> > map and algorithm.  (Actually choosing those derivative weights will
>>> > be an offline process that can both improve the balance for the inputs
>>> > we
>>> > care about *and* adjust them based on the position to fix the skew issue
>>> > for replicas.)  This doesn't help pre-luminous clients, but I think the
>>> > end solution will be simpler and more elegant...
>>> >
>>> > What do you think?
>>> >
>>> > sage
>>> >
>>> >
>>> >> 4. Extra
>>> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>>> >> cluster. It succeeded.
>>> >> The solution was not accepted, because the modifications to OSD weights
>>> >> were higher than 50%, which was caused by the fact that different
>>> >> placement rules operated on different sets of OSDs, and those sets were
>>> >> not disjoint.
>>> >
>>> >
>>> >>
>>> >> Best regards,
>>> >> Adam
>>> >>
>>> >>
>>> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> >>       Hi Pedro, Loic,
>>> >>
>>> >>       For what it's worth, my intuition here (which has had a mixed
>>> >>       record as
>>> >>       far as CRUSH goes) is that this is the most promising path
>>> >>       forward.
>>> >>
>>> >>       Thinking ahead a few steps, and confirming that I'm following
>>> >>       the
>>> >>       discussion so far, if you're able to get black (or white) box
>>> >>       gradient
>>> >>       descent to work, then this will give us a set of weights for
>>> >>       each item in
>>> >>       the tree for each selection round, derived from the tree
>>> >>       structure and
>>> >>       original (target) weights.  That would basically give us a map
>>> >>       of item id
>>> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
>>> >>
>>> >>        map<int, map<int, float>> weight_by_position;  // position ->
>>> >>       item -> weight
>>> >>
>>> >>       where the 0 round would (I think?) match the target weights, and
>>> >>       each
>>> >>       round after that would skew low-weighted items lower to some
>>> >>       degree.
>>> >>       Right?
>>> >>
>>> >>       The next question I have is: does this generalize from the
>>> >>       single-bucket
>>> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>>> >>       like
>>> >>
>>> >>       3.1
>>> >>        |_____________
>>> >>        |   \    \    \
>>> >>       1.0  1.0  1.0  .1
>>> >>
>>> >>       it clearly works, but when we have a multi-level tree like
>>> >>
>>> >>
>>> >>       8.4
>>> >>        |____________________________________
>>> >>        |                 \                  \
>>> >>       3.1                3.1                2.2
>>> >>        |_____________     |_____________     |_____________
>>> >>        |   \    \    \    |   \    \    \    |   \    \    \
>>> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>>> >>
>>> >>       and the second round weights skew the small .1 leaves lower, can
>>> >>       we
>>> >>       continue to build the summed-weight hierarchy, such that the
>>> >>       adjusted
>>> >>       weights at the higher level are appropriately adjusted to give
>>> >>       us the
>>> >>       right probabilities of descending into those trees?  I'm not
>>> >>       sure if that
>>> >>       logically follows from the above or if my intuition is
>>> >>       oversimplifying
>>> >>       things.
>>> >>
>>> >>       If this *is* how we think this will shake out, then I'm
>>> >>       wondering if we
>>> >>       should go ahead and build this weight matrix into CRUSH sooner
>>> >>       rather
>>> >>       than later (i.e., for luminous).  As with the explicit
>>> >>       remappings, the
>>> >>       hard part is all done offline, and the adjustments to the CRUSH
>>> >>       mapping
>>> >>       calculation itself (storing and making use of the adjusted
>>> >>       weights for
>>> >>       each round of placement) are relatively straightforward.  And
>>> >>       the sooner
>>> >>       this is incorporated into a release the sooner real users will
>>> >>       be able to
>>> >>       roll out code to all clients and start making use of it.
>>> >>
>>> >>       Thanks again for looking at this problem!  I'm excited that we
>>> >>       may be
>>> >>       closing in on a real solution!
>>> >>
>>> >>       sage
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >>
>>> >>       > There are a lot of gradient-free methods. I will first try to
>>> >>       run the
>>> >>       > ones available using just scipy
>>> >>       > (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> >>       > Some of them don't require the gradient and some of them can
>>> >>       estimate
>>> >>       > it. The reason to go without the gradient is to run the CRUSH
>>> >>       > algorithm as a black box. In that case this would be the
>>> >>       pseudo-code:
>>> >>       >
>>> >>       > - BEGIN CODE -
>>> >>       > def build_target(desired_freqs):
>>> >>       >     def target(weights):
>>> >>       >         # run a simulation of CRUSH for a number of objects
>>> >>       >         sim_freqs = run_crush(weights)
>>> >>       >         # Kullback-Leibler divergence between desired
>>> >>       >         # frequencies and current ones
>>> >>       >         return loss(sim_freqs, desired_freqs)
>>> >>       >     return target
>>> >>       >
>>> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> >>       > - END CODE -
>>> >>       >
>>> >>       > The tricky thing here is that this procedure can be slow if
>>> >>       the
>>> >>       > simulation (run_crush) needs to place a lot of objects to get
>>> >>       accurate
>>> >>       simulated frequencies. This is true especially if the minimize
>>> >>       method
>>> >>       > attempts to approximate the gradient using finite differences
>>> >>       since it
>>> >>       > will evaluate the target function a number of times
>>> >>       proportional to
>>> >>       > the number of weights. Apart from the ones in scipy I would
>>> >>       try also
>>> >>       > optimization methods that try to perform as few evaluations as
>>> >>       > possible like for example HyperOpt
>>> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>> >>       into
>>> >>       > account that the target function can be noisy.
>>> >>       >
>>> >>       > This black box approximation is simple to implement and makes
>>> >>       the
>>> >>       > computer do all the work instead of us.
>>> >>       > I think that this black box approximation is worth trying
>>> >>       even if
>>> >>       > it's not the final one because if this approximation works
>>> >>       then we
>>> >>       > know that a more elaborate one that computes the gradient of
>>> >>       the CRUSH
>>> >>       > algorithm will work for sure.
>>> >>       >
>>> >>       > I can try this black box approximation this weekend not on the
>>> >>       real
>>> >>       > CRUSH algorithm but with the simple implementation I did in
>>> >>       python. If
>>> >>       > it works it's just a matter of substituting one simulation
>>> >>       with
>>> >>       > another and see what happens.
>>> >>       >
>>> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>:
>>> >>       > > Hi Pedro,
>>> >>       > >
>>> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >>       > >> Hi Loic,
>>> >>       > >>
>>> >>       > >>>From what I see everything seems OK.
>>> >>       > >
>>> >>       > > Cool. I'll keep going in this direction then !
>>> >>       > >
>>> >>       > >> The interesting thing would be to
>>> >>       > >> test on some complex mapping. The reason is that
>>> >>       "CrushPolicyFamily"
>>> >>       > >> is right now modeling just a single straw bucket not the
>>> >>       full CRUSH
>>> >>       > >> algorithm.
>>> >>       > >
>>> >>       > > A number of use cases use a single straw bucket, maybe the
>>> >>       majority of them. Even though it does not reflect the full range
>>> >>       of what crush can offer, it could be useful. To be more
>>> >>       specific, a crush map that states "place objects so that there
>>> >>       is at most one replica per host" or "one replica per rack" is
>>> >>       common. Such a crushmap can be reduced to a single straw bucket
>>> >>       that contains all the hosts and by using the CrushPolicyFamily,
>>> >>       we can change the weights of each host to fix the probabilities.
>>> >>       The hosts themselves contain disks with varying weights but I
>>> >>       think we can ignore that because crush will only recurse to
>>> >>       place one object within a given host.
>>> >>       > >
>>> >>       > >> That's the work that remains to be done. The only way that
>>> >>       > >> would avoid reimplementing the CRUSH algorithm and
>>> >>       computing the
>>> >>       > >> gradient would be treating CRUSH as a black box and
>>> >>       eliminating the
>>> >>       > >> necessity of computing the gradient either by using a
>>> >>       gradient-free
>>> >>       > >> optimization method or making an estimation of the
>>> >>       gradient.
>>> >>       > >
>>> >>       > > By gradient-free optimization you mean simulated annealing
>>> >>       or Monte Carlo ?
>>> >>       > >
>>> >>       > > Cheers
>>> >>       > >
>>> >>       > >>
>>> >>       > >>
>>> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@xxxxxxxxxxx>:
>>> >>       > >>> Hi,
>>> >>       > >>>
>>> >>       > >>> I modified the crush library to accept two weights (one
>>> >>       for the first disk, the other for the remaining disks)[1]. This
>>> >>       really is a hack for experimentation purposes only ;-) I was
>>> >>       able to run a variation of your code[2] and got the following
>>> >>       results which are encouraging. Do you think what I did is
>>> >>       sensible ? Or is there a problem I don't see ?
>>> >>       > >>>
>>> >>       > >>> Thanks !
>>> >>       > >>>
>>> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>> >>       > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >>       > >>> Before: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 1.39e-01 1.12e-01
>>> >>       > >>>  disk 1: 1.11e-01 1.10e-01
>>> >>       > >>>  disk 2: 8.33e-02 1.13e-01
>>> >>       > >>>  disk 3: 1.39e-01 1.11e-01
>>> >>       > >>>  disk 4: 1.11e-01 1.11e-01
>>> >>       > >>>  disk 5: 8.33e-02 1.11e-01
>>> >>       > >>>  disk 6: 1.39e-01 1.12e-01
>>> >>       > >>>  disk 7: 1.11e-01 1.12e-01
>>> >>       > >>>  disk 8: 8.33e-02 1.10e-01
>>> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>> >>       > >>> ...
>>> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>> >>       > >>> Converged to desired accuracy :)
>>> >>       > >>> After: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 1.39e-01 1.42e-01
>>> >>       > >>>  disk 1: 1.11e-01 1.09e-01
>>> >>       > >>>  disk 2: 8.33e-02 8.37e-02
>>> >>       > >>>  disk 3: 1.39e-01 1.40e-01
>>> >>       > >>>  disk 4: 1.11e-01 1.13e-01
>>> >>       > >>>  disk 5: 8.33e-02 8.08e-02
>>> >>       > >>>  disk 6: 1.39e-01 1.38e-01
>>> >>       > >>>  disk 7: 1.11e-01 1.09e-01
>>> >>       > >>>  disk 8: 8.33e-02 8.48e-02
>>> >>       > >>>
>>> >>       > >>>
>>> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>> >>       > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >>       > >>> Before: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 2.44e-01 2.36e-01
>>> >>       > >>>  disk 1: 2.44e-01 2.38e-01
>>> >>       > >>>  disk 2: 2.44e-01 2.34e-01
>>> >>       > >>>  disk 3: 2.44e-01 2.38e-01
>>> >>       > >>>  disk 4: 2.44e-02 5.37e-02
>>> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>> >>       > >>> ...
>>> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>> >>       > >>> Converged to desired accuracy :)
>>> >>       > >>> After: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 2.44e-01 2.46e-01
>>> >>       > >>>  disk 1: 2.44e-01 2.44e-01
>>> >>       > >>>  disk 2: 2.44e-01 2.41e-01
>>> >>       > >>>  disk 3: 2.44e-01 2.45e-01
>>> >>       > >>>  disk 4: 2.44e-02 2.33e-02
>>> >>       > >>>
>>> >>       > >>>
>>> >>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> >>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>> >>       > >>>
>>> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >>       > >>>> Hi Pedro,
>>> >>       > >>>>
>>> >>       > >>>> It looks like trying to experiment with crush won't work
>>> >>       as expected because crush does not distinguish the probability
>>> >>       of selecting the first device from the probability of selecting
>>> >>       the second or third device. Am I mistaken ?
>>> >>       > >>>>
>>> >>       > >>>> Cheers
>>> >>       > >>>>
>>> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >>       > >>>>> Hi Pedro,
>>> >>       > >>>>>
>>> >>       > >>>>> I'm going to experiment with what you did at
>>> >>       > >>>>>
>>> >>       > >>>>>
>>> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >>       > >>>>>
>>> >>       > >>>>> and the latest python-crush published today. A
>>> >>       comparison function was added that will help measure the data
>>> >>       movement. I'm hoping we can release an offline tool based on
>>> >>       your solution. Please let me know if I should wait before diving
>>> >>       into this, in case you have unpublished drafts or new ideas.
>>> >>       > >>>>>
>>> >>       > >>>>> Cheers
>>> >>       > >>>>>
>>> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >>       > >>>>>> Great, thanks for the clarifications.
>>> >>       > >>>>>> I also think that the most natural way is to keep just
>>> >>       a set of
>>> >>       > >>>>>> weights in the CRUSH map and update them inside the
>>> >>       algorithm.
>>> >>       > >>>>>>
>>> >>       > >>>>>> I keep working on it.
>>> >>       > >>>>>>
>>> >>       > >>>>>>
>>> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>> >>       <sage@xxxxxxxxxxxx>:
>>> >>       > >>>>>>> Hi Pedro,
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>>> >>       problem and we
>>> >>       > >>>>>>> haven't made much headway.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >>       > >>>>>>>> Hi,
>>> >>       > >>>>>>>>
>>> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>>> >>       much but I have
>>> >>       > >>>>>>>> been thinking about it. In order to adapt the
>>> >>       previous algorithm in
>>> >>       > >>>>>>>> the python notebook I need to substitute the
>>> >>       iteration over all
>>> >>       > >>>>>>>> possible device permutations with iteration over all
>>> >>       the possible
>>> >>       > >>>>>>>> selections that crush would make. That is the main
>>> >>       thing I need to
>>> >>       > >>>>>>>> work on.
>>> >>       > >>>>>>>>
>>> >>       > >>>>>>>> The other thing is of course that weights change for
>>> >>       each replica.
>>> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
>>> >>       map. So the
>>> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>>> >>       the map, need to be
>>> >>       > >>>>>>>> changed. The weights in the crush map should reflect
>>> >>       then, maybe, the
>>> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>>> >>       should have their own
>>> >>       > >>>>>>>> crush map, but then the information about the
>>> >>       previous selection
>>> >>       > >>>>>>>> should be passed to the next replica placement run so
>>> >>       it avoids
>>> >>       > >>>>>>>> selecting the same one again.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> My suspicion is that the best solution here (whatever
>>> >>       that means!)
>>> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
>>> >>       distribution, and
>>> >>       > >>>>>>> then generates a set of derivative weights--probably
>>> >>       one set for each
>>> >>       > >>>>>>> round/replica/rank.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> One nice property of this is that once the support is
>>> >>       added to encode
>>> >>       > >>>>>>> multiple sets of weights, the algorithm used to
>>> >>       generate them is free to
>>> >>       > >>>>>>> change and evolve independently.  (In most cases any
>>> >>       change in
>>> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>> >>       because all
>>> >>       > >>>>>>> parties participating in the cluster have to support
>>> >>       any new behavior
>>> >>       > >>>>>>> before it is enabled or used.)
>>> >>       > >>>>>>>
>>> >>       > >>>>>>>> I have a question also. Is there any significant
>>> >>       difference between
>>> >>       > >>>>>>>> the device selection algorithm description in the
>>> >>       paper and its final
>>> >>       > >>>>>>>> implementation?
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>>> >>       found to be a bad
>>> >>       > >>>>>>> idea; any collision or failed()/overload() case
>>> >>       triggers the
>>> >>       > >>>>>>> retry_descent.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> There are other changes, of course, but I don't think
>>> >>       they'll impact any
>>> >>       > >>>>>>> solution we come with here (or at least any solution
>>> >>       can be suitably
>>> >>       > >>>>>>> adapted)!
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> sage
>>> >>       > >>>>>
>>> >>       > >>>>
>>> >>       > >>>
>>> >>       > >>> --
>>> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >>       > >
>>> >>       > > --
>>> >>       > > Loïc Dachary, Artisan Logiciel Libre
>>> >>       >
>>> >>       >
>>> >>
>>> >>
>>> >>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>



