On Mon, Feb 6, 2017 at 10:13 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi Loic,
>
> Here's my current understanding of the problem. (Below I work with the
> example having four OSDs with weights 3, 3, 3, 1, respectively.)
>
> I'm elaborating on the observation that for every replication "round",
> the PG ratios for each and every OSD must be equal to the "target" or
> goal weight of that OSD. So, for an OSD that should get 10% of PGs,
> that OSD gets 10% in round 1, 10% in round 2, etc. But we need to
> multiply each of these ratios by the probability that this OSD is
> still available in Round r.
>
> Hence I believe we have this loop invariant:
>
>   P(OSD.x still available in Round r) * (Weight of OSD.x in Round r)
>     / (Total sum of all weights in Round r)
>   == (Original "target" Weight of OSD.x) / (Total sum of all target weights)
>
> I simplify all these terms:
>
>   P(OSD.x still available in Round r)  = P_x_r
>   Weight of OSD.x in Round r           = W_x_r
>   Total sum of all weights in Round r  = T_r
>   Original "target" Weight of OSD.x    = W_x
>   Total sum of all target weights      = T
>
> So, rewriting the equation, we have:
>
>   P_x_r * W_x_r / T_r == W_x / T
>
> We then calculate the needed weight of OSD.x in Round r -- W_x_r is
> what we're trying to solve for!
>
>   W_x_r = W_x / T * T_r / P_x_r
>
> The first term W_x / T is a constant and easy to compute. (For my
> example small OSD, W_x / T = 0.1.)
>
> P_x_r is also -- I believe -- simple to compute. P_x_r gets smaller
> each round and is a function of what happened in the previous round:
>
>   Round 1: P_x_1 = 1.0
>   Round 2: P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
>   Round 3: P_x_3 = P_x_2 * (1 - W_x_2 / T_2)
>   ...
>
> But T_r is a challenge -- T_r is the sum of W_x_r over all x in
> Round r. Hence, the problem is that we don't know T_r until *after*
> we compute all the W_x_r's for that round. I tried various ways to
> estimate T_r but didn't make any progress.
>
> Do you think this formulation is correct? Any clever ideas where to go next?

Something is wrong, because the system of equations this gives is
unsolvable. In round 2 for the 3, 3, 3, 1 OSD set, assuming the first
weight-3 OSD (OSD.0) was chosen in the first round, we have:

  P_1_2 = (1 - 3/10) = 0.7
  P_2_2 = (1 - 3/10) = 0.7
  P_3_2 = (1 - 1/10) = 0.9

And we know:

  W_1 / T = 3/10 = 0.3
  W_2 / T = 3/10 = 0.3
  W_3 / T = 1/10 = 0.1

So we can describe the whole round:

  W_1_2 = W_1 / T * T_2 / P_1_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
  W_2_2 = W_2 / T * T_2 / P_2_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
  W_3_2 = W_3 / T * T_2 / P_3_2 = 0.1 * T_2 / 0.9 = 0.1111 T_2

  W_1_2 + W_2_2 + W_3_2 = T_2

Putting this all into a solver gives 0.9683 * T_2 = T_2, which is
nonsense.

-- Dan
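
A few lines of Python make the over-constraint concrete. This is an
illustrative sketch (not a script from the thread); it just re-derives
the coefficients above and checks whether they can sum to 1:

    # Round 2 of Dan's formulation for the 3, 3, 3, 1 example, after
    # OSD.0 (weight 3) was picked in round 1. Each W_x_2 is a linear
    # multiple of the unknown T_2:  W_x_2 = (W_x / T) / P_x_2 * T_2.
    targets = {1: 3, 2: 3, 3: 1}   # remaining OSDs and their target weights
    T = 10                         # total target weight (3 + 3 + 3 + 1)

    # P_x_2: probability OSD.x is still available in round 2, given
    # that the round-1 weights equal the target weights.
    P = {x: 1.0 - w / T for x, w in targets.items()}

    coeffs = {x: (w / T) / P[x] for x, w in targets.items()}
    print(coeffs)                # ~ {1: 0.4286, 2: 0.4286, 3: 0.1111}

    # Consistency requires W_1_2 + W_2_2 + W_3_2 == T_2, i.e. the
    # coefficients must sum to exactly 1:
    print(sum(coeffs.values())) # ~ 0.9683 -- no T_2 can satisfy the system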
> Cheers, Dan
>
>
> On Mon, Feb 6, 2017 at 9:31 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Dan,
>>
>> Your script turns out to be a nice self-contained problem statement :-)
>> Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the
>> way Szymon described how to calculate P(E|A) using a probability tree
>> (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
>>
>> Cheers
>>
>> On 02/03/2017 06:37 PM, Dan van der Ster wrote:
>>> Anyway, here's my simple simulation. It might be helpful for testing
>>> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>>
>>> Below is the output using the P(pick small | first pick not small)
>>> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
>>> seems to *almost* work, but only when we have just one small OSD.
>>>
>>> See the end of the script for various other ideas.
>>>
>>> -- Dan
>>>
>>>> python mpa.py
>>> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>>>
>>> Expected PGs per OSD: {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>>>
>>> Simulating with existing CRUSH
>>>
>>> Observed: {0: 85944, 1: 85810, 2: 85984, 3: 42262}
>>> Observed for Nth replica: [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
>>> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
>>> 26882, 3: 19455}]
>>>
>>> Now trying your new algorithm
>>>
>>> Observed: {0: 89423, 1: 89443, 2: 89476, 3: 31658}
>>> Observed for Nth replica: [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
>>> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
>>> 29875, 3: 11394}]
>>>
>>>
>>> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they
>>>>>>>> should. Loic's recent activity resurrected discussion on the
>>>>>>>> original PR
>>>>>>>>
>>>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving
>>>>>>>> the discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights
>>>>>>>> that works (almost perfectly) for the 2nd round of placements.
>>>>>>>> The solution is pretty simple, although as with most
>>>>>>>> probabilities it tends to make my brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for
>>>>>>>> the small OSD (call it P(pick small)) isn't what we should use.
>>>>>>>> Instead, we want P(pick small | first pick not small). Since
>>>>>>>> P(a|b) (the probability of a given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> For the record, this is explained at
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>>   = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>>   P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes
>>>>>>> A && B (using a non-ASCII symbol...) as the "probability of the
>>>>>>> joint of events A and B". I don't understand what that means. Is
>>>>>>> there a definition somewhere?
>>>>>>>
>>>>>>>> For example, if small has 1/10 the weight, then we should see
>>>>>>>> 1/10th of the PGs have their second replica be the small OSD. So
>>>>>>>>
>>>>>>>>   P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>>   = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>   = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>   = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> That is, on the second round, we should adjust the weights by
>>>>>>>> the above so that we get the right distribution of second
>>>>>>>> choices. It turns out it works to adjust *all* weights like this
>>>>>>>> to get the conditional probability that they weren't already
>>>>>>>> chosen.
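
To make the derivation concrete, here is a short sketch (illustrative
only, not code from the thread) that checks the conditional probability
for the 3, 3, 3, 1 example and applies the same adjustment to all OSDs:

    # Check the conditional-probability argument for small_weight = 1,
    # total_weight = 10 (the 3, 3, 3, 1 example).
    small_weight = 1.0
    total_weight = 10.0

    p_first_not_small = (total_weight - small_weight) / total_weight  # 0.9
    p_small_given = small_weight / (total_weight - small_weight)      # 1/9

    # The joint probability should reproduce the target share of
    # second replicas landing on the small OSD:
    print(p_first_not_small * p_small_given)  # ~ 0.1 == small_weight / total_weight

    # Applying the same adjustment w -> w / (total - w) to *all* OSDs
    # gives the relative second-round weights:
    weights = {0: 3.0, 1: 3.0, 2: 3.0, 3: 1.0}
    adjusted = {osd: w / (total_weight - w) for osd, w in weights.items()}
    print(adjusted)  # ~ {0: 0.4286, 1: 0.4286, 2: 0.4286, 3: 0.1111}

As the follow-ups below discuss, this adjustment appears to be only
approximately right once the third replica is considered, so treat
this as the second-round case.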
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>>     double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need "* bucketw" at the end?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we
>>>>> don't lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the
>>>>> third replica to work and ended up doubting that this piece is
>>>>> correct. The ratio between the big and small OSDs in my
>>>>> [99 99 99 99 4] example varies slightly from what I would expect
>>>>> from first principles and what I get out of this derivation, by
>>>>> about 1%, which would explain the bias I was seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability
>>>>> background and loads of free time who can tackle this...
>>>>
>>>> I'm *not* that person, but I gave it a go last weekend and realized
>>>> a few things:
>>>>
>>>> 1. We should add the additional constraint that for all PGs
>>>> assigned to an OSD, 1/N of them must be primary replicas, 1/N must
>>>> be secondary, 1/N must be tertiary, etc., for N replicas/stripes.
>>>> E.g., for a 3-replica pool, the "small" OSD should still have the
>>>> property that 1/3 are primaries, 1/3 are secondary, 1/3 are
>>>> tertiary.
>>>>
>>>> 2. I believe this is a case of the balls-into-bins problem -- we
>>>> have colored balls and weighted bins. I didn't find a formulation
>>>> of the problem where the goal is to allow users to specify weights
>>>> which must be respected after N rounds.
>>>>
>>>> 3. I wrote some quick Python to simulate different reweighting
>>>> algorithms. The solution is definitely not obvious -- I often
>>>> thought I'd solved it (e.g. for simple OSD weight sets like
>>>> 3, 3, 3, 1), but changing the OSD weights to e.g. 3, 3, 1, 1
>>>> completely broke things. I can clean up and share that Python if it
>>>> can help.
>>>>
>>>> My gut feeling is that because CRUSH trees and rulesets can be
>>>> arbitrarily complex, the most pragmatic & reliable way to solve
>>>> this problem is to balance the PGs with a reweight-by-pg loop at
>>>> crush compilation time. This is what admins should do now -- we
>>>> should just automate it.
>>>>
>>>> Cheers, Dan
>>>>
>>>> P.S.
>>>> -- maybe these guys can help: http://math.stackexchange.com/
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
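
To make the reweight-by-pg suggestion above concrete, here is a
minimal, self-contained sketch. It is illustrative only: simulate() is
a toy weighted sampler standing in for the real CRUSH mapper, and the
update rule is just a simple multiplicative nudge.

    import random

    # Toy stand-in for CRUSH: pick n_replicas distinct OSDs per PG by
    # weighted sampling without replacement. A real reweight-by-pg loop
    # would invoke the actual CRUSH mapper instead.
    def simulate(weights, n_pgs=30000, n_replicas=3):
        counts = dict.fromkeys(weights, 0)
        for _ in range(n_pgs):
            remaining = dict(weights)
            for _ in range(n_replicas):
                osds = list(remaining)
                ws = [remaining[o] for o in osds]
                pick = random.choices(osds, weights=ws)[0]
                counts[pick] += 1
                del remaining[pick]
        return counts

    # Iteratively nudge each OSD's weight until its observed share of
    # PG replicas matches its target share.
    def reweight_by_pg(targets, iterations=8):
        total = sum(targets.values())
        weights = dict(targets)
        for _ in range(iterations):
            counts = simulate(weights)
            placed = sum(counts.values())
            for osd in weights:
                observed = counts[osd] / placed
                goal = targets[osd] / total
                weights[osd] *= goal / observed
        return weights

    print(reweight_by_pg({0: 3, 1: 3, 2: 3, 3: 1}))
    # The small OSD's weight settles below its nominal 1.0, compensating
    # for the multi-pick bias that gives it too many PGs otherwise.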