On 02/06/2017 03:11 PM, Jaze Lee wrote: > 2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@xxxxxxxxxxx>: >> Hi, >> >> On 02/06/2017 04:08 AM, Jaze Lee wrote: >>> It is more complicated than i have expected..... >>> I viewed http://tracker.ceph.com/issues/15653, and know that if the >>> replica number is >>> bigger than the host we choose, we may meet the problem. >>> >>> That is >>> if we have >>> host: a b c d >>> host: e f g h >>> host: i j k l >>> >>> we only choose one from each host for replica three, and the distribution >>> is as we expected? Right ? >>> >>> >>> The problem described in http://tracker.ceph.com/issues/15653, may happen >>> when >>> 1) >>> host: a b c d e f g >>> >>> and we choose all three replica from this host. But this is few happen >>> in production. Right? >>> >>> >>> May be i do not understand the problem correctly ? >> >> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script > > Yes, I mean why we choose three from one host ? Because it should work in this specific case. And also because this is a problem that shows in every situation, not just this specific situation. Cheers > In production the host > number is ALWAYS > more than replica number..... > > root > rack-0 > host A > host B > rack-1 > host C > host D > rack -2 > host E > host F > > when choose pg 1.1 for osd, it will always choose one from rack-0, one > from rack-1, one from rack-2. > any pg will cause one be choosed from rack-0, rack-1, rack-2. > > The problem is happened when we want to choose more than one osd from > a bucket for a pg, right ? > > > >> >> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992 >> >> Cheers >> >> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@xxxxxxxxxxx>: >>>> >>>> >>>> On 02/03/2017 04:08 PM, Loic Dachary wrote: >>>>> >>>>> >>>>> On 02/03/2017 03:47 PM, Sage Weil wrote: >>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote: >>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote: >>>>>>>> Hi Sage, >>>>>>>> >>>>>>>> Still trying to understand what you did :-) I have one question below. >>>>>>>> >>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote: >>>>>>>>> This is a longstanding bug, >>>>>>>>> >>>>>>>>> http://tracker.ceph.com/issues/15653 >>>>>>>>> >>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's >>>>>>>>> recent activity resurrected discussion on the original PR >>>>>>>>> >>>>>>>>> https://github.com/ceph/ceph/pull/10218 >>>>>>>>> >>>>>>>>> but since it's closed and almost nobody will see it I'm moving the >>>>>>>>> discussion here. >>>>>>>>> >>>>>>>>> The main news is that I have a simple adjustment for the weights that >>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is >>>>>>>>> pretty simple, although as with most probabilities it tends to make my >>>>>>>>> brain hurt. >>>>>>>>> >>>>>>>>> The idea is that, on the second round, the original weight for the small >>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want >>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a >>>>>>>>> given b) is P(a && b) / P(b), >>>>>>>> >>>>>>>> >From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition >>>>>>>> >>>>>>>>> >>>>>>>>> P(pick small | first pick not small) >>>>>>>>> = P(pick small && first pick not small) / P(first pick not small) >>>>>>>>> >>>>>>>>> The last term is easy to calculate, >>>>>>>>> >>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight >>>>>>>>> >>>>>>>>> and the && term is the distribution we're trying to produce. >>>>>>>> >>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ? >>>>>>>> >>>>>>>>> For exmaple, >>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have >>>>>>>>> their second replica be the small OSD. So >>>>>>>>> >>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight >>>>>>>>> >>>>>>>>> Putting those together, >>>>>>>>> >>>>>>>>> P(pick small | first pick not small) >>>>>>>>> = P(pick small && first pick not small) / P(first pick not small) >>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight) >>>>>>>>> = small_weight / (total_weight - small_weight) >>>>>>>>> >>>>>>>>> This is, on the second round, we should adjust the weights by the above so >>>>>>>>> that we get the right distribution of second choices. It turns out it >>>>>>>>> works to adjust *all* weights like this to get hte conditional probability >>>>>>>>> that they weren't already chosen. >>>>>>>>> >>>>>>>>> I have a branch that hacks this into straw2 and it appears to work >>>>>>>> >>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick >>>>>>> >>>>>>> In >>>>>>> >>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316 >>>>>>> >>>>>>> double neww = oldw / (bucketw - oldw) * bucketw; >>>>>>> >>>>>>> I don't get why we need "* bucketw" at the end ? >>>>>> >>>>>> It's just to keep the values within a reasonable range so that we don't >>>>>> lose precision by dropping down into small integers. >>>>>> >>>>>> I futzed around with this some more last week trying to get the third >>>>>> replica to work and ended up doubting that this piece is correct. The >>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies >>>>>> slightly from what I would expect from first principles and what I get out >>>>>> of this derivation by about 1%.. which would explain the bias I as seeing. >>>>>> >>>>>> I'm hoping we can find someone with a strong stats/probability background >>>>>> and loads of free time who can tackle this... >>>>>> >>>>> >>>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-) >>>> >>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id. >>>> >>>> We have D bins and each bin can hold D(B) balls. All balls have the >>>> same size. There is exactly X balls of the same color. Each ball must >>>> be placed in a bin that does not already contain a ball of the same >>>> color. >>>> >>>> What distribution guarantees that, for all X, the bins are filled in >>>> the same proportion ? >>>> >>>> Details >>>> ======= >>>> >>>> * One placement: all balls are the same color and we place each of them >>>> in a bin with a probability of: >>>> >>>> P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D]) >>>> >>>> so that bins are equally filled regardless of their capacity. >>>> >>>> * Two placements: for each ball there is exactly one other ball of the >>>> same color. A ball is placed as in experience 1 and the chosen bin >>>> is set aside. The other ball of the same color is placed as in >>>> experience 1 with the remaining bins. The probability for a ball >>>> to be placed in a given BIN is: >>>> >>>> P(BIN) + P(all bins but BIN | BIN) >>>> >>>> Examples >>>> ======== >>>> >>>> For instance we have 5 bins, a, b, c, d, e and they can hold: >>>> >>>> a = 10 million balls >>>> b = 10 million balls >>>> c = 10 million balls >>>> d = 10 million balls >>>> e = 1 million balls >>>> >>>> In the first experience with place each ball in >>>> >>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41 >>>> same for b, c, d >>>> e with a probability of 1 / 41 >>>> >>>> after 100,000 placements, the bins have >>>> >>>> a = 243456 >>>> b = 243624 >>>> c = 244486 >>>> d = 243881 >>>> e = 24553 >>>> >>>> they are >>>> >>>> a = 2.43 % full >>>> b = 2.43 % full >>>> c = 2.44 % full >>>> d = 2.43 % full >>>> e = 0.24 % full >>>> >>>> In the second experience >>>> >>>> >>>>>> sage >>>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the >>>>>>>>> current code, you get >>>>>>>>> >>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2 >>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2 >>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001 >>>>>>>>> device 0: 19765965 [9899364,9866601] >>>>>>>>> device 1: 19768033 [9899444,9868589] >>>>>>>>> device 2: 19769938 [9901770,9868168] >>>>>>>>> device 3: 19766918 [9898851,9868067] >>>>>>>>> device 6: 929148 [400572,528576] >>>>>>>>> >>>>>>>>> which is very close for the first replica (primary), but way off for the >>>>>>>>> second. With my hacky change, >>>>>>>>> >>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2 >>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001 >>>>>>>>> device 0: 19797315 [9899364,9897951] >>>>>>>>> device 1: 19799199 [9899444,9899755] >>>>>>>>> device 2: 19801016 [9901770,9899246] >>>>>>>>> device 3: 19797906 [9898851,9899055] >>>>>>>>> device 6: 804566 [400572,403994] >>>>>>>>> >>>>>>>>> which is quite close, but still skewing slightly high (by a big less than >>>>>>>>> 1%). >>>>>>>>> >>>>>>>>> Next steps: >>>>>>>>> >>>>>>>>> 1- generalize this for >2 replicas >>>>>>>>> 2- figure out why it skews high >>>>>>>>> 3- make this work for multi-level hierarchical descent >>>>>>>>> >>>>>>>>> sage >>>>>>>>> >>>>>>>>> -- >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Loïc Dachary, Artisan Logiciel Libre >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> >>>>> >>>> >>>> -- >>>> Loïc Dachary, Artisan Logiciel Libre >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> >> >> -- >> Loïc Dachary, Artisan Logiciel Libre > > > -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html