Hi Loic,

We already noticed this problem before and found that it is not easy to
solve. The problem occurs not only in the collision case, but also in
the rejection path when is_out is called:

    static int is_out(const struct crush_map *map,
                      const __u32 *weight, int weight_max,
                      int item, int x)
    {
        if (item >= weight_max)
            return 1;
        if (weight[item] >= 0x10000)
            return 0;
        if (weight[item] == 0)
            return 1;
        if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
            < weight[item])
            return 0;
        return 1;
    }

That means it is triggered whenever we run ceph osd out [osd.id], when
an osd is marked out automatically because it is down, or when we use
reweight-by-utilization to alter the reweight of an osd. I think the
same approach Sage mentioned could be used to solve this problem as
well.

Regards
Ning Yao

2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@xxxxxxxxxxx>:
>
>
> On 05/08/2017 03:42 PM, Sage Weil wrote:
>> On Mon, 8 May 2017, Loic Dachary wrote:
>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> When choosing the second replica, crush_bucket_choose picks an item
>>>>> and crush_choose_{indep,firstn} checks if it has already been chosen
>>>>> for the first replica. If so, it is discarded as a collision[1], r is
>>>>> modified and another attempt is made to get a different item. If no
>>>>> value is found after choose_tries attempts, the mapping is
>>>>> incomplete.
>>>>>
>>>>> Another way to do the same would be for crush_bucket_choose /
>>>>> bucket_straw2_choose[2] to ignore the items that have already been
>>>>> chosen. It could be done by looping over the list, but that would
>>>>> mean N (number of replicas) lookups for each item.
>>>>>
>>>>> The current implementation is optimized for the cases where
>>>>> collisions are rare. However, when the weights of the items are two
>>>>> orders of magnitude apart or more, choose_tries has to be set to a
>>>>> very large value (more than 1000) for the mapping to succeed.
>>>>> In practice that is not a problem, as it is unlikely that a host is
>>>>> 100 times bigger than another host ;-)
>>>>>
>>>>> When fixing an uneven distribution[3], CRUSH weights are sometimes
>>>>> set to values with that kind of difference (1 against 100) to
>>>>> compensate for the probability bias and/or a small number of samples.
>>>>> For instance when there are 5 hosts with weights 5 1 1 1 1, modifying
>>>>> the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01
>>>>> with choose_tries 2000. The value of choose_tries has to be increased
>>>>> a lot for a gain that gets smaller and smaller, while the CPU usage
>>>>> goes up. In this situation, an alternative implementation of the
>>>>> CRUSH collision handling seems a good idea.
>>>>>
>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>> could be called with the list of already chosen items and loop over
>>>>> them to ensure none of them is a candidate. The result will always be
>>>>> correct but the implementation is more expensive. The default
>>>>> choose_tries could even be set to a lower value (19 instead of 50?)
>>>>> since it is likely more expensive to handle 30 more collisions than
>>>>> to loop over each item already selected.
>>>>>
>>>>> What do you think?
>>>>
>>>> I think this direction is promising! The problem is that I think it's
>>>> not quite as simple as you suggest, since you may be choosing over
>>>> multiple levels of a hierarchy. If the weight tree is something like
>>>>
>>>>          4
>>>>        /   \
>>>>       2     2
>>>>      / \   / \
>>>>     1   1 1   1
>>>>     a   b c   d
>>>>
>>>> and you chose a, then yes, if you get back into the left branch you
>>>> can filter it out of the straw2 selections. And num_rep is usually
>>>> small so that won't be expensive. But you also need the first choice
>>>> at the top level of the hierarchy to weight the left *tree* with 1
>>>> instead of 2.
>>>
>>> I don't understand why this is necessary?
>>> Here are the scenarios I have in mind:
>>>
>>>             /          \
>>>            /            \
>>>      r1 4                r2 4      rack (failure domain)
>>>      /   \               /   \
>>>     2     2             2     2    host
>>>    / \   / \           / \   / \
>>>   1   1 1   1         1   1 1   1  device
>>>   a   b c   d         e   f g   h
>>>
>>> Say value 10 ends up in a the first time: it first went through rack
>>> r1, which is the failure domain. If value 10 also ends up in r1 the
>>> second time, straw2 will skip/collide it at that level because r1 is
>>> stored in out while a is stored in out2.
>>>
>>> The only case I can think of that requires the collision to be
>>> resolved at a higher hierarchical level is when there is no
>>> alternative:
>>>
>>>             /          \
>>>            /            \
>>>      r1 4                r2 4      rack
>>>      /                   /   \
>>>    h1 2                 2     2    host (failure domain)
>>>     /                  / \   / \
>>>    1                  1   1 1   1  device
>>>    a                  e   f g   h
>>>
>>> If 10 ends up in h1 the first time and the second time, it will
>>> collide because there is no alternative. It will then retry_descent,
>>> ftotal increases, which goes into r, and it gets another chance at
>>> landing on a host that's not h1.
>>
>> The problem isn't when choose[leaf] is specifying the intermediate
>> level (rack or host in your examples); it's when there is an
>> intervening level that a single choose is crossing. In your first
>> tree,
>>
>>              root 6
>>             /          \
>>            /            \
>>      r1 4                r2 4      rack
>>      /   \               /   \
>>   h1 2   h2 2         h3 2   h4 2  host (failure domain)
>>    / \   / \           / \   / \
>>   1   1 1   1         1   1 1   1  device
>>   a   b c   d         e   f g   h
>>
>> let's say the failure domain is the host instead of the rack. If we
>> choose a (h1) the first time, for subsequent descents from the root we
>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>> host (h2). This normally is fine because we reject the entire descent
>> for 50% of the r1 choices, so *effectively* r1's weight is only 2
>> (it's as if the other attempts never happened). But with the smarter
>> straw2 you choose h2 100% of the time for r1 and you'll end up with 2x
>> more items in that second position on h2 than you want.
>>
>> Does that make sense?
>
> Absolutely, thanks for explaining. The easier route, as far as getting
> an even distribution is concerned, seems to be to find a way to
> calculate when a combination of weights (5 1 1 1 1 with 2 replicas for
> instance) cannot be evenly distributed.
>
> Cheers
>
>>
>> sage
>>
>>
>>>
>>> I must be missing a use case :-)
>>>
>>>
>>>>
>>>> I think this could be done by adjusting the hierarchical weights as
>>>> you go (and I think one of Adam's early passes at the multipick
>>>> problem did something similar), but it's a bit more complex.
>>>>
>>>> It seems worth pursuing, though!
>>>>
>>>> And dynamically doing this only after the first N 'normal' attempts
>>>> fail seems like a good way to avoid slowing down the common path (no
>>>> collisions). I suspect the optimal N is probably closer to 5 than 19,
>>>> though?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> P.S. Note that even in this border case, modifying the weights to
>>>>> 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution
>>>>> (twice better instead of ten times better). It just cannot do better
>>>>> because it hits a limitation of the current CRUSH implementation.
>>>>> But it looks like it is not too difficult to fix.
>>>>>
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre