On Mon, 8 May 2017, Loic Dachary wrote:
> On 05/08/2017 05:09 AM, Sage Weil wrote:
> > On Sun, 7 May 2017, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> When choosing the second replica, crush_bucket_choose picks an item and
> >> crush_choose_{indep,firstn} checks if it has already been chosen for the
> >> first replica. If so, it is discarded as a collision[1], r is modified
> >> and another attempt is made to get a different item. If no value is
> >> found after choose_tries attempts, the mapping is incomplete.
> >>
> >> Another way to do the same would be for crush_bucket_choose /
> >> bucket_straw2_choose[2] to ignore the items that have already been
> >> chosen. It could be done by looping over the list, but that would mean
> >> N (number of replicas) lookups for each item.
> >>
> >> The current implementation is optimized for the cases where collisions
> >> are rare. However, when the weights of the items are two orders of
> >> magnitude apart or more, choose_tries has to be set to a very large
> >> value (more than 1000) for the mapping to succeed. In practice that is
> >> not a problem, as it is unlikely that a host is 100 times bigger than
> >> another host ;-)
> >>
> >> When fixing an uneven distribution[3], CRUSH weights are sometimes set
> >> to values with that kind of difference (1 against 100) to compensate
> >> for the probability bias and/or a small number of samples. For instance,
> >> when there are 5 hosts with weights 5 1 1 1 1, modifying the weight set
> >> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries
> >> 2000. The value of choose_tries has to be increased a lot for a gain
> >> that gets smaller and smaller, and the CPU usage goes up. In this
> >> situation, an alternative implementation of CRUSH collision handling
> >> seems a good idea.
> >>
> >> Instead of failing after choose_tries attempts, crush_bucket_choose
> >> could be called with the list of already chosen items and loop over
> >> them to ensure none of them are candidates. The result will always be
> >> correct but the implementation more expensive. The default choose_tries
> >> could even be set to a lower value (19 instead of 50?) since it is
> >> likely more expensive to handle 30 more collisions than to loop over
> >> each item already selected.
> >>
> >> What do you think?
> >
> > I think this direction is promising!  The problem is that I think it's
> > not quite as simple as you suggest, since you may be choosing over
> > multiple levels of a hierarchy.  If the weight tree is something like
> >
> >          4
> >         / \
> >        2   2
> >       / \ / \
> >      1  1 1  1
> >      a  b c  d
> >
> > and you chose a, then yes, if you get back into the left branch you can
> > filter it out of the straw2 selections.  And num_rep is usually small so
> > that won't be expensive.  But you also need the first choice at the top
> > level of the hierarchy to weight the left *tree* with 1 instead of 2.
>
> I don't understand why this is necessary? Here are the scenarios I have
> in mind:
>
>                  / \
>                 /   \
>       r1   4        r2   4        rack (failure domain)
>           / \              / \
>         2     2          2     2      host
>        / \   / \        / \   / \
>       1   1 1   1      1   1 1   1    device
>       a   b c   d      e   f g   h
>
> Say value 10 ends up in a the first time: it first went through rack
> r1, which is the failure domain. If value 10 also ends up in r1 the
> second time, straw2 will skip/collide it at that level, because r1 is
> stored in out while a is stored in out2.
>
> The only case I can think of that requires the collision to be resolved
> at a higher hierarchical level is when there is no alternative.
>
>                  / \
>                 /   \
>       r1   4        r2   4        rack
>          /                 / \
>     h1  2                2     2      host (failure domain)
>        /                / \   / \
>       1                1   1 1   1    device
>       a                e   f g   h
>
> If 10 ends up in h1 the first time and again the second time, it will
> collide because there is no alternative. It will then retry_descent;
> ftotal increases, which goes into r, and it gets another chance at
> landing on a host that's not h1.

The problem isn't when choose[leaf] is specifying the intermediate level
(rack or host in your examples); it's when there is an intervening level
that a single choose is crossing.  In your first tree,

root               6
>                 / \
>                /   \
>      r1   4        r2   4        rack
>          / \              / \
>    h1  2   h2  2    h3  2   h4  2    host (failure domain)
>        / \   / \        / \   / \
>       1   1 1   1      1   1 1   1    device
>       a   b c   d      e   f g   h

let's say the failure domain is the host instead of the rack.  If we
choose a (h1) the first time, for subsequent descents from the root we
still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
host (h2).  Normally this is fine because we reject the entire descent
for 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's
as if the other attempts never happened).  But with the smarter straw2
we would choose h2 100% of the time for r1, and you'll end up with 2x
more items for that second position on h2 than you want.

Does that make sense?

sage

> I must be missing a use case :-)
>
> > I think this could be done by adjusting the hierarchical weights as you
> > go (and I think one of Adam's early passes at the multipick problem did
> > something similar), but it's a bit more complex.
> >
> > It seems worth pursuing, though!
> >
> > And dynamically doing this only after the first N 'normal' attempts fail
> > seems like a good way to avoid slowing down the common path (no
> > collisions).  I suspect the optimal N is probably closer to 5 than 19,
> > though?
> >
> > sage
> >
> >> Cheers
> >>
> >> P.S. Note that even in this border case, modifying the weights to 7.1,
> >> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice
> >> better instead of ten times better). But it cannot do better because it
> >> hits a limitation of the current CRUSH implementation. It looks like it
> >> is not too difficult to fix.
> >>
> >> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> >> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> >> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
>
> --
> Loïc Dachary, Artisan Logiciel Libre
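
For illustration, a minimal, self-contained sketch of the "skip items that
were already chosen" idea Loic describes above. It is not the actual
bucket_straw2_choose from src/crush/mapper.c: the mix32 hash, the flat
item/weight arrays and the floating-point draw are simplified stand-ins,
and only the exclusion logic is the point. It compiles with cc file.c -lm.

/*
 * Sketch (not real CRUSH code) of straw2-style selection that skips items
 * already chosen for earlier replicas instead of colliding and retrying.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* toy 32-bit mixer standing in for the real CRUSH hash */
static uint32_t mix32(uint32_t x, uint32_t a, uint32_t b)
{
    x ^= a * 0x9e3779b9u;
    x ^= (x >> 16);
    x *= 0x85ebca6bu;
    x ^= b * 0xc2b2ae35u;
    x ^= (x >> 13);
    return x;
}

static int already_chosen(int item, const int *out, int outpos)
{
    for (int i = 0; i < outpos; i++)   /* N lookups, N = replicas chosen so far */
        if (out[i] == item)
            return 1;
    return 0;
}

/*
 * Pick one item straw2-style (largest ln(u)/weight wins), ignoring any item
 * already present in out[0..outpos).  Returns -1 only if every item is
 * excluded or has zero weight.
 */
static int straw2_choose_excluding(const int *items, const double *weights,
                                   int size, uint32_t x, uint32_t bucket_id,
                                   const int *out, int outpos)
{
    double best = -INFINITY;
    int best_item = -1;

    for (int i = 0; i < size; i++) {
        if (weights[i] <= 0.0 || already_chosen(items[i], out, outpos))
            continue;                  /* never a candidate: no collision, no retry */
        uint32_t h = mix32(x, (uint32_t)items[i], bucket_id);
        double u = (h + 1.0) / 4294967296.0;   /* uniform in (0,1] */
        double draw = log(u) / weights[i];
        if (draw > best) {
            best = draw;
            best_item = items[i];
        }
    }
    return best_item;
}

int main(void)
{
    int items[] = {1, 2, 3, 4, 5};
    double weights[] = {8.9, 0.01, 0.01, 0.01, 0.01};  /* the extreme case above */
    int out[3], outpos = 0;

    for (int rep = 0; rep < 3; rep++) {
        int it = straw2_choose_excluding(items, weights, 5, 1234, 42,
                                         out, outpos);
        out[outpos++] = it;            /* always distinct, by construction */
        printf("replica %d -> item %d\n", rep, it);
    }
    return 0;
}

The already_chosen() scan costs O(outpos) per item, which is exactly the
"N lookups for each item" overhead mentioned above; only falling back to
this path after a few normal attempts fail, as Sage suggests, keeps the
common, collision-free case cheap.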
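
And a toy Monte Carlo sketch of Sage's 4-vs-4 argument. It is not CRUSH
code; it simply hard-codes the probabilities for the 8-device tree above,
assuming h1 already holds the first replica, the rack weights stay 4 vs 4,
and only the host-level choice filters h1 out.

/*
 * Toy illustration (not CRUSH itself): with the host as failure domain and
 * h1 already used, a second descent that still weighs r1 and r2 as 4 vs 4
 * and filters h1 only at the host level gives h2 twice as many second
 * replicas as h3 or h4.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long hits[5] = {0};                /* hits[2..4] count h2..h4 */
    const long trials = 1000000;

    srand(12345);
    for (long t = 0; t < trials; t++) {
        /* racks r1 and r2 keep their full weights (4 vs 4) */
        int rack = (rand() & 1) ? 1 : 2;
        int host;
        if (rack == 1)
            host = 2;                  /* h1 filtered out, only h2 is left */
        else
            host = (rand() & 1) ? 3 : 4;   /* h3 or h4, equal weight */
        hits[host]++;
    }
    for (int h = 2; h <= 4; h++)
        printf("h%d: %.3f of second replicas (fair share would be 0.333)\n",
               h, (double)hits[h] / trials);
    return 0;
}

With one million trials this prints roughly 0.5 for h2 and 0.25 each for
h3 and h4, i.e. the 2x skew on h2, whereas the fair share for each
surviving host is 1/3; lowering r1's effective weight to 2 for the second
descent, as Sage describes, is what would restore that split.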