Hi Loic,

We already noticed this problem before and found that it is not easy to
solve. The problem occurs not only in the collision case, but also in
the rejection path when is_out is called:

    static int is_out(const struct crush_map *map,
                      const __u32 *weight, int weight_max,
                      int item, int x)
    {
        if (item >= weight_max)
            return 1;
        if (weight[item] >= 0x10000)
            return 0;
        if (weight[item] == 0)
            return 1;
        if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
            < weight[item])
            return 0;
        return 1;
    }

That means it is triggered whenever we run ceph osd out [osd.id], when
an osd is marked out automatically because it is down, or when we use
reweight-by-utilization to alter the reweight of an osd. I think the
same approach Sage mentioned could be used to solve this problem as
well.

Regards
Ning Yao

2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@xxxxxxxxxxx>:
>
>
> On 05/08/2017 03:42 PM, Sage Weil wrote:
>> On Mon, 8 May 2017, Loic Dachary wrote:
>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> When choosing the second replica, crush_bucket_choose picks an item
>>>>> and crush_choose_{indep,firstn} checks if it has already been chosen
>>>>> for the first replica. If so, it is discarded as a collision[1], r is
>>>>> modified and another attempt is made to get a different item. If no
>>>>> value is found after choose_tries attempts, the mapping is
>>>>> incomplete.
>>>>>
>>>>> Another way to do the same would be for crush_bucket_choose /
>>>>> bucket_straw2_choose[2] to ignore the items that have already been
>>>>> chosen. It could be done by looping over the list, but that would
>>>>> mean N (number of replicas) lookups for each item.
>>>>>
>>>>> The current implementation is optimized for the cases where
>>>>> collisions are rare. However, when the weights of the items are two
>>>>> orders of magnitude apart or more, choose_tries has to be set to a
>>>>> very large value (more than 1000) for the mapping to succeed.
>>>>> In practice that is not a problem, as it is unlikely that a host is
>>>>> 100 times bigger than another host ;-)
>>>>>
>>>>> When fixing an uneven distribution[3], CRUSH weights are sometimes
>>>>> set to values with that kind of difference (1 against 100) to
>>>>> compensate for the probability bias and/or a small number of samples.
>>>>> For instance when there are 5 hosts with weights 5 1 1 1 1, modifying
>>>>> the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01
>>>>> with choose_tries 2000. The value of choose_tries has to be increased
>>>>> a lot for a gain that gets smaller and smaller, while the CPU usage
>>>>> goes up. In this situation, an alternative implementation of the
>>>>> CRUSH collision handling seems a good idea.
>>>>>
>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>> could be called with the list of already chosen items and loop over
>>>>> them to ensure none of them is a candidate. The result will always be
>>>>> correct but the implementation is more expensive. The default
>>>>> choose_tries could even be set to a lower value (19 instead of 50?)
>>>>> since it is likely more expensive to handle 30 more collisions than
>>>>> to loop over each item already selected.
>>>>>
>>>>> What do you think?
>>>>
>>>> I think this direction is promising! The problem is that I think it's
>>>> not quite as simple as you suggest, since you may be choosing over
>>>> multiple levels of a hierarchy. If the weight tree is something like
>>>>
>>>>          4
>>>>        /   \
>>>>       2     2
>>>>      / \   / \
>>>>     1   1 1   1
>>>>     a   b c   d
>>>>
>>>> and you chose a, then yes, if you get back into the left branch you
>>>> can filter it out of the straw2 selections. And num_rep is usually
>>>> small so that won't be expensive. But you also need the first choice
>>>> at the top level of the hierarchy to weight the left *tree* with 1
>>>> instead of 2.
>>>
>>> I don't understand why this is necessary?
>>> Here are the scenarios I have in mind:
>>>
>>>             /          \
>>>            /            \
>>>      r1 4                r2 4      rack (failure domain)
>>>      /   \               /   \
>>>     2     2             2     2    host
>>>    / \   / \           / \   / \
>>>   1   1 1   1         1   1 1   1  device
>>>   a   b c   d         e   f g   h
>>>
>>> Say value 10 ends up in a the first time: it first went through rack
>>> r1, which is the failure domain. If value 10 also ends up in r1 the
>>> second time, straw2 will skip/collide it at that level because r1 is
>>> stored in out while a is stored in out2.
>>>
>>> The only case I can think of that requires the collision to be
>>> resolved at a higher hierarchical level is when there is no
>>> alternative:
>>>
>>>             /          \
>>>            /            \
>>>      r1 4                r2 4      rack
>>>      /                   /   \
>>>    h1 2                 2     2    host (failure domain)
>>>     /                  / \   / \
>>>    1                  1   1 1   1  device
>>>    a                  e   f g   h
>>>
>>> If 10 ends up in h1 the first time and the second time, it will
>>> collide because there is no alternative. It will then retry_descent,
>>> ftotal increases, which goes into r, and it gets another chance at
>>> landing on a host that's not h1.
>>
>> The problem isn't when choose[leaf] is specifying the intermediate
>> level (rack or host in your examples); it's when there is an
>> intervening level that a single choose is crossing. In your first
>> tree,
>>
>>              root 6
>>             /          \
>>            /            \
>>      r1 4                r2 4      rack
>>      /   \               /   \
>>   h1 2   h2 2         h3 2   h4 2  host (failure domain)
>>    / \   / \           / \   / \
>>   1   1 1   1         1   1 1   1  device
>>   a   b c   d         e   f g   h
>>
>> let's say the failure domain is the host instead of the rack. If we
>> choose a (h1) the first time, for subsequent descents from the root we
>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>> host (h2). This normally is fine because we reject the entire descent
>> for 50% of the r1 choices, so *effectively* r1's weight is only 2
>> (it's as if the other attempts never happened). But with the smarter
>> straw2 you choose h2 100% of the time for r1 and you'll end up with 2x
>> more items in that second position on h2 than you want.
>>
>> Does that make sense?
>
> Absolutely, thanks for explaining. The easier route, as far as getting
> an even distribution is concerned, seems to be to find a way to
> calculate when a combination of weights (5 1 1 1 1 with 2 replicas for
> instance) cannot be evenly distributed.
>
> Cheers
>
>>
>> sage
>>
>>
>>>
>>> I must be missing a use case :-)
>>>
>>>
>>>>
>>>> I think this could be done by adjusting the hierarchical weights as
>>>> you go (and I think one of Adam's early passes at the multipick
>>>> problem did something similar), but it's a bit more complex.
>>>>
>>>> It seems worth pursuing, though!
>>>>
>>>> And dynamically doing this only after the first N 'normal' attempts
>>>> fail seems like a good way to avoid slowing down the common path (no
>>>> collisions). I suspect the optimal N is probably closer to 5 than 19,
>>>> though?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> P.S. Note that even in this border case, modifying the weights to
>>>>> 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution
>>>>> (twice better instead of ten times better). It just cannot do better
>>>>> because it hits a limitation of the current CRUSH implementation.
>>>>> But it looks like it is not too difficult to fix.
>>>>>
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre