Re: understanding bucket_straw2_choose

Loic Dachary <loic@xxxxxxxxxxx> · Tue, 6 Jun 2017 18:47:32 +0200

On 06/06/2017 03:42 PM, Sage Weil wrote:
> On Tue, 6 Jun 2017, Loic Dachary wrote:
>> On 06/05/2017 08:13 PM, Sage Weil wrote:
>>> On Mon, 5 Jun 2017, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> The following lines of bucket_straw2_choose are part black magic to me and I would very much appreciate
>>>> pointers to better understand what they do.
>>>>
>>>> 			u = crush_hash32_3(bucket->h.hash, x, ids[i], r);
>>>> 			u &= 0xffff;
>>>>
>>>> 			/*
>>>> 			 * for some reason slightly less than 0x10000 produces
>>>> 			 * a slightly more accurate distribution... probably a
>>>> 			 * rounding effect.
>>>> 			 *
>>>> 			 * the natural log lookup table maps [0,0xffff]
>>>> 			 * (corresponding to real numbers [1/0x10000, 1] to
>>>> 			 * [0, 0xffffffffffff] (corresponding to real numbers
>>>> 			 * [-11.090355,0]).
>>>> 			 */
>>>> 			ln = crush_ln(u) - 0x1000000000000ll;
>>>>
>>>> I understand the general idea (I think) which is to hash the parameters 
>>>> and calculate the log of the hash so that it can be multiplied by the 
>>>> weight without overflowing. What I don't get is the details such as: why 
>>>> &= 0xffff ? I suppose that's because crush_ln is only capable of 
>>>> returning the log for this range of values.
>>>
>>> Yeah this is somewhat arbitrary, but probably because in general all of 
>>> these values are 16.16 fixed point and [0,0x10000] is [0,1.0].  It 
>>> probably also originally was picked because it set the size of the 
>>> original lookup table.  (crush_ln() is now a more complicated function 
>>> to approximate ln.)
>>>
>>>> If that's the case how come 
>>>> it does not hurt the hashing function (in the sense that it increases 
>>>> the likelyhood that values have very similar hash) because it discards 
>>>> the higher bits?
>>>
>>> Precision here isn't all that important; these values are just determining 
>>> relative weights between different choices, but even with a very imprecise 
>>> hash function it would all average out over a large number of inputs.
>>>
>>>> Also why is it necessary to subtract 0x1000000000000ll 
>>>> ? And why this number and not another one ?
>>>
>>> crush_ln gives you [0,0xffffffffffff]; the subtraction just makes it a 
>>> negative fixed-point value before doing the division (corresponding to 
>>> [-11,0]).  
>>
>> What kind of fixed-point value is this ? I understand Q16.16 represented
>> in a u32. But here we have s64 out of which 48 bits are used and this confuses me.
> 
> The real equation is
> 
>  straw length = ln(x) / weight
> 
> where x is some random value in (0,1].
> 
> You can think of x as a 16.16 fixed-point value stored in a 64-bit 
> integer, but it doesn't really matter... it's just that the crush_ln value 
> is expecting 16-bits of input to [1/0x10000,1].  That starting value 
> ln(.000015) happens to be about -11, and ln(1) is 0, so the output range 
> in real numbers is roughly [-11,0].  crush_ln(x) *actually* returns a 
> value from [0,0x1000000000000ll], so we put it in an s64 and subtract 
> 0x1000000000000ll so that it's a negative s.48 fixed-point value, we can 
> divide by a 16.16 fixed-point weight value, and get a result that has a 
> reasonable amount of precision left.
> 
> The ranges for these fixed-point values are semi-arbitrarily chosen, 
> partly to get reasonable precision, and partly because that's what made 
> sense to me at the time. For example the crush_ln() could have been 
> adjusted to return a value that was already a negative value; the reason 
> it looks like it does it because it was originally implemented as a simple 
> lookup table that mapped (0,1) as [0,0xffff] to [-11,0] as 
> [0,0xffffffffffff] and I did the adjustment back into "real" fixed-point 
> values in the caller.  I can't remember excatly why that made the 
> most sense at the time.  :)

It makes sense now !

>> I'm also confused by the fact that crush_ln(x) returns a positive number from which 0x1000000000000ll 
>> is subtracted to make it a negative number. log(x) where x is [0,1[ is a negative number already. 
>> Unless crush_ln does not really return the log but something related to it ?
>>
>> I think I'm lost in the real number representations :-)
>>
>>> Without shifting the values everything skews toward the low end 
>>> instead of the high end when you divide.

I'm not sure I understand why. 

If the hash is on 16 bits (&0xffff) multiplying it by the weight u32 which is in the range [0, 0xffff ffff]
will produce a result in the range [0, 0xffff ffff ffff]. The weights used in Ceph are all grouped around 0x10000 and unlikely to be in outside of the range [0x1000, 0x1000000]. But I don't see why it would cause a problem for the distribution.

I'm missing something important but I can't figure out what.

Thanks a lot for the explanations :-)

>>>
>>> sage
>>>
>>>
>>>>
>>>> There is no immediate urgency to answer these questions, of course. I'm 
>>>> asking because I worked on tests to verify the C implementation of the 
>>>> algorithm to balance buckets and got into trouble. I think the reason is 
>>>> because I used straw values that are outside of the possible range for 
>>>> ln. But maybe it is because they are in the range and their distribution 
>>>> is very unlikely. Or something else ... ;-)
>>>>
>>>> Cheers
>>>>
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html