Re: crush multipick anomaly

On 02/20/2017 06:32 PM, Gregory Farnum wrote:
> On Mon, Feb 20, 2017 at 12:47 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>
>>
>> On 02/13/2017 03:53 PM, Gregory Farnum wrote:
>>> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends, and Pedro López-Adeva Fernández-Layos came up with a well-written analysis of the problem and a tentative solution, which he described at: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster, and objects are likely to move everywhere. Am I mistaken?
>>>
>>> Keep in mind that in the math presented, "all disks" for our purposes
>>> really means "all items within a CRUSH bucket" (at least, best I can
>>> tell). So if you reweight a disk, you have to recalculate weights
>>> within its bucket and within each parent bucket, but each bucket has a
>>> bounded size N so the calculation should remain feasible. I didn't
>>> step through the more complicated math at the end but it made
>>> intuitive sense as far as I went.
>>
>> When crush chooses the second replica, it ensures it does not land on the same host, rack, etc., depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf version) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host has been selected, the probability of each disk should not need to be modified, because there will never be a rejection at that level (i.e. no conditional probability).
> 
> Well, you'd have changed the number of disks, so you'd need to
> recalculate within the host that got a new disk added. And then you'd
> need to recalculate the host and its peer buckets, and if it was in a
> rack then the rack and its peer buckets, and on up the chain.

I meant to say that you do not need to change the weights of the disks within the other hosts. But you do need to change the weights of all other hosts, not just the host in which a disk was inserted or removed.
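A minimal sketch of why that follows (hypothetical unit weights, plain Python, not Pedro's corrected-weight computation): if each host's raw weight is the sum of its disks' weights, adding one disk changes the normalized selection probability of every host under the root, even though only one host's raw weight changed.

```python
# Hypothetical map: three hosts, two unit-weight disks each.
hosts = {"host1": [1.0, 1.0], "host2": [1.0, 1.0], "host3": [1.0, 1.0]}

def host_probs(hosts):
    """Normalized selection probability of each host under the root."""
    weights = {h: sum(disks) for h, disks in hosts.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

before = host_probs(hosts)   # every host: 2/6 ~= 0.333
hosts["host1"].append(1.0)   # insert one new disk into host1
after = host_probs(hosts)    # host1: 3/7, host2 and host3 drop to 2/7
```

So even hosts whose disks were untouched see their probability move from 2/6 to 2/7; only the disks *inside* the other hosts keep their relative probabilities.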

> 
>>
>> If the failure domain is the host I think the crush map should be something like:
>>
>> root:
>>    host1:
>>      disk1
>>      disk2
>>    host2:
>>      disk3
>>      disk4
>>    host3:
>>      disk5
>>      disk6
>>
>> Introducing racks such as in:
>>
>> root:
>>  rack0:
>>    host1:
>>      disk1
>>      disk2
>>    host2:
>>      disk3
>>      disk4
>>  rack1:
>>    host3:
>>      disk5
>>      disk6
>>
>> is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder.
> 
> Well, there's not much point if you're replicating across hosts, since
> the rack layer is very unbalanced here. But that's essentially a
> misconfiguration which is going to cause problems with any CRUSH-like
> system.
> 
> 
>> Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to be.
> 
> I think maybe you're saying what I did before? "All disks" for our
> purposes really means "all items within a CRUSH bucket". The racks are
> CRUSH items within the root bucket.
> -Greg
> 

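For readers following the thread, the multipick anomaly under discussion can be sketched numerically (hypothetical weights, plain Python): when the second replica is drawn by rejecting the item already chosen, the distribution of the second pick no longer matches the raw weights, which is why the corrected weights depend on all siblings in the bucket.

```python
# Sketch of the multipick anomaly with hypothetical weights.
# First replica: item i is drawn with probability w_i / total.
# Second replica: given first pick i, item j != i is drawn with
# probability w_j / (total - w_i), i.e. a conditional probability.

weights = {"host1": 3.0, "host2": 2.0, "host3": 1.0}
total = sum(weights.values())
p_first = {k: w / total for k, w in weights.items()}

# P(j is the second pick) = sum over i != j of
#   P(first = i) * w_j / (total - w_i)
p_second = {
    j: sum(p_first[i] * weights[j] / (total - weights[i])
           for i in weights if i != j)
    for j in weights
}

# p_first  : host1 0.500, host2 0.333, host3 0.167
# p_second : host1 0.350, host2 0.400, host3 0.250
```

The second-replica shares (0.35 / 0.40 / 0.25) visibly diverge from the raw-weight shares (0.50 / 0.33 / 0.17), and each entry of `p_second` involves every sibling's weight, matching the observation that adjusting one weight affects all items in the bucket.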
-- 
Loïc Dachary, Artisan Logiciel Libre