Re: Straw2 Bucket related questions

Hi, 


Here is my understanding of the straw2 algorithm; maybe it will help:


Straw2 is a pseudo-random placement algorithm that decides, for each PG, which OSD stores it. The name probably comes from the phrase `drawing straws` https://en.wikipedia.org/wiki/Drawing_straws: to place a PG, every OSD draws a straw, and the OSD that gets the longest straw wins and stores the PG. Each OSD's straw is computed only from that OSD's own characteristics and has nothing to do with the other OSDs, so when an OSD is added or removed, only the minimum number of PGs have to move.
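
To make the "minimum movement" point concrete, here is a small Python sketch. It is only an illustration under my own assumptions, not the real mapper.c code: the `straw` and `choose` helpers are hypothetical, SHA-256 stands in for the CRUSH hash, and the OSD names and weights are made up. It places 10000 PGs on three OSDs, adds a fourth, and checks where the moved PGs went.

```python
import hashlib
import math


def straw(pg: int, osd: str, weight: float) -> float:
    """Hypothetical straw: depends only on the PG, this OSD's name and its weight."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    # Map the first 64 hash bits to a value u in (0, 1] so that log(u) is defined.
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
    return math.log(u) / weight


def choose(pg: int, weights: dict) -> str:
    """Return the OSD holding the longest straw for this PG."""
    return max(weights, key=lambda osd: straw(pg, osd, weights[osd]))


before = {"osd.0": 4.0, "osd.1": 4.0, "osd.2": 8.0}
after = dict(before, **{"osd.3": 8.0})  # add one new 8TB OSD

placements = [(choose(pg, before), choose(pg, after)) for pg in range(10000)]
moved = [(old, new) for old, new in placements if old != new]
moved_elsewhere = [(old, new) for old, new in moved if new != "osd.3"]

print("PGs moved:", len(moved))
print("PGs moved between old OSDs:", len(moved_elsewhere))
```

Because the straws of the existing OSDs do not change when osd.3 appears, a PG can only change owner if osd.3 draws the longest straw for it. So every moved PG lands on the new OSD, and the share that moves should be close to the new OSD's share of the total weight (8 of 24 here).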


As we know, a Ceph cluster can have OSDs with different disk sizes, for example some with 4TB disks and some with 8TB disks. How does straw2 distribute the PGs so that an 8TB OSD holds twice as many PGs as a 4TB OSD? Here is the brilliant part: each OSD has a `crush weight`, for example about 8 for an 8TB OSD and about 4 for a 4TB OSD. If each straw were drawn from a uniform distribution between 0 and the crush weight, the 8TB OSD would beat the 4TB OSD MORE THAN twice as often (three times as often for these two weights). But if each straw is ln(u) divided by the crush weight, where u is a uniform hash value in (0, 1], the straw is the negative of an exponentially distributed value whose lambda λ is the crush weight, and the 8TB OSD beats the 4TB OSD EXACTLY twice as often. The main code is here:


https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L334
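
The difference between the two straw rules can be checked with a quick simulation. This is only a sketch under my own assumptions: it uses Python's `random` module instead of the CRUSH hash, and the helper names (`win_ratio`, `uniform_straw`, `straw2_straw`) are made up for illustration. It lets a weight-8 OSD and a weight-4 OSD draw against each other many times and reports how many times more often the heavier one wins.

```python
import math
import random


def uniform_straw(rng: random.Random, weight: float) -> float:
    """Naive rule: straw length is uniform between 0 and the crush weight."""
    return rng.uniform(0.0, weight)


def straw2_straw(rng: random.Random, weight: float) -> float:
    """Straw2-style rule: ln(u) / weight with u uniform in (0, 1]."""
    u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
    return math.log(u) / weight


def win_ratio(draw, trials: int = 200_000, seed: int = 1) -> float:
    """Wins of the weight-8 OSD divided by wins of the weight-4 OSD."""
    rng = random.Random(seed)
    wins_8 = sum(1 for _ in range(trials) if draw(rng, 8.0) > draw(rng, 4.0))
    return wins_8 / (trials - wins_8)


uniform_ratio = win_ratio(uniform_straw)
straw2_ratio = win_ratio(straw2_straw)
print(f"uniform straws:     8TB wins {uniform_ratio:.2f}x as often as 4TB")
print(f"ln(u)/weight straws: 8TB wins {straw2_ratio:.2f}x as often as 4TB")
```

With uniform straws the heavier OSD wins 3/4 of the draws (a 3:1 ratio), while with ln(u)/weight it wins 8/(8+4) = 2/3 of them (a 2:1 ratio), which is the proportional split we want.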


Besides the crush weight, each OSD also has an `osd weight` between 0.0 and 1.0. An osd weight of zero means the OSD is out, and an osd weight of 1.0 means it is fully in. If an OSD whose state is out wins the straw draw, the PG is retried to choose a different OSD. A value in between means the OSD is partially in: for example, with an osd weight of 0.8 the OSD is 80% in, so about 20% of the PGs that this OSD wins are rejected and retried to choose a different OSD.


https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L424
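
The partial "in" behaviour can be sketched like this. Again this is an illustration under my own assumptions rather than the real code: `is_out` here is a hypothetical Python helper, SHA-256 stands in for the CRUSH hash, and the 16-bit threshold comparison is my reading of how the check works.

```python
import hashlib


def is_out(pg: int, osd_id: int, osd_weight: float) -> bool:
    """Decide whether a PG that this OSD won must be retried elsewhere."""
    threshold = int(osd_weight * 0x10000)  # osd weight as a 16-bit fraction
    if threshold >= 0x10000:
        return False  # fully in: never rejected
    if threshold == 0:
        return True   # out: always rejected
    # Stand-in hash (assumption): 16 pseudo-random bits per (pg, osd) pair.
    digest = hashlib.sha256(f"{pg}:{osd_id}".encode()).digest()
    h = int.from_bytes(digest[:2], "big")
    return h >= threshold  # rejected for roughly (1 - osd_weight) of the PGs


pgs = 20000
rejected = sum(is_out(pg, 7, 0.8) for pg in range(pgs))
fraction = rejected / pgs
print(f"osd weight 0.8: {fraction:.1%} of won PGs are retried")
```

Because the decision is a hash of the PG and the OSD rather than a fresh random number, the same PGs are rejected every time the mapping is computed, so the placement stays stable.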


Best wishes,
Yao Zongyou

------------------ Original Message ------------------
From: "Bobby"<italienisch1987@xxxxxxxxx>;
Sent: Monday, April 6, 2020, 7:50 AM
To: "Sam Just"<sjust@xxxxxxxxxx>;
Cc: "dev"<dev@xxxxxxx>;
Subject: Re: Straw2 Bucket related questions


Cool! That helped a lot! Are there any other sources that explain the Straw2 bucket in more detail? Or, let's say, a mapper.c code walkthrough?

Bobby

On Fri, Apr 3, 2020 at 11:05 PM Sam Just <sjust@xxxxxxxxxx> wrote:
Here's Sage's initial writeup:
https://www.spinics.net/lists/ceph-devel/msg21635.html
-Sam

On Fri, Apr 3, 2020 at 12:12 PM Bobby <italienisch1987@xxxxxxxxx> wrote:
>
> Hi,
>
> I am trying to understand **Straw2** bucket used in **CRUSH algorithm** of **Ceph**. I have some specific questions. The code is given below:
>
> **Questions:**
>
> - Why is there a need to take the **log** of the **hash value**?
> - Is **x** the **placement ps** calculated by the **crush_hash32_2** function?
> - What is the function **crush_ln()** in the given code (mapper.c) actually computing? I am confused by the comment **2^44*log2(input+1)**.
> - Why is there a need to create a negative number based on the **ln (natural log)** of the hash value?
>
> Please help me understand these points.
>
> Thanks in advance
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx

