Re: Calculating the expected PGs distribution


On 05/11/2017 03:52 AM, Xavier Villaneau wrote:
> Le 05/10/2017 à 02:45 AM, Loic Dachary a écrit :
>> The best we can do is to lower the weight of the overweight item to get a set of weights that can lead to an even distribution. We cannot fix the fact that the overweight item will be underfilled, but by fixing the weights we can prevent the other items from being overfilled, which is what causes problems for the sysadmin.
> I'm not sure that's the correct approach; over/under filled are relative notions, meaning that if there are drives filling up slower than the average cluster usage, then there must be drives filling up faster. Whatever happens, that first large OSD won't be full when the other drives are.
>> In our example (with 2 replicas)

This was not the correct approach indeed.

>>
>> ( 5 + 1 + 1 + 1 + 1 ) / 2 = 4.5 therefore all items with a weight > 4.5 are overweight
>>
>> we remove the overweight items and sum the weight of the remaining items:
>>
>> ( 1 + 1 + 1 + 1 ) = 4
>>
>> and we divide by the number of replicas minus the number of overweight items
>>
>> 4 / ( 2 - 1 ) = 4
>>
>> and we set the weight of the overweight item to this number
>>
>> ( 4 + 1 + 1 + 1 + 1 ) / 2 = 4 therefore all items are <= maximum weight
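The capping steps quoted above can be sketched as a short script (a minimal illustration of the arithmetic in this thread, not Ceph's actual implementation; the function name is made up):

```python
def cap_overweight(weights, replicas):
    """Cap items whose weight exceeds total/replicas, per the steps above.

    Hypothetical helper for illustration only. Handles the simple case
    discussed here, where fewer than `replicas` items are overweight.
    """
    limit = sum(weights) / replicas          # ( 5+1+1+1+1 ) / 2 = 4.5
    over = [w for w in weights if w > limit]
    rest = sum(w for w in weights if w <= limit)  # 1+1+1+1 = 4
    capped = rest / (replicas - len(over))        # 4 / ( 2-1 ) = 4
    return [capped if w > limit else w for w in weights]

print(cap_overweight([5, 1, 1, 1, 1], 2))  # [4.0, 1, 1, 1, 1]
```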
> Actually, taking those weights would be worse: because the weight of the large drive is lower, CRUSH will favor putting data on the other drives a bit more, which will lead to them filling up faster.
> 
> In the details with 4 1 1 1 1, numbered 0 through 4 with 2 replicas and 100 PGs:
> - There is a 4/8 = 50% chance of picking 0 as a primary drive, so that's 50 PGs using it already.
> - There is a 12.5% (1/8) chance of picking 1 as primary. Once that is done, the remaining weights are 4, 1, 1, 1 for a total of 7, meaning there is a 4/7 = 57% chance of picking 0 as second OSD. Therefore the chance of mapping to [1, 0] is 7.14%. Same for 2, 3 and 4 as primary, so the total chance of having 0 as second OSD is 28.6%.
> On average, 78.6 PGs will be mapped to OSD 0 in either position. If we round up the numbers nicely, the final expected PGs for 4 1 1 1 1 is around 80 30 30 30 30. This is indeed more even, but actually worse for drive usage if the size ratios are 5 1 1 1 1.
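Xavier's numbers can be reproduced with a short script. This is a simplified model where the primary is drawn proportionally to weight and the secondary proportionally to weight among the remaining OSDs (drawing without replacement), not a simulation of the real CRUSH algorithm:

```python
from itertools import permutations

def expected_pgs(weights, pgs=100):
    """Expected PG count per OSD with 2 replicas, under a simplified
    weight-proportional draw-without-replacement model (not real CRUSH)."""
    total = sum(weights)
    counts = [0.0] * len(weights)
    for i, j in permutations(range(len(weights)), 2):
        # probability that a PG maps to [i, j]
        p = (weights[i] / total) * (weights[j] / (total - weights[i]))
        counts[i] += p * pgs
        counts[j] += p * pgs
    return counts

print([round(c, 1) for c in expected_pgs([4, 1, 1, 1, 1])])
# [78.6, 30.4, 30.4, 30.4, 30.4]
```

This matches the hand calculation: 50 PGs from OSD 0 as primary plus 28.6 as second OSD gives 78.6, with roughly 30 on each of the small OSDs.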

Right. The corrected weights are only useful for one thing in the end: separating how much space is lost on a given device because of the excessive weight from the space that is lost because the distribution is uneven and could be optimized.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


