On 15/08/2014 14:36, Erik Logtenberg wrote:
>>>>> Now, there are certain combinations of K and M that appear to have
>>>>> more or less the same result. Do any of these combinations have
>>>>> pros and cons that I should consider, and/or are there best
>>>>> practices for choosing the right K/M parameters?
>>>
>>> Loic might have a better answer, but I think that the more segments
>>> (K) you have, the heavier recovery gets. You have to contact more
>>> OSDs to reconstruct the whole object, so that involves more disks
>>> doing seeks.
>>>
>>> I heard somebody from Fujitsu say that he thought 8/3 was best for
>>> most situations. That wasn't with Ceph though, but with a different
>>> system which implemented erasure coding.
>>
>> Performance is definitely lower with more segments in Ceph. I kind of
>> gravitate toward 4/2 or 6/2, though that's just my own preference.
>
> This is indeed the kind of pros and cons I was thinking about.
> Performance-wise, I would expect differences, but I can think of both
> positive and negative effects of bigger values of K.
>
> For instance, yes, recovery takes more OSDs with bigger values of K,
> but it seems to me that there are also fewer or smaller items to
> recover. Also, read performance generally appears to benefit from
> having a bigger cluster (more parallelism), so I can imagine that
> bigger values of K also provide an increase in read performance.
>
> Mark says more segments hurt performance though; are you referring
> just to rebuild performance, or also to basic operational performance
> (read/write)?
>
>>>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool
>>>>> will use 5 OSDs and sustain the loss of 2 OSDs. There is 40%
>>>>> overhead in this configuration.
>>>>>
>>>>> Now, if I were to choose K = 6 and M = 4, I would end up with PGs
>>>>> that use 10 OSDs and sustain the loss of 4 OSDs, which is
>>>>> statistically not so much different from the first configuration.
>>>>> Also, there is the same 40% overhead.
>>>>
>>>> Although I don't have numbers in mind, I think the odds of losing
>>>> two OSDs simultaneously are a lot smaller than the odds of losing
>>>> four OSDs simultaneously. Or am I misunderstanding you when you
>>>> write "statistically not so much different from the first
>>>> configuration"?
>>>
>>> Losing two smaller than losing four? Is that correct, or did you
>>> mean it the other way around?
>>>
>>> I'd say that losing four OSDs simultaneously is less likely to
>>> happen than losing two simultaneously.
>>
>> This is true, though the more disks you spread your objects across,
>> the higher the likelihood that any given object will be affected by a
>> lost OSD. The extreme case being that every object is spread across
>> every OSD, and losing any given OSD affects all objects. I suppose
>> the severity depends on the size of your erasure coding parameters
>> relative to the total number of OSDs. I think this is perhaps what
>> Erik was getting at.
>
> I haven't done the actual calculations, but given some % chance of
> disk failure, I would assume that losing x out of y disks has roughly
> the same chance as losing 2*x out of 2*y disks over the same period.
>
> That's also why you generally want to limit RAID5 arrays to maybe 6
> disks or so and move to RAID6 for bigger arrays. For arrays bigger
> than 20 disks you would usually split those into separate arrays, just
> to keep the (parity disks / total disks) fraction high enough.
>
> With regard to data safety I would guess that 3+2 and 6+4 are roughly
> equal, although the behaviour of 6+4 is probably easier to predict,
> because bigger numbers make your calculations less dependent on
> individual deviations in reliability.
>
> Do you guys feel this argument is valid?

Here is how I reason about it, roughly: if the probability of losing a
disk (i.e. before the failure can be recovered) is 0.1, then the
probability of losing two disks simultaneously is 0.1 * 0.1 = 0.01,
three disks becomes 0.1 * 0.1 * 0.1 = 0.001, and four disks becomes
0.0001.

Accurately calculating the reliability of the system as a whole is a
lot more complex (see
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/
for more information).
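In the meantime, to put the back-of-the-envelope version in code, here
is a rough sketch (plain Python, illustrative numbers only; it assumes
chunk failures are independent within a single recovery window, which
a real durability model would not):

    from math import factorial

    def comb(n, r):
        # Number of ways to choose r failed chunks out of n.
        return factorial(n) // (factorial(r) * factorial(n - r))

    def p_object_loss(k, m, p):
        # An object survives as long as at most m of its k+m chunks are
        # lost, so data loss means m+1 or more simultaneous failures.
        n = k + m
        return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
                   for i in range(m + 1, n + 1))

    for k, m in [(3, 2), (6, 4)]:
        print("%d+%d: p(loss) = %.3g" % (k, m, p_object_loss(k, m, 0.001)))

Under these over-simplified assumptions, 6+4 comes out a few orders of
magnitude safer than 3+2 at the same 40% overhead, simply because five
simultaneous failures are much rarer than three; the price is that
every read and recovery touches more OSDs, which matches the
performance observations earlier in this thread.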
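And as a footnote to the original question about combinations that look
the same: the raw-space overhead is always m / (k + m), and a full read
or recovery has to collect k chunks, so the combinations mentioned in
this thread compare like this (same sketch style as above):

    # Overhead and fan-out for the K/M combinations mentioned in this
    # thread: overhead is the fraction of raw space spent on coding
    # chunks, and k is the number of OSDs a read or recovery contacts.
    for k, m in [(3, 2), (4, 2), (6, 2), (8, 3), (6, 4)]:
        print("k=%d m=%d: overhead %.0f%%, touches %d OSDs"
              % (k, m, 100.0 * m / (k + m), k))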
Cheers

> Erik.

-- 
Loïc Dachary, Artisan Logiciel Libre