>>>> Now, there are certain combinations of K and M that appear to have more
>>>> or less the same result. Do any of these combinations have pros and
>>>> cons that I should consider and/or are there best practices for
>>>> choosing the right K/M parameters?
>>>>
>>
>> Loic might have a better answer, but I think that the more segments (K)
>> you have, the heavier the recovery. You have to contact more OSDs to
>> reconstruct the whole object, so that involves more disks doing seeks.
>>
>> I heard somebody from Fujitsu say that he thought 8/3 was best for most
>> situations. That wasn't with Ceph though, but with a different system
>> which implemented erasure coding.
>
> Performance is definitely lower with more segments in Ceph. I kind of
> gravitate toward 4/2 or 6/2, though that's just my own preference.

This is indeed the kind of pros and cons I was thinking about.
Performance-wise, I would expect differences, but I can think of both
positive and negative effects of bigger values of K. For instance, yes,
recovery takes more OSDs with bigger values of K, but it seems to me that
there are also fewer or smaller items to recover. Also, read performance
generally appears to benefit from having a bigger cluster (more
parallelism), so I can imagine that bigger values of K also provide an
increase in read performance. Mark says that more segments hurt
performance, though; are you referring just to rebuild performance, or
also to basic operational performance (read/write)?

>>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will
>>>> use 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in
>>>> this configuration.
>>>>
>>>> Now, if I were to choose K = 6 and M = 4, I would end up with PGs that
>>>> use 10 OSDs and sustain the loss of 4 OSDs, which is statistically not
>>>> so much different from the first configuration. Also, there is the same
>>>> 40% overhead.
>>>
>>> Although I don't have numbers in mind, I think the odds of losing two
>>> OSDs simultaneously are a lot smaller than the odds of losing four OSDs
>>> simultaneously. Or am I misunderstanding you when you write
>>> "statistically not so much different from the first configuration"?
>>>
>>
>> Losing two smaller than losing four? Is that correct, or did you mean
>> it the other way around?
>>
>> I'd say that losing four OSDs simultaneously is less likely to happen
>> than losing two simultaneously.
>
> This is true, though the more disks you spread your objects across, the
> higher the likelihood that any given object will be affected by a lost
> OSD. The extreme case is that every object is spread across every OSD,
> so losing any given OSD affects all objects. I suppose the severity
> depends on how large your erasure coding parameters are relative to the
> total number of OSDs. I think this is perhaps what Erik was getting at.

I haven't done the actual calculations, but given some % chance of disk
failure, I would assume that losing x out of y disks has roughly the same
chance as losing 2*x out of 2*y disks over the same period. That's also
why you generally want to limit RAID5 arrays to maybe 6 disks or so and
move to RAID6 for bigger arrays. For arrays bigger than 20 disks you would
usually split those into separate arrays, just to keep the (parity disks /
total disks) fraction high enough.
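
To make that guess checkable, here is a minimal back-of-envelope sketch in
Python. It assumes independent OSD failures and a made-up 1% chance of
losing any given OSD within the window that matters; both the model and
the number are placeholders, so plug in your own estimates:

from math import comb

# Assumption (a simplification): OSD failures are independent, and each of
# the k+m OSDs holding a chunk is lost with probability P_FAIL before
# recovery completes. The 1% figure is made up.
P_FAIL = 0.01

def p_data_loss(k, m, p=P_FAIL):
    # Probability that more than m of the k+m chunks are lost, i.e. fewer
    # than k chunks survive and the object can no longer be reconstructed.
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m + 1, n + 1))

for k, m in [(3, 2), (6, 4), (8, 3), (4, 2), (6, 2)]:
    overhead = m / (k + m)  # fraction of raw capacity spent on parity chunks
    print("k=%d m=%d: overhead %.0f%%, P(data loss) ~ %.2e"
          % (k, m, overhead * 100, p_data_loss(k, m)))

Whether 3+2 and 6+4 really come out "statistically not so much different"
then simply falls out of the numbers.
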

With regard to data safety, I would guess that 3+2 and 6+4 are roughly
equal, although the behaviour of 6+4 is probably easier to predict,
because bigger numbers make your calculations less dependent on individual
deviations in reliability. Do you guys feel this argument is valid?

Erik.
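
P.S. Mark's point about spread can be put into equally rough numbers. A
tiny sketch, assuming chunks are placed uniformly over a hypothetical
50-OSD cluster (the cluster size is just an example):

# With k+m chunks placed on distinct OSDs drawn (roughly) uniformly from
# N_OSDS, the expected fraction of PGs/objects that include any one failed
# OSD grows with the stripe width k+m, and every affected PG has to read k
# surviving chunks to rebuild the missing one.
N_OSDS = 50  # hypothetical cluster size, just for illustration

for k, m in [(3, 2), (6, 4)]:
    affected = (k + m) / N_OSDS
    print("k=%d m=%d: ~%.0f%% of PGs touched per OSD failure, "
          "each rebuild reads %d chunks from %d other OSDs"
          % (k, m, affected * 100, k, k))

So the wider the stripe is relative to the cluster, the larger the share
of PGs that any single failure touches.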