Re: Comments on Ceph distributed parity implementation

Hi Harvey,

On 06/18/2013 04:31 PM, Harvey Skinner wrote:
> all,
> 
> 
> nice discussion on implementing erasure coding as an option in Ceph for
> the object store. This will be a great feature option. I would recommend
> that a good default be implemented as discussed here, but also that an
> API or plug-in interface be architected at the same time, so a deployment
> could also use some other form of parity/erasure coding if required or
> desired. Governments and some vendors will push for (or have hardcoded in
> an RFP) some other algorithm, and an API/plug-in capability will allow
> Ceph to remain applicable. There are proprietary object store solutions
> today which allow other erasure-code/parity plug-ins that you will want
> to be competitive with.
> 

For most coding libraries it looks like an abstract API such as:

// extra == GF(2^8) for instance, i.e. parameters dependent on the kind of encoding function
void* initialize(int k, int m, void* extra);
// encode a block into k+m chunks
void code(void* context, char* block, char** chunks);
// decode the k+m chunks back into block; if erased[i] is set, chunk i is missing and is repaired
void decode(void* context, char** chunks, char* block, int* erased);

is all we need. However, if hierarchical codes are to be considered, the code() function will need information about the location of the object ( the crushmap maybe ? ) and will probably need to output additional information to be used when coding other objects, not just chunks.
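
To make the intent concrete, here is a minimal usage sketch of that API, repeating the declarations above as prototypes so it is self-contained. Everything below is hypothetical: the constants, the NULL extra argument standing for default GF(2^8) parameters, and the chunk-ownership convention are assumptions for illustration, not an existing Ceph API.

#include <stdlib.h>
#include <string.h>

void* initialize(int k, int m, void* extra);
void code(void* context, char* block, char** chunks);
void decode(void* context, char** chunks, char* block, int* erased);

#define K 4                            /* data chunks */
#define M 2                            /* coding chunks */
#define BLOCK_SIZE 4096
#define CHUNK_SIZE (BLOCK_SIZE / K)

int main(void)
{
    void* ctx = initialize(K, M, NULL);   /* NULL: default GF(2^8) parameters */

    char block[BLOCK_SIZE];
    memset(block, 'x', sizeof(block));

    /* the caller owns the chunk buffers; the last M are coding chunks */
    char* chunks[K + M];
    for (int i = 0; i < K + M; i++)
        chunks[i] = malloc(CHUNK_SIZE);

    code(ctx, block, chunks);             /* split + encode into K+M chunks */

    /* simulate losing one data chunk and one coding chunk,
       then rebuild both the chunks and the original block */
    int erased[K + M] = {0};
    erased[1] = 1;
    erased[K] = 1;
    decode(ctx, chunks, block, erased);
    return 0;
}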

What do you think?

> I would see initial deployments of erasure-coded object store being used
> for static data objects; block device usage would come later, if it proves
> to be as performant as replicated objects. Replication as done now in Ceph
> may still be a better block storage choice for DR strategies, with replicas
> configured to be at different DR site(s).

I suspect you're correct: using erasure coding in a pool used by rgw is likely to be the first actual use case.

> 
> just my view of course …
> 
> Harvey
> 
> 
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Von-Stamwitz
>> Sent: Friday, June 14, 2013 6:12 PM
>> To: Loic Dachary; Martin Flyvbjerg
>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>> Subject: RE: Comments on Ceph distributed parity implementation
>>
>> Hi Loic and Martin,
>>
>> This is a great discussion and I agree that the performance ramifications
>> of erasure coding need to be thought out very carefully. Since chunks are
>> not only encoded, but distributed across the cluster, we need to pay
>> attention to the network overhead as well as the arithmetic involved in the
>> encoding/decoding.
>>
>> If I understand the proposal correctly, objects begin their life's journey
>> replicated as normal. As they grow cold, they get transformed to an
>> erasure-coded PG in another pool. Subsequent reads will be redirected (ehh).
>> Subsequent writes will first decode the original object and re-replicate it
>> (ouch!). Client writes are never encoded on the fly; they are always
>> replicated (nice).
>>
>> So...
>> encode() is run as a low-priority background process probably once a week
>> during deep scrubs.
>> decode() should be rare (if not, the object shouldn't have been encoded in
>> the first place.) If the cluster is healthy no arithmetic is needed, just
>> concatenation, but a lot of network activity.
>> repair() operations will be the most prevalent and may occur at any time
>> during normal self-healing/rebalancing operations.
>>
>> Therefore, in my opinion, the algorithm that we choose must be optimized
>> for repairing damaged and missing chunks. The main problem I have with
>> Reed-Solomon is that it is an MDS code, which maximizes network activity
>> for reconstruction. Pyramid codes have the same write (encode) overhead,
>> but have better read (repair) overhead.
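
To put hypothetical numbers on that repair overhead: with a (k=10, m=4) MDS code such as Reed-Solomon, rebuilding a single lost chunk requires reading k = 10 surviving chunks over the network. A pyramid variant that splits the 10 data chunks into two groups of 5 and trades one global parity for two local parities stores 15 chunks instead of 14 (1.5x instead of 1.4x storage), but can repair a single lost data chunk from the 5 survivors of its own group, halving the repair traffic. The parameters are made up here, but that is the general shape of the trade-off.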
>>
>> Loic, I know nothing about Mojette Transforms. From what little I gleaned,
>> it might be good for repair (needing only a subset of chunks within a range
>> to recalculate a missing chunk) but I'm worried about the storage
>> efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>>
>> All the best,
>> Paul
>>
>> On 06/14/2013 3:57 PM, Loic Dachary wrote:
>>> Hi Martin,
>>>
>>> Your explanations are very helpful to better understand the tradeoffs
>>> of the existing implementations. To be honest I was looking forward to
>>> an intervention like yours. Not you specifically, of course :-) But
>>> someone with a good theoretical background to judge what's best in the
>>> context of Ceph.
>>> If you say it's the upcoming library to be released in August 2013,
>>> I'll take your word for it.
>>>
>>> The work currently being done within Ceph is to rearchitect the
>>> storage backend ( namely placement groups ) to make room for distributed
>>> parity.
>>> My initial idea was to isolate the low level library under an API that
>>> takes a region ( 16KB for instance, as in gf_unit.c found in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar )
>>> as input and outputs chunks that can then be written on different hosts.
>>> For instance
>>>
>>>       encode(char* region, char** chunks) => encode the region into N chunks
>>>       decode(char** chunks, char* region) => decode the N chunks into a region
>>>       repair(char** chunks, int damaged) => repair the damaged chunk
>>>
>>> Do you think it is a sensible approach? And if you do, will I find
>>> examples of such higher level functions in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ?
>>> Or elsewhere?
>>>
>>> I'm a little confused about the relation between GF-Complete ( as
>>> found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar )
>>> which is very recent ( 2013 ) and Jerasure ( as found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar )
>>> which is comparatively older ( 2008 ). Do you know how Jerasure 2.0
>>> relates to GF-Complete?
>>>
>>> For completeness, here is a thread with pointers to the Mojette Transform
>>> that's being used as part of RozoFS.
>>>
>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg14666.html
>>>
>>> I'm not able to compare it with the other libraries because it seems
>>> to take a completely different approach. Do you have an opinion about
>>> it?
>>>
>>> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next
>>> month but I'd love to understand more about this as soon as possible
>>> :-)
>>>
>>> Cheers
>>>
>>> P.S. Updated
>>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage
>>> with a link to
>>> http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
>>>
>>> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
>>>> Dear Community
>>>> I am a young engineer (not software or math, please bear with me)
>>>> with some suggestions regarding erasure codes. I have never used Ceph
>>>> before, or any other distributed file system.
>>>>
>>>> I stumbled upon the suggestion for adding erasure codes to Ceph, as
>>>> described in this article:
>>>> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>>>>
>>>> First I would like to say: great initiative to add erasure codes to
>>>> Ceph.
>>>> Ceph needs its own implementation and it has to be done right, I cannot
>>>> stress this enough; the software suggested in that article would result
>>>> in very low performance.
>>>>
>>>> Why?
>>>> Reed-Solomon is normally regarded as being very slow compared to other
>>>> erasure codes, because the underlying Galois field multiplication is
>>>> slow. Please see the video at usenix.org for an explanation.
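
To illustrate the per-byte cost being referred to, here is a minimal, unoptimized GF(2^8) multiply in C ( a sketch assuming the common 0x11d polynomial; implementations differ in their choice of polynomial ). This bit-by-bit loop runs for every single byte of data, which is what makes naive Reed-Solomon slow:

#include <stdint.h>

/* Minimal, unoptimized multiplication in GF(2^8), assuming the
 * polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). Each multiply costs
 * up to 8 shift/XOR steps, and one runs for every byte encoded. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;                              /* add (XOR) a into the product */
        b >>= 1;
        a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);  /* multiply a by x, then reduce */
    }
    return p;
}
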
>>>>
>>>> The Zfec library and the other suggested implementations rely on the
>>>> Vandermonde matrix used in classic Reed-Solomon erasure codes; a faster
>>>> approach would be the Cauchy Reed-Solomon construction. Please see
>>>> [1,2,3].
>>>> There is something even better though: by using the Intel SSE2/3 SIMD
>>>> instructions it is possible to make Galois field multiplication as fast
>>>> as any other XOR based erasure code (RaptorQ, LT codes, LDPC, etc.).
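
A rough sketch of the table-lookup trick behind that claim, as I understand it from the FAST'13 paper: because GF addition is XOR, c*v equals c*(high nibble of v) XOR c*(low nibble of v), so two 16-entry tables cover all byte values and SSSE3's _mm_shuffle_epi8 performs 16 of those lookups at once. Same 0x11d polynomial assumption as above; handling of the len % 16 tail is omitted:

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Bitwise GF(2^8) multiply (same as the previous sketch), used only
 * to build the lookup tables, never in the hot loop. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
    }
    return p;
}

/* Multiply every byte of region by the constant c, 16 bytes at a time. */
void gf_mul_region(uint8_t *region, size_t len, uint8_t c)
{
    uint8_t lo[16], hi[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = gf_mul(c, (uint8_t)i);          /* c * low nibble  */
        hi[i] = gf_mul(c, (uint8_t)(i << 4));   /* c * high nibble */
    }
    __m128i tlo  = _mm_loadu_si128((const __m128i *)lo);
    __m128i thi  = _mm_loadu_si128((const __m128i *)hi);
    __m128i mask = _mm_set1_epi8(0x0f);
    for (size_t i = 0; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(region + i));
        __m128i l = _mm_shuffle_epi8(tlo, _mm_and_si128(v, mask));
        __m128i h = _mm_shuffle_epi8(thi,
                        _mm_and_si128(_mm_srli_epi64(v, 4), mask));
        _mm_storeu_si128((__m128i *)(region + i), _mm_xor_si128(l, h));
    }
}
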
>>>>
>>>> The suggested FECpp lib uses that optimisation, but with a relatively
>>>> small Galois field, only 2^8. Since Ceph aims at unlimited scalability,
>>>> increasing the size of the Galois field would improve performance [4].
>>>> Of course the configured Ceph Object Size and/or Stripe width have to
>>>> be taken into account.
>>>> Please see
>>>> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
>>>>
>>>>
>>>> The solution
>>>> Use the GF-Complete open source library [4] to implement Reed-Solomon
>>>> in Ceph, in order to allow Ceph to scale to infinity.
>>>> James S. Plank, the author of GF-Complete, has developed a library
>>>> implementing various Reed-Solomon codes called Jerasure.
>>>> http://web.eecs.utk.edu/~plank/plank/www/software.html
>>>> Jerasure 2.0, which uses the GF-Complete arithmetic based on Intel SSE
>>>> SIMD instructions, is currently in development, with release expected
>>>> in August 2013. It will be released under the new BSD license. Jerasure
>>>> 2.0 also supports arbitrary Galois field sizes: 8, 16, 32, 64 or 128
>>>> bits.
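
For reference, encoding and repairing with the already-released Jerasure 1.2 looks roughly like the sketch below ( written from memory of jerasure.h and reed_sol.h, not tested here; error handling omitted ):

#include <stdlib.h>
#include "jerasure.h"
#include "reed_sol.h"

int main(void)
{
    int k = 4, m = 2, w = 8, size = 4096;   /* size: bytes per buffer */
    char *data[4], *coding[2];

    for (int i = 0; i < k; i++) data[i] = calloc(1, size);
    for (int i = 0; i < m; i++) coding[i] = calloc(1, size);

    /* build the Vandermonde-derived coding matrix ... */
    int *matrix = reed_sol_vandermonde_coding_matrix(k, m, w);

    /* ... and compute the m coding buffers from the k data buffers */
    jerasure_matrix_encode(k, m, w, matrix, data, coding, size);

    /* repair: erasures lists the lost buffer ids (data: 0..k-1,
     * coding: k..k+m-1), terminated by -1 */
    int erasures[] = { 1, -1 };
    jerasure_matrix_decode(k, m, w, matrix, 1, erasures, data, coding, size);

    return 0;
}
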
>>>>
>>>> The limit of this implementation would be the processor's L2/L3 cache,
>>>> not the underlying arithmetic.
>>>>
>>>> Best Regards
>>>> Martin Flyvbjerg
>>>>
>>>> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
>>>> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
>>>> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
>>>> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do
>>> nothing.
>>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


