Hi Harvey,

On 06/18/2013 04:31 PM, Harvey Skinner wrote:
> all,
>
> nice discussion on implementing erasure encoding as an option in Ceph
> for the object store. This will be a great feature option. I would
> recommend that a good default be implemented, as discussed here, but
> also that an API or plug-in be architected at the same time, so a
> deployment could use some other form of parity/erasure encoding if
> required or desired. Governments and other vendors will push for (or
> have hardcoded in an RFP) some other algorithm, and an API/plug-in
> capability will allow Ceph to still be applicable. There are
> proprietary object store solutions today which allow other
> erasure-encode/parity plug-ins, and you will want to be competitive
> with them.

For most coding libraries it looks like an abstract API such as:

  // extra == GF(2^8) for instance, i.e. parameters dependent on the
  // kind of encoding
  context = initialize(int k, int m, void* extra)
  // encode a block into k+m chunks
  code(context, char* block, char** chunks)
  // decode k+m chunks into a block; if erased[i] is set, repair chunk[i]
  decode(context, char** chunks, char* block, int* erased)

is all we need. However, if hierarchical codes are to be considered, the
code() function will need information about the location of the object
(the crushmap maybe?) and will probably need to output additional
information to be used when coding other objects, not just chunks. What
do you think?

> I would see initial deployments of an erasure-encoded object store be
> used for static data objects; block device usage would come later, if
> it proves to be as performant as replicated objects. Replication as
> done now in Ceph may still be the better block storage choice for DR
> strategies, with replicas configured at different DR site(s).

I suspect you're correct: using erasure coding in a pool used by rgw is
likely to be the first actual use case.

> just my view of course …
>
> Harvey
>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Von-Stamwitz
>> Sent: Friday, June 14, 2013 6:12 PM
>> To: Loic Dachary; Martin Flyvbjerg
>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>> Subject: RE: Comments on Ceph distributed parity implementation
>>
>> Hi Loic and Martin,
>>
>> This is a great discussion, and I agree that the performance
>> ramifications of erasure coding need to be thought out very carefully.
>> Since chunks are not only encoded but also distributed across the
>> cluster, we need to pay attention to the network overhead as well as
>> the arithmetic involved in encoding/decoding.
>>
>> If I understand the proposal correctly, objects begin their life's
>> journey replicated as normal. As an object grows cold, it gets
>> transformed into an erasure-coded PG in another pool. Subsequent reads
>> will be redirected (ehh). Subsequent writes will first decode the
>> original object and re-replicate it (ouch!). Client writes are never
>> encoded on the fly; they are always replicated (nice).
>>
>> So...
>> encode() is run as a low-priority background process, probably once a
>> week during deep scrubs.
>> decode() should be rare (if not, the object shouldn't have been
>> encoded in the first place). If the cluster is healthy no arithmetic
>> is needed, just concatenation, but there is a lot of network activity.
>> repair() operations will be the most prevalent and may occur at any
>> time during normal self-healing/rebalancing operations.
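Your ordering of the costs matches the abstract API I sketched above. To
make the plug-in idea concrete, here is a minimal C sketch of what such
an interface could look like; the names and signatures are mine, for
illustration only, not an actual Ceph API. It adds the repair() entry
point that, as you point out, will dominate in practice:

  /* Hypothetical plug-in vtable, for illustration only: not an actual
   * Ceph API. A backend (Reed-Solomon, Cauchy, pyramid, ...) fills
   * these in and registers itself at load time. */
  #include <stddef.h>

  struct ec_context;   /* opaque state allocated by the plug-in */

  struct ec_plugin {
      /* extra carries code-specific parameters, e.g. the Galois field */
      struct ec_context *(*initialize)(int k, int m, void *extra);
      /* encode one block into k data chunks + m parity chunks */
      int (*code)(struct ec_context *ctx, const char *block,
                  size_t size, char **chunks);
      /* rebuild the block from any k intact chunks;
       * erased[i] != 0 marks chunk i as lost */
      int (*decode)(struct ec_context *ctx, char **chunks, char *block,
                    size_t size, const int *erased);
      /* recompute a single damaged chunk, ideally reading fewer than
       * k surviving chunks */
      int (*repair)(struct ec_context *ctx, char **chunks, int damaged);
      void (*release)(struct ec_context *ctx);
  };

A hierarchical code would need one more hook to receive the object's
location (the crushmap) and to emit the extra cross-object information
mentioned above.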
>> Therefore, in my opinion, the algorithm we choose must be optimized
>> for repairing damaged and missing chunks. The main problem I have with
>> Reed-Solomon is that it is an MDS code, which maximizes network
>> activity for recalculations. Pyramid codes have the same write
>> (encode) overhead, but better read (repair) overhead.
>>
>> Loic, I know nothing about Mojette Transforms. From what little I
>> gleaned, it might be good for repair (needing only a subset of chunks
>> within a range to recalculate a missing chunk), but I'm worried about
>> the storage efficiency. RozoFS claims 1.5x. I'd like to do better than
>> that.
>>
>> All the best,
>> Paul
>>
>> On 06/14/2013 3:57 PM, Loic Dachary wrote:
>>> Hi Martin,
>>>
>>> Your explanations are very helpful for better understanding the
>>> tradeoffs of the existing implementations. To be honest, I was
>>> looking forward to your intervention. Not you specifically, of course
>>> :-) But someone with a good theoretical background to judge what's
>>> best in the context of Ceph. If you say it's the upcoming library to
>>> be released in August 2013, I'll take your word for it.
>>>
>>> The work currently being done within Ceph is to architect the storage
>>> backend (namely placement groups) to make room for distributed
>>> parity. My initial idea was to isolate the low-level library under an
>>> API that takes a region (16KB for instance, as in gf_unit.c found in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar )
>>> as input and outputs chunks that can then be written on different
>>> hosts. For instance:
>>>
>>>   encode(char* region, char** chunks) => encode the region into N chunks
>>>   decode(char** chunks, char* region) => decode the N chunks into a region
>>>   repair(char** chunks, int damaged)  => repair the damaged chunk
>>>
>>> Do you think it is a sensible approach? And if you do, will I find
>>> examples of such higher-level functions in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ?
>>> Or elsewhere?
>>>
>>> I'm a little confused about the relation between GF-Complete (as
>>> found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ),
>>> which is very recent (2013), and Jerasure (as found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar ),
>>> which is comparatively older (2008). Do you know how Jerasure 2.0
>>> relates to GF-Complete?
>>>
>>> For completeness, here is a thread with pointers to the Mojette
>>> Transform that's being used as part of RozoFS:
>>>
>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg14666.html
>>>
>>> I'm not able to compare it with the other libraries because it seems
>>> to take a completely different approach. Do you have an opinion about
>>> it?
>>>
>>> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next
>>> month, but I'd love to understand more about this as soon as possible
>>> :-)
>>>
>>> Cheers
>>>
>>> P.S. Updated
>>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage
>>> with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html
>>> for the record
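To pin down the semantics of these three calls, here is a toy sketch of
the degenerate m=1 case: plain XOR parity. The code is mine and purely
illustrative (K and CHUNK are arbitrary parameters, not Ceph settings);
it only survives a single lost chunk, but it shows why, as Paul notes, a
healthy decode() is pure concatenation while repair() is where the real
cost model lives:

  #include <stdio.h>
  #include <string.h>

  #define K     4       /* data chunks */
  #define CHUNK 4096    /* bytes per chunk */

  /* encode: chunks[0..K-1] hold the region, chunks[K] gets XOR parity */
  static void encode(char chunks[K + 1][CHUNK], const char *region)
  {
      memcpy(chunks, region, K * CHUNK);
      memset(chunks[K], 0, CHUNK);
      for (int i = 0; i < K; i++)
          for (int j = 0; j < CHUNK; j++)
              chunks[K][j] ^= chunks[i][j];
  }

  /* repair: rebuild chunk `damaged` by XOR-ing the K survivors */
  static void repair(char chunks[K + 1][CHUNK], int damaged)
  {
      memset(chunks[damaged], 0, CHUNK);
      for (int i = 0; i <= K; i++)
          if (i != damaged)
              for (int j = 0; j < CHUNK; j++)
                  chunks[damaged][j] ^= chunks[i][j];
  }

  /* decode: once all data chunks are intact, just concatenate them */
  static void decode(char *region, char chunks[K + 1][CHUNK])
  {
      memcpy(region, chunks, K * CHUNK);
  }

  int main(void)
  {
      static char region[K * CHUNK] = "some cold rgw object";
      static char chunks[K + 1][CHUNK], back[K * CHUNK];

      encode(chunks, region);
      memset(chunks[2], 0, CHUNK);   /* lose one chunk */
      repair(chunks, 2);             /* XOR the survivors to rebuild it */
      decode(back, chunks);
      printf("round trip %s\n",
             memcmp(region, back, sizeof back) ? "failed" : "ok");
      return 0;
  }

A real plug-in would replace the XOR loops with Galois-field arithmetic
to tolerate m > 1 losses, but the shape of the three calls stays the
same.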
>>> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
>>>> Dear Community,
>>>>
>>>> I am a young engineer (not software or math, please bear with me)
>>>> with some suggestions regarding erasure codes. I have never used
>>>> Ceph or any other distributed file system before.
>>>>
>>>> I stumbled upon the suggestion for adding erasure codes to Ceph, as
>>>> described in this article:
>>>>
>>>> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>>>>
>>>> First I would like to say: great initiative to add erasure codes to
>>>> Ceph. Ceph needs its own implementation, and it has to be done
>>>> right; I cannot stress this enough. The software suggested in that
>>>> article would result in very low performance.
>>>>
>>>> Why? Reed-Solomon is normally regarded as very slow compared to
>>>> other erasure codes, because the underlying Galois-field
>>>> multiplication is slow. Please see the video at usenix.org for an
>>>> explanation.
>>>>
>>>> The Zfec library and the other suggested implementations rely on the
>>>> Vandermonde matrix used in Reed-Solomon erasure codes; a faster
>>>> approach would be to use a Cauchy Reed-Solomon implementation.
>>>> Please see [1,2,3]. There is something even better, though: by using
>>>> the Intel SSE2/3 SIMD instructions it is possible to do the
>>>> arithmetic as fast as any other XOR-based erasure code (RaptorQ, LT
>>>> codes, LDPC, etc.).
>>>>
>>>> The suggested FECpp library uses this optimisation, but with a
>>>> relatively small Galois field, only GF(2^8). Since Ceph aims at
>>>> unlimited scalability, increasing the size of the Galois field would
>>>> improve performance [4]. Of course, the configured Ceph object size
>>>> and/or stripe width have to be taken into account. Please see
>>>> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
>>>>
>>>> The solution: use the GF-Complete open-source library [4] to
>>>> implement Reed-Solomon in Ceph, in order to allow Ceph to scale to
>>>> infinity. James S. Plank, the author of GF-Complete, has developed a
>>>> library implementing various Reed-Solomon codes, called Jerasure:
>>>> http://web.eecs.utk.edu/~plank/plank/www/software.html
>>>> Jerasure 2.0, which uses the GF-Complete arithmetic based on the
>>>> Intel SSE SIMD instructions, is currently in development, with
>>>> release expected in August 2013. It will be released under the new
>>>> BSD license. Jerasure 2.0 also supports Galois-field sizes of 8, 16,
>>>> 32, 64 or 128 bits.
>>>>
>>>> The limit of this implementation would be the processor's L2/L3
>>>> cache, not the underlying arithmetic.
>>>>
>>>> Best Regards,
>>>> Martin Flyvbjerg
>>>>
>>>> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
>>>> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
>>>> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
>>>> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do
>>> nothing.
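For anyone who, like me, wants to see what the "screaming fast" trick of
[4] actually looks like, below is my own minimal sketch of the
split-table technique for GF(2^8), assuming the common 0x11d polynomial.
Two 16-entry lookup tables, one per nibble, let PSHUFB multiply 16 bytes
by a constant with a couple of shuffles. This is illustrative only;
GF-Complete and Jerasure are the real implementations:

  #include <stdint.h>
  #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

  /* Scalar GF(2^8) multiply, polynomial x^8 + x^4 + x^3 + x^2 + 1 */
  static uint8_t gf_mul(uint8_t a, uint8_t b)
  {
      uint8_t p = 0;
      while (b) {
          if (b & 1)
              p ^= a;
          b >>= 1;
          a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
      }
      return p;
  }

  /* Multiply 16 bytes of src by the constant y. Since GF addition is
   * XOR, y*x = y*(high nibble of x) ^ y*(low nibble of x); each factor
   * is read from a 16-entry table with one PSHUFB. */
  static void gf_mul_region_16(uint8_t y, const uint8_t *src, uint8_t *dst)
  {
      uint8_t lo[16], hi[16];
      for (int i = 0; i < 16; i++) {
          lo[i] = gf_mul(y, (uint8_t)i);          /* y * low nibble  */
          hi[i] = gf_mul(y, (uint8_t)(i << 4));   /* y * high nibble */
      }
      __m128i tlo  = _mm_loadu_si128((const __m128i *)lo);
      __m128i thi  = _mm_loadu_si128((const __m128i *)hi);
      __m128i mask = _mm_set1_epi8(0x0f);
      __m128i x    = _mm_loadu_si128((const __m128i *)src);
      __m128i l    = _mm_shuffle_epi8(tlo, _mm_and_si128(x, mask));
      __m128i h    = _mm_shuffle_epi8(thi,
                         _mm_and_si128(_mm_srli_epi64(x, 4), mask));
      _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(l, h));
  }

In a Vandermonde or Cauchy encoder this multiply-by-constant over a
region is the inner loop, which is why the SIMD version makes such a
difference: the tables are built once per constant and amortized over
the whole region.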
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do
nothing.