All, nice discussion on implementing erasure encoding as an option in the Ceph object store. This will be a great feature. I would recommend that a good default be implemented as discussed here, but also that an API or plug-in mechanism be architected at the same time, so a deployment could use some other form of parity/erasure encoding if required or desired. Governments and other vendors will push for (or have hard-coded in an RFP) some other algorithm, and an API/plug-in capability will keep Ceph applicable in those cases. There are proprietary object store solutions today which allow other erasure-code/parity plug-ins, and you will want to be competitive with them. I would expect initial deployments of erasure-coded object stores to be used for static data objects; block device usage would come later, if it proves to be as performant as replicated objects. Replication as done now in Ceph may still be the better block storage choice for DR strategies, with replicas configured at different DR site(s).

just my view of course …

Harvey

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Von-Stamwitz
> Sent: Friday, June 14, 2013 6:12 PM
> To: Loic Dachary; Martin Flyvbjerg
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Comments on Ceph distributed parity implementation
>
> Hi Loic and Martin,
>
> This is a great discussion, and I agree that the performance ramifications of erasure coding need to be thought out very carefully. Since chunks are not only encoded but also distributed across the cluster, we need to pay attention to the network overhead as well as the arithmetic involved in the encoding/decoding.
>
> If I understand the proposal correctly, objects begin their life's journey replicated as normal. As an object grows cold, it gets transformed into an erasure-coded PG in another pool. Subsequent reads will be redirected (ehh). Subsequent writes will first decode the original object and re-replicate it (ouch!). Client writes are never encoded on the fly; they are always replicated (nice).
>
> So...
> encode() is run as a low-priority background process, probably once a week during deep scrubs.
> decode() should be rare (if not, the object shouldn't have been encoded in the first place). If the cluster is healthy, no arithmetic is needed, just concatenation, but a lot of network activity.
> repair() operations will be the most prevalent and may occur at any time during normal self-healing/rebalancing operations.
>
> Therefore, in my opinion, the algorithm we choose must be optimized for repairing damaged and missing chunks. The main problem I have with Reed-Solomon is that it is an MDS code, which maximizes network activity for recalculations. Pyramid codes have the same write (encode) overhead, but better read (repair) overhead.
>
> Loic, I know nothing about Mojette Transforms. From what little I gleaned, it might be good for repair (needing only a subset of chunks within a range to recalculate a missing chunk), but I'm worried about the storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>
> All the best,
> Paul
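
For illustration, here is a minimal sketch of what a pluggable erasure-code backend along the lines Harvey suggests could look like, exposing exactly the three operations Paul lists above. It is hypothetical, not existing Ceph code, and every name in it is invented for the example; a real interface would also need k/m parameters, chunk sizes, and proper error handling.

    // Hypothetical sketch only; not existing Ceph code. A pluggable erasure-code
    // backend could expose the three operations discussed above, so that
    // Reed-Solomon, Cauchy-RS, Pyramid codes, etc. become interchangeable plug-ins.
    #include <map>
    #include <string>
    #include <vector>

    using Chunk = std::vector<char>;

    class ErasureCodeBackend {
    public:
        virtual ~ErasureCodeBackend() {}

        // Encode a region into k data chunks plus m coding chunks.
        virtual int encode(const std::vector<char>& region,
                           std::vector<Chunk>* chunks) = 0;

        // Reassemble the original region from a sufficient subset of chunks,
        // keyed by chunk index.
        virtual int decode(const std::map<int, Chunk>& available,
                           std::vector<char>* region) = 0;

        // Rebuild the listed missing chunks. A repair-friendly code should need
        // fewer surviving chunks (and therefore less network traffic) here than
        // a full decode().
        virtual int repair(const std::map<int, Chunk>& available,
                           const std::vector<int>& missing,
                           std::map<int, Chunk>* rebuilt) = 0;
    };

    // A registry keyed by algorithm name would let an RFP-mandated code be
    // selected per pool without touching the rest of the object store.
    using BackendFactory =
        ErasureCodeBackend* (*)(const std::map<std::string, std::string>& params);

Keeping repair() separate from decode() reflects Paul's point: repair dominates in practice, and an MDS code such as Reed-Solomon always needs k surviving chunks to rebuild one, whereas a code like Pyramid can get away with fewer.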
> On 06/14/2013 3:57 PM, Loic Dachary wrote:
> > Hi Martin,
> >
> > Your explanations are very helpful to better understand the tradeoffs of the existing implementations. To be honest, I was looking forward to your intervention. Not you specifically, of course :-) But someone with a good theoretical background who can judge what's best in the context of Ceph. If you say it's the upcoming library to be released in August 2013, I'll take your word for it.
> >
> > The work currently being done within Ceph is to architect the storage backend (namely placement groups) to make room for distributed parity.
> > My initial idea was to isolate the low-level library under an API that takes a region (16KB for instance, as in gf_unit.c found in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar) as input and outputs chunks that can then be written on different hosts. For instance:
> >
> > encode(char* region, char** chunks) => encode the region into N chunks
> > decode(char** chunks, char* region) => decode the N chunks into a region
> > repair(char** chunks, int damaged) => repair the damaged chunk
> >
> > Do you think it is a sensible approach? And if you do, will I find examples of such higher-level functions in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar? Or elsewhere?
> >
> > I'm a little confused about the relation between GF-Complete (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar), which is very recent (2013), and Jerasure (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar), which is comparatively older (2008). Do you know how Jerasure 2.0 relates to GF-Complete?
> >
> > For completeness, here is a thread with pointers to the Mojette Transform that is being used as part of RozoFS:
> >
> > http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg14666.html
> >
> > I'm not able to compare it with the other libraries because it seems to take a completely different approach. Do you have an opinion about it?
> >
> > As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month, but I'd love to understand more about this as soon as possible :-)
> >
> > Cheers
> >
> > P.S. Updated http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
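
To make the shape of that proposed API concrete, here is a toy sketch that fills in the three prototypes with the simplest possible code: a fixed 16KB region split into four data chunks plus a single XOR parity chunk. It is deliberately not Reed-Solomon, and the constants are assumptions made only for this example; a real backend would put GF-Complete/Jerasure arithmetic behind the same three entry points.

    // Toy placeholder: single XOR parity (RAID-4 style), not Reed-Solomon.
    // Region and chunk sizes are assumptions for the example.
    #include <cstring>

    static const int REGION_SIZE = 16384;      // the 16KB region mentioned above
    static const int K = 4;                    // data chunks
    static const int N = K + 1;                // data chunks + 1 parity chunk
    static const int CHUNK_SIZE = REGION_SIZE / K;

    // encode: split the region into K data chunks and compute one parity chunk.
    void encode(char* region, char** chunks) {
        for (int i = 0; i < K; ++i)
            std::memcpy(chunks[i], region + i * CHUNK_SIZE, CHUNK_SIZE);
        std::memset(chunks[K], 0, CHUNK_SIZE);
        for (int i = 0; i < K; ++i)
            for (int b = 0; b < CHUNK_SIZE; ++b)
                chunks[K][b] ^= chunks[i][b];  // parity = XOR of all data chunks
    }

    // decode: with all data chunks present this is just concatenation,
    // exactly as Paul noted for the healthy-cluster case.
    void decode(char** chunks, char* region) {
        for (int i = 0; i < K; ++i)
            std::memcpy(region + i * CHUNK_SIZE, chunks[i], CHUNK_SIZE);
    }

    // repair: rebuild one damaged chunk (data or parity) by XORing the others.
    void repair(char** chunks, int damaged) {
        std::memset(chunks[damaged], 0, CHUNK_SIZE);
        for (int i = 0; i < N; ++i) {
            if (i == damaged) continue;
            for (int b = 0; b < CHUNK_SIZE; ++b)
                chunks[damaged][b] ^= chunks[i][b];
        }
    }

Even in this trivial form the API surfaces the questions raised in the thread: how the number of chunks relates to the pool configuration, and how much data repair() has to pull over the network to rebuild a single chunk.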
> > On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> > > Dear Community,
> > >
> > > I am a young engineer (not software or math, please bear with me) with some suggestions regarding erasure codes. I have never used Ceph before, or any other distributed file system.
> > >
> > > I stumbled upon the suggestion for adding erasure codes to Ceph, as described in this article:
> > > http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> > >
> > > First, I would like to say: great initiative to add erasure codes to Ceph. Ceph needs its own implementation, and it has to be done right, I cannot stress this enough; the software suggested in that article would result in very low performance.
> > >
> > > Why?
> > > Reed-Solomon is normally regarded as being very slow compared to other erasure codes, because the underlying Galois-field multiplication is slow. Please see the video at usenix.org for an explanation.
> > >
> > > The Zfec library and the other suggested software rely on the Vandermonde matrix, a matrix used in Reed-Solomon erasure codes; a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3].
> > > There is something even better, though: by using the Intel SSE2/3 SIMD instructions it is possible to make Reed-Solomon as fast as any of the XOR-based erasure codes (RaptorQ, LT codes, LDPC, etc.).
> > >
> > > The suggested FECpp lib uses this optimisation, but with a relatively small Galois field of only 2^8. Since Ceph aims at unlimited scalability, increasing the size of the Galois field would improve performance [4]. Of course, the configured Ceph object size and/or stripe width have to be taken into account.
> > > Please see https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> > >
> > > The solution
> > > Use the open source GF-Complete library [4] to implement Reed-Solomon in Ceph, in order to allow Ceph to scale to infinity.
> > > James S. Plank, the author of GF-Complete, has developed a library implementing various Reed-Solomon codes called Jerasure: http://web.eecs.utk.edu/~plank/plank/www/software.html
> > > Jerasure 2.0, which uses the GF-Complete arithmetic based on the Intel SSE SIMD instructions, is currently in development and is expected to be released in August 2013 under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes of 8, 16, 32, 64 or 128 bits.
> > >
> > > The limit of this implementation would be the processor's L2/L3 cache, not the underlying arithmetic.
> > >
> > > Best Regards,
> > > Martin Flyvbjerg
> > >
> > > [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> > > [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> > > [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> > > [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do nothing.
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
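
To make Martin's point about Galois-field multiplication concrete: the hot loop of a Reed-Solomon encoder multiplies every byte of a region by a constant in GF(2^w) and XORs it into a parity buffer. Below is a minimal, unoptimised scalar sketch for GF(2^8) using the common 0x11d primitive polynomial (the polynomial and all names are assumptions for the example, not something specified in this thread). The per-byte table lookups in the inner loop are what GF-Complete replaces with SSE shuffle instructions, which is where the speedups reported in [4] come from.

    // Scalar GF(2^8) arithmetic, illustrative only. Field defined by the
    // primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), generator 2.
    #include <cstddef>
    #include <cstdint>

    static uint8_t gf_exp[512];
    static uint8_t gf_log[256];

    void gf_init() {
        int x = 1;
        for (int i = 0; i < 255; ++i) {
            gf_exp[i] = (uint8_t)x;
            gf_log[x] = (uint8_t)i;
            x <<= 1;
            if (x & 0x100) x ^= 0x11d;     // reduce modulo the primitive polynomial
        }
        for (int i = 255; i < 512; ++i)    // duplicate so gf_mul needs no modulo
            gf_exp[i] = gf_exp[i - 255];
    }

    // One multiplication costs two log lookups, an integer add and an exp lookup.
    uint8_t gf_mul(uint8_t a, uint8_t b) {
        if (a == 0 || b == 0) return 0;
        return gf_exp[gf_log[a] + gf_log[b]];
    }

    // Encoder hot loop: parity += c * src over a whole region, one byte at a
    // time (XOR is addition in GF(2^8)). This byte-wise loop is the bottleneck
    // of the table approach; SIMD versions process 16+ bytes per instruction.
    void gf_mul_region_xor(uint8_t* parity, const uint8_t* src, size_t len, uint8_t c) {
        for (size_t i = 0; i < len; ++i)
            parity[i] ^= gf_mul(src[i], c);
    }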