Re: Comments on Ceph distributed parity implementation

Hi Martin,

Your explanations are very helpful for understanding the tradeoffs of the existing implementations. To be honest, I was looking forward to an intervention like yours. Not yours specifically, of course :-) But one from someone with a good theoretical background who can judge what's best in the context of Ceph. If you say it's the upcoming library to be released in August 2013, I'll take your word for it.

The work currently being done within Ceph is to rearchitect the storage backend (namely placement groups) to make room for distributed parity. My initial idea was to isolate the low-level library under an API that takes a region (16KB for instance, as in gf_unit.c found in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar) as input and outputs chunks that can then be written to different hosts. For instance:

    encode(char* region, char** chunks) => encode the region into N chunks
    decode(char** chunks, char* region) => decode the N chunks into a region
    repair(char** chunks, int damaged)  => repair the damaged chunk
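
To make this more concrete, here is a rough C sketch of the interface I have in mind. All names (struct ceph_erasure, the k/m parameters, and so on) are mine and purely hypothetical, not an existing Ceph or GF-Complete API:

    #include <stddef.h>

    /* Hypothetical plugin interface: the number of data chunks (k),
     * parity chunks (m) and the region size are fixed at setup time,
     * so the placement group logic only ever sees opaque chunks. */
    struct ceph_erasure {
        int k;             /* number of data chunks */
        int m;             /* number of parity chunks */
        size_t chunk_size; /* region size / k */
    };

    /* split a region into k data chunks and compute m parity chunks */
    int erasure_encode(struct ceph_erasure *e, const char *region, char **chunks);

    /* rebuild the full region from any k intact chunks */
    int erasure_decode(struct ceph_erasure *e, char **chunks, char *region);

    /* recompute the damaged chunk from the surviving ones */
    int erasure_repair(struct ceph_erasure *e, char **chunks, int damaged);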

Do you think it is a sensible approach? And if you do, will I find examples of such higher-level functions in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar, or elsewhere?

I'm a little confused about the relation between GF-Complete (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar), which is very recent (2013), and Jerasure (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar), which is comparatively older (2008). Do you know how Jerasure 2.0 relates to GF-Complete?
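
For what it's worth, my encode/decode pseudo-code above seems to map naturally onto what Jerasure 1.2 already exposes. If I read its headers correctly (this is just my understanding, happy to be corrected), a Vandermonde Reed-Solomon cycle looks roughly like:

    #include "jerasure.h"
    #include "reed_sol.h"

    int k = 6, m = 2, w = 8;     /* 6 data chunks, 2 parity chunks, GF(2^8) */
    int size = 16384;            /* bytes per chunk */
    char *data[6], *coding[2];   /* assumed allocated and filled elsewhere */

    /* k x m Reed-Solomon generator matrix over GF(2^w) */
    int *matrix = reed_sol_vandermonde_coding_matrix(k, m, w);

    /* encode: compute the m coding chunks from the k data chunks */
    jerasure_matrix_encode(k, m, w, matrix, data, coding, size);

    /* decode: rebuild lost chunks 0 and 3; the erasure list is -1 terminated */
    int erasures[] = { 0, 3, -1 };
    jerasure_matrix_decode(k, m, w, matrix, 1, erasures, data, coding, size);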

For completeness, here is a thread with pointers to the Mojette Transform, which is used as part of RozoFS:

http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg14666.html

I'm not able to compare it with the other libraries because it seems to take a completely different approach. Do you have an opinion about it?

As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month, but I'd love to understand more about this as soon as possible :-)

Cheers

P.S. Updated http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html for the record

On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> Dear Community,
> I am a young engineer (not a software or math one, please bear with me) with some suggestions regarding erasure codes. I have never used Ceph before, nor any other distributed file system.
> 
> I stumbled upon the suggestion to add erasure codes to Ceph, as described in this article:
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> 
> First, I would like to say: great initiative to add erasure codes to Ceph.
> Ceph needs its own implementation and it has to be done right, I cannot stress this enough; the software suggested in that article would result in very low performance.
> 
> Why?
> Reed-Solomon is normally regarded as being very slow compared to other erasure codes, because the underlying Galois field multiplication is slow. Please see the video at usenix.org for an explanation.
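> 
> For intuition, a naive GF(2^8) multiplication in C looks like this (a rough illustration of my own, not code from any of the libraries mentioned):
> 
>     /* Naive GF(2^8) multiply, modulo the polynomial 0x11B: a bit-by-bit
>      * loop of shifts, XORs and branches. Reed-Solomon needs one such
>      * multiply per data byte per coding chunk, hence the slowness. */
>     static unsigned char gf256_mul(unsigned char a, unsigned char b)
>     {
>         unsigned char p = 0;
>         while (b) {
>             if (b & 1)
>                 p ^= a;                  /* add the current multiple of a */
>             b >>= 1;
>             a = (a << 1) ^ ((a & 0x80) ? 0x1B : 0); /* times x, then reduce */
>         }
>         return p;
>     }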
> 
> The Zfec library and the other suggested software rely on the Vandermonde matrix, a matrix used in Reed-Solomon erasure codes; a faster approach would be to use a Cauchy Reed-Solomon implementation. Please see [1,2,3].
> There is something even better, though: by using the Intel SSE2/3 SIMD instructions it is possible to make Reed-Solomon as fast as any other XOR-based erasure code (RaptorQ, LT codes, LDPC, etc.).
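> 
> The trick behind Cauchy Reed-Solomon is that a multiplication by a constant in GF(2^w) can be rewritten as a w x w bit matrix, so the encoder's data path is nothing but XORs of packets. A toy version of the idea (my own sketch, not taken from [1,2,3]):
> 
>     #include <string.h>
> 
>     #define W 4   /* toy field GF(2^4): one constant = one 4x4 bit matrix */
> 
>     /* bitmat[r] is row r of the bit matrix, as a mask over the W input
>      * packets; out[r] becomes the XOR of the packets selected by that
>      * row, so no Galois field multiplication happens per byte. */
>     static void bitmatrix_apply(const unsigned char bitmat[W],
>                                 char *const in[W], char *out[W], size_t n)
>     {
>         for (int r = 0; r < W; r++) {
>             memset(out[r], 0, n);
>             for (int c = 0; c < W; c++)
>                 if (bitmat[r] & (1u << c))
>                     for (size_t i = 0; i < n; i++)
>                         out[r][i] ^= in[c][i];    /* pure XOR work */
>         }
>     }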
> 
> The suggested FECpp lib uses that optimisation, but with a relatively small Galois field of only 2^8 (which bounds a Reed-Solomon code to roughly 2^8 chunks in total); since Ceph aims at unlimited scalability, increasing the size of the Galois field would be an improvement [4]. Of course the configured Ceph object size and/or stripe width have to be taken into account.
> Please see
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
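> 
> The core of that paper is the split-table trick: with the SSSE3 PSHUFB instruction, a 16-entry nibble lookup table is applied to 16 bytes at once. A condensed sketch of the idea (again my own illustration, not GF-Complete's actual code):
> 
>     #include <stddef.h>
>     #include <tmmintrin.h>   /* SSSE3, for _mm_shuffle_epi8 (PSHUFB) */
> 
>     /* Multiply every byte of src by a constant c in GF(2^8).
>      * lo[i] = c*i and hi[i] = c*(i << 4) are 16-entry tables built once
>      * per constant with an ordinary GF multiply (e.g. gf256_mul above);
>      * the hot loop is then only shuffles, ANDs and XORs. */
>     static void gf256_mul_region(const unsigned char lo[16],
>                                  const unsigned char hi[16],
>                                  const unsigned char *src,
>                                  unsigned char *dst, size_t bytes)
>     {
>         __m128i tlo  = _mm_loadu_si128((const __m128i *)lo);
>         __m128i thi  = _mm_loadu_si128((const __m128i *)hi);
>         __m128i mask = _mm_set1_epi8(0x0f);
>         for (size_t i = 0; i + 16 <= bytes; i += 16) {
>             __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
>             __m128i l = _mm_and_si128(v, mask);                    /* low nibbles  */
>             __m128i h = _mm_and_si128(_mm_srli_epi64(v, 4), mask); /* high nibbles */
>             _mm_storeu_si128((__m128i *)(dst + i),
>                 _mm_xor_si128(_mm_shuffle_epi8(tlo, l),
>                               _mm_shuffle_epi8(thi, h)));
>         }
>     }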
> 
> 
> The solution
> Use the GF-Complete open source library [4] to implement Reed-Solomon in Ceph, in order to allow Ceph to scale to infinity.
> James S. Plank, the author of GF-Complete, has developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> Jerasure 2.0, which uses the GF-Complete arithmetic based on the Intel SSE SIMD instructions, is currently in development and expected to be released in August 2013, under the new BSD license. Jerasure 2.0 also supports arbitrary Galois field sizes: 8, 16, 32, 64 or 128 bits.
> 
> The limit of this implementation would be the processor's L2/L3 cache, not the underlying arithmetic.
> 
> Best Regards
> Martin Flyvbjerg
> 
> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

