All, nice discussion on implementing erasure encoding as an option in the Ceph object store. This will be a great feature. I would recommend that a good default be implemented as discussed here, but also that an API or plug-in mechanism be architected at the same time, so a deployment could use some other form of parity/erasure encoding if required or desired. Governments and other vendors will push for (or have hard-coded in an RFP) some other algorithm, and an API/plug-in capability will keep Ceph applicable in those cases. There are proprietary object store solutions today which allow other erasure-code/parity plug-ins, and you will want to be competitive with them. I would expect initial deployments of erasure-coded object stores to be used for static data objects; block device usage would come later, if it proves to be as performant as replicated objects. Replication as done now in Ceph may still be the better block storage choice for DR strategies, with replicas configured at different DR site(s).

just my view of course …

Harvey

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Von-Stamwitz
> Sent: Friday, June 14, 2013 6:12 PM
> To: Loic Dachary; Martin Flyvbjerg
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Comments on Ceph distributed parity implementation
>
> Hi Loic and Martin,
>
> This is a great discussion, and I agree that the performance ramifications of erasure coding need to be thought out very carefully. Since chunks are not only encoded but also distributed across the cluster, we need to pay attention to the network overhead as well as the arithmetic involved in the encoding/decoding.
>
> If I understand the proposal correctly, objects begin their life's journey replicated as normal. As an object grows cold, it gets transformed into an erasure-coded PG in another pool. Subsequent reads will be redirected (ehh). Subsequent writes will first decode the original object and re-replicate it (ouch!). Client writes are never encoded on the fly; they are always replicated (nice).
>
> So...
> encode() is run as a low-priority background process, probably once a week during deep scrubs.
> decode() should be rare (if not, the object shouldn't have been encoded in the first place). If the cluster is healthy, no arithmetic is needed, just concatenation, but a lot of network activity.
> repair() operations will be the most prevalent and may occur at any time during normal self-healing/rebalancing operations.
>
> Therefore, in my opinion, the algorithm we choose must be optimized for repairing damaged and missing chunks. The main problem I have with Reed-Solomon is that it is an MDS code, which maximizes network activity for recalculations. Pyramid codes have the same write (encode) overhead, but better read (repair) overhead.
>
> Loic, I know nothing about Mojette Transforms. From what little I gleaned, it might be good for repair (needing only a subset of chunks within a range to recalculate a missing chunk), but I'm worried about the storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>
> All the best,
> Paul
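
For illustration, here is a minimal sketch of what a pluggable erasure-code backend along the lines Harvey suggests could look like, exposing exactly the three operations Paul lists above. It is hypothetical, not existing Ceph code, and every name in it is invented for the example; a real interface would also need k/m parameters, chunk sizes, and proper error handling.

    // Hypothetical sketch only; not existing Ceph code. A pluggable erasure-code
    // backend could expose the three operations discussed above, so that
    // Reed-Solomon, Cauchy-RS, Pyramid codes, etc. become interchangeable plug-ins.
    #include <map>
    #include <string>
    #include <vector>

    using Chunk = std::vector<char>;

    class ErasureCodeBackend {
    public:
        virtual ~ErasureCodeBackend() {}

        // Encode a region into k data chunks plus m coding chunks.
        virtual int encode(const std::vector<char>& region,
                           std::vector<Chunk>* chunks) = 0;

        // Reassemble the original region from a sufficient subset of chunks,
        // keyed by chunk index.
        virtual int decode(const std::map<int, Chunk>& available,
                           std::vector<char>* region) = 0;

        // Rebuild the listed missing chunks. A repair-friendly code should need
        // fewer surviving chunks (and therefore less network traffic) here than
        // a full decode().
        virtual int repair(const std::map<int, Chunk>& available,
                           const std::vector<int>& missing,
                           std::map<int, Chunk>* rebuilt) = 0;
    };

    // A registry keyed by algorithm name would let an RFP-mandated code be
    // selected per pool without touching the rest of the object store.
    using BackendFactory =
        ErasureCodeBackend* (*)(const std::map<std::string, std::string>& params);

Keeping repair() separate from decode() reflects Paul's point: repair dominates in practice, and an MDS code such as Reed-Solomon always needs k surviving chunks to rebuild one, whereas a code like Pyramid can get away with fewer.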
> On 06/14/2013 3:57 PM, Loic Dachary wrote:
> > Hi Martin,
> >
> > Your explanations are very helpful to better understand the tradeoffs of the existing implementations. To be honest, I was looking forward to your intervention. Not you specifically, of course :-) But someone with a good theoretical background who can judge what's best in the context of Ceph. If you say it's the upcoming library to be released in August 2013, I'll take your word for it.
> >
> > The work currently being done within Ceph is to architect the storage backend (namely placement groups) to make room for distributed parity.
> > My initial idea was to isolate the low-level library under an API that takes a region (16KB for instance, as in gf_unit.c found in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar) as input and outputs chunks that can then be written on different hosts. For instance:
> >
> > encode(char* region, char** chunks) => encode the region into N chunks
> > decode(char** chunks, char* region) => decode the N chunks into a region
> > repair(char** chunks, int damaged) => repair the damaged chunk
> >
> > Do you think it is a sensible approach? And if you do, will I find examples of such higher-level functions in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar? Or elsewhere?
> >
> > I'm a little confused about the relation between GF-Complete (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar), which is very recent (2013), and Jerasure (as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar), which is comparatively older (2008). Do you know how Jerasure 2.0 relates to GF-Complete?
> >
> > For completeness, here is a thread with pointers to the Mojette Transform that is being used as part of RozoFS:
> >
> > http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg14666.html
> >
> > I'm not able to compare it with the other libraries because it seems to take a completely different approach. Do you have an opinion about it?
> >
> > As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month, but I'd love to understand more about this as soon as possible :-)
> >
> > Cheers
> >
> > P.S. Updated http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
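
To make the shape of that proposed API concrete, here is a toy sketch that fills in the three prototypes with the simplest possible code: a fixed 16KB region split into four data chunks plus a single XOR parity chunk. It is deliberately not Reed-Solomon, and the constants are assumptions made only for this example; a real backend would put GF-Complete/Jerasure arithmetic behind the same three entry points.

    // Toy placeholder: single XOR parity (RAID-4 style), not Reed-Solomon.
    // Region and chunk sizes are assumptions for the example.
    #include <cstring>

    static const int REGION_SIZE = 16384;      // the 16KB region mentioned above
    static const int K = 4;                    // data chunks
    static const int N = K + 1;                // data chunks + 1 parity chunk
    static const int CHUNK_SIZE = REGION_SIZE / K;

    // encode: split the region into K data chunks and compute one parity chunk.
    void encode(char* region, char** chunks) {
        for (int i = 0; i < K; ++i)
            std::memcpy(chunks[i], region + i * CHUNK_SIZE, CHUNK_SIZE);
        std::memset(chunks[K], 0, CHUNK_SIZE);
        for (int i = 0; i < K; ++i)
            for (int b = 0; b < CHUNK_SIZE; ++b)
                chunks[K][b] ^= chunks[i][b];  // parity = XOR of all data chunks
    }

    // decode: with all data chunks present this is just concatenation,
    // exactly as Paul noted for the healthy-cluster case.
    void decode(char** chunks, char* region) {
        for (int i = 0; i < K; ++i)
            std::memcpy(region + i * CHUNK_SIZE, chunks[i], CHUNK_SIZE);
    }

    // repair: rebuild one damaged chunk (data or parity) by XORing the others.
    void repair(char** chunks, int damaged) {
        std::memset(chunks[damaged], 0, CHUNK_SIZE);
        for (int i = 0; i < N; ++i) {
            if (i == damaged) continue;
            for (int b = 0; b < CHUNK_SIZE; ++b)
                chunks[damaged][b] ^= chunks[i][b];
        }
    }

Even in this trivial form the API surfaces the questions raised in the thread: how the number of chunks relates to the pool configuration, and how much data repair() has to pull over the network to rebuild a single chunk.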
> > On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> > > Dear Community,
> > >
> > > I am a young engineer (not software or math, please bear with me) with some suggestions regarding erasure codes. I have never used Ceph before, or any other distributed file system.
> > >
> > > I stumbled upon the suggestion for adding erasure codes to Ceph, as described in this article:
> > > http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> > >
> > > First, I would like to say: great initiative to add erasure codes to Ceph. Ceph needs its own implementation, and it has to be done right, I cannot stress this enough; the software suggested in that article would result in very low performance.
> > >
> > > Why?
> > > Reed-Solomon is normally regarded as being very slow compared to other erasure codes, because the underlying Galois-field multiplication is slow. Please see the video at usenix.org for an explanation.
> > >
> > > The Zfec library and the other suggested software rely on the Vandermonde matrix, a matrix used in Reed-Solomon erasure codes; a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3].
> > > There is something even better, though: by using the Intel SSE2/3 SIMD instructions it is possible to make Reed-Solomon as fast as any of the XOR-based erasure codes (RaptorQ, LT codes, LDPC, etc.).
> > >
> > > The suggested FECpp lib uses this optimisation, but with a relatively small Galois field of only 2^8. Since Ceph aims at unlimited scalability, increasing the size of the Galois field would improve performance [4]. Of course, the configured Ceph object size and/or stripe width have to be taken into account.
> > > Please see https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> > >
> > > The solution
> > > Use the open source GF-Complete library [4] to implement Reed-Solomon in Ceph, in order to allow Ceph to scale to infinity.
> > > James S. Plank, the author of GF-Complete, has developed a library implementing various Reed-Solomon codes called Jerasure: http://web.eecs.utk.edu/~plank/plank/www/software.html
> > > Jerasure 2.0, which uses the GF-Complete arithmetic based on the Intel SSE SIMD instructions, is currently in development and is expected to be released in August 2013 under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes of 8, 16, 32, 64 or 128 bits.
> > >
> > > The limit of this implementation would be the processor's L2/L3 cache, not the underlying arithmetic.
> > >
> > > Best Regards,
> > > Martin Flyvbjerg
> > >
> > > [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> > > [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> > > [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> > > [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do nothing.
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
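
To make Martin's point about Galois-field multiplication concrete: the hot loop of a Reed-Solomon encoder multiplies every byte of a region by a constant in GF(2^w) and XORs it into a parity buffer. Below is a minimal, unoptimised scalar sketch for GF(2^8) using the common 0x11d primitive polynomial (the polynomial and all names are assumptions for the example, not something specified in this thread). The per-byte table lookups in the inner loop are what GF-Complete replaces with SSE shuffle instructions, which is where the speedups reported in [4] come from.

    // Scalar GF(2^8) arithmetic, illustrative only. Field defined by the
    // primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), generator 2.
    #include <cstddef>
    #include <cstdint>

    static uint8_t gf_exp[512];
    static uint8_t gf_log[256];

    void gf_init() {
        int x = 1;
        for (int i = 0; i < 255; ++i) {
            gf_exp[i] = (uint8_t)x;
            gf_log[x] = (uint8_t)i;
            x <<= 1;
            if (x & 0x100) x ^= 0x11d;     // reduce modulo the primitive polynomial
        }
        for (int i = 255; i < 512; ++i)    // duplicate so gf_mul needs no modulo
            gf_exp[i] = gf_exp[i - 255];
    }

    // One multiplication costs two log lookups, an integer add and an exp lookup.
    uint8_t gf_mul(uint8_t a, uint8_t b) {
        if (a == 0 || b == 0) return 0;
        return gf_exp[gf_log[a] + gf_log[b]];
    }

    // Encoder hot loop: parity += c * src over a whole region, one byte at a
    // time (XOR is addition in GF(2^8)). This byte-wise loop is the bottleneck
    // of the table approach; SIMD versions process 16+ bytes per instruction.
    void gf_mul_region_xor(uint8_t* parity, const uint8_t* src, size_t len, uint8_t c) {
        for (size_t i = 0; i < len; ++i)
            parity[i] ^= gf_mul(src[i], c);
    }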