Loic Dachary wrote: > Hi Alex, > > If I understand correctly, part of what you propose is to make use of > fountain codes to optimize replicas transmissions. It could even be used > to speed up replication by allowing existing copies to contribute to the > completion of the others. Although this is very appealing, it may be > outside of the scope of my current work. Unless you tell me that both > topics are intimately linked and should be handled simultaneously for some > reason :-) No, just an idea that hit while I was thinking about Raptor codes with regard to Ceph. I'd read up on them a bit ago and had seen a few papers describing video distribution systems that made use of that. > The erasure coded placement group could work as follows: > > * Placement group is using K+M OSDs > > * An object is sent by a client to the primary OSD > * The primary OSD code the object in K+M chunks > * The primary OSD writes the chunks to the OSDs in order Well, one thing to consider is whether you want the additional complexity of explicitly scheduling where they go, when it's been said before that a.) parity recovery is going to be a (somewhat) common operation and b.) we're unlikely to be CPU-bound on encode/decode, considering network and disk IO. It might be a win to have some simple header denoting which chunk is which, toss them onto the appropriate OSDs in whatever order, and just let the read-side decode whichever K it gets. With Reed-Solomon being an optimal erasure code, K are guaranteed to be enough. > * The primary OSD acknowledge the write > > * An object is read by a client from the primary OSD > * The primary OSD reads the K chunks from the K OSDs > * The primary OSD concatenates the K chunks and returns the result The other reason I suggested what I did above is that then you have exactly one code path here - fetch && decode. If the K you get are the input data, it's a nice degenerate case, but having a single codepath might help with reliability and maintainability. Unless it's known to be a bottleneck, this just tweaks my 'premature optimization' meter a little. If you want to add it later, it can be done without breaking compatibility - add a branch for if the chunks you got are the contiguous real data on the read side, preferentially fetch from certain OSDs, and do that location scheduling on the write side for new chunks. Over time, chunks will be allocated in such a way that you get the behavior you propose above, and for older chunks it just decodes them like always. > I'd be interested to know if you see a problem with this approach. I could > elaborate on how it would behave when scrubbing, or when the OSDMap > changes ( either because new OSDs becomes available or because an OSD > fails ). I'd like that, because those are exactly the cases I'd expect the explicit scheduling to be a burden rather than a help. > Cheers > <snip quotes> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html