Re: Comments on Ceph distributed parity implementation

>> The first, simplest implementation is likely to be fit to use with RGW and
>> probably too slow to use with RBD. Do you think we should try to optimize
>> for RBD right now ?
> 
> Yes, RGW is the obvious best candidate for the first implementation. We don't need to implement for RBD and CephFS now, but we should consider how the design would handle other applications in the future. The alternative is to optimize purely for RGW and provide an API/plug-in capability suggested by Harvey Skinner to make way for optimized solutions for other applications.
> 

I agree that the design should leave room to plug in optimizations in the future. I've tried to figure out where the API/plug-in should fit, and I see two candidates:

a) pluggable placement group
b) pluggable erasure code library

The pluggable placement group capability is what I'm working on right now. It requires some re-architecting of the current code, and the API is starting to emerge. The implementation should eventually live in a separate shared library ( say ErasureCodePG ) loaded at run time and selected with a configuration option when creating a pool. I suspect that experimenting with new optimization strategies will be done by hacking ErasureCodePG and creating new pools that use it.

Let's say we find a way to optimize for RBD and implement it in an RBDErasureCodePG placement group. We could then configure the RBD pool to use this placement group backend while keeping the ErasureCodePG backend for RGW. Later on it may make sense to merge the two, or at least make sure they share code for maintenance purposes. But that probably leaves all the room we need to experiment until a general solution is found.

The pluggable erasure code library API will be something like what is described in http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

    context(k, m, reed-solomon|...) => context* c
    encode(context* c, void* data) => void* chunks[k+m]
    decode(context* c, void* chunks[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
    repair(context* c, void* chunks[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt

It won't be enough for hierarchical codes, but they don't seem to be considered attractive at the moment. It should be enough for LRC ( http://anrg.usc.edu/~maheswaran/Xorbas.pdf ), since LRC only requires an additional argument to the context ( the number of chunks required to do a local repair ).

The need for another API ( in addition to pluggable placement groups and the pluggable erasure code library ) may appear in the future; I can't see it right now. I try to refrain from over-engineering while making sure we won't need to re-architect because something obvious was overlooked. This discussion is helping a lot :-)

What do you think ?

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



