On Mon, 19 Aug 2013, Loic Dachary wrote:
>
>
> On 19/08/2013 02:01, Sage Weil wrote:
> > On Sun, 18 Aug 2013, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> Unless I misunderstood something ( which is still possible at this stage ;-) decode() is used both for recovery of missing chunks and retrieval of the original buffer. Decoding the M data chunks is a special case of decoding N <= M chunks out of the M+K chunks that were produced by encode(). It can be used to recover parity chunks as well as data chunks.
> >>
> >> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
> >>
> >> map<int, buffer> decode(const set<int> &want_to_read, const map<int, buffer> &chunks)
> >>
> >> decode chunks to read the content of the want_to_read chunks and return a map associating the chunk number with its decoded content. For instance, in the simplest case M=2,K=1 for an encoded payload of data A and B with parity Z, calling
> >>
> >> decode([1,2], { 1 => 'A', 2 => 'B', 3 => 'Z' })
> >> => { 1 => 'A', 2 => 'B' }
> >>
> >> If however, the chunk B is to be read but is missing it will be:
> >>
> >> decode([2], { 1 => 'A', 3 => 'Z' })
> >> => { 2 => 'B' }
> >
> > Ah, I guess this works when some of the chunks contain the original
> > data (as with a parity code). There are codes that don't work that way,
> > although I suspect we won't use them.
> >
> > Regardless, I wonder if we should generalize slightly and have some
> > methods work in terms of (offset,length) of the original stripe to
> > generalize that bit. Then we would have something like
> >
> >   map<int, buffer> transcode(const set<int> &want_to_read, const map<int,
> >       buffer>& chunks);
> >
> > to go from chunks -> chunks (as we would want to do with, say, a LRC-like
> > code where we can rebuild some shards from a subset of the other shards).
> > And then also have
> >
> >   int decode(const map<int, buffer>& chunks, unsigned offset,
> >       unsigned len, bufferlist *out);
>
> This function would be implemented more or less as:
>
> set<int> want_to_read = range_to_chunks(offset, len) // compute what chunks must be retrieved
> set<int> available = the up set
> set<int> minimum = minimum_to_decode(want_to_read, available);
> map<int, buffer> available_chunks = retrieve_chunks_from_osds(minimum);
> map<int, buffer> chunks = transcode(want_to_read, available_chunks); // repairs if necessary
> out = bufferptr(concat_chunks(chunks), offset - offset of the first chunk, len)
>
> or do you have something else in mind ?

This makes sense. I am still wondering if it is worth generalizing this a
bit further to codes without a nice mapping of a range -> want_to_read
(i.e. that require decoding the entire stripe to get any part of it). For
those codes, we would want to choose the N cheapest/available chunks and
the sequence above would be a bit different. I guess in reality, though, we
probably don't care to implement any such codes (I'm not sure what their
advantages would be, if any)!

sage

>
> >
> > that recovers the original data.
> >
> > In our case, the read path would use decode, and for recovery we would use
> > transcode.
> >
> > We'd also want to have alternate minimum_to_decode* methods, like
> >
> >   virtual set<int> minimum_to_decode(unsigned offset, unsigned len, const
> >       set<int> &available_chunks) = 0;
>
> I also have a convenience wrapper in mind for this but I feel I'm missing something.
>
> Cheers
>
> >
> > What do you think?
> >
> > sage
> >
> >
> >>
> >> Cheers
> >>
> >> On 18/08/2013 19:34, Sage Weil wrote:
> >>> On Sun, 18 Aug 2013, Loic Dachary wrote:
> >>>> Hi Ceph,
> >>>>
> >>>> I've implemented a draft of the Erasure Code plugin loader in the context of http://tracker.ceph.com/issues/5878. It has a trivial unit test and an example plugin.
> >>>> It would be great if someone could do a quick review. The general idea is that the erasure code pool calls something like:
> >>>>
> >>>> ErasureCodePlugin::factory(&erasure_code, "example", parameters)
> >>>>
> >>>> as shown at
> >>>>
> >>>> https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/test/osd/TestErasureCode.cc#L28
> >>>>
> >>>> to get an object implementing the interface
> >>>>
> >>>> https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/osd/ErasureCodeInterface.h
> >>>>
> >>>> which matches the proposal described at
> >>>>
> >>>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
> >>>>
> >>>> The draft is at
> >>>>
> >>>> https://github.com/ceph/ceph/commit/5a2b1d66ae17b78addc14fee68c73985412f3c8c
> >>>>
> >>>> Thanks in advance :-)
> >>>
> >>> I haven't been following this discussion too closely, but taking a look
> >>> now, the first 3 make sense, but
> >>>
> >>>   virtual map<int, bufferptr> decode(const set<int> &want_to_read, const
> >>>       map<int, bufferptr> &chunks) = 0;
> >>>
> >>> it seems like this one should be more like
> >>>
> >>>   virtual int decode(const map<int, bufferptr> &chunks, bufferlist *out);
> >>>
> >>> As in, you'd decode the chunks you have to get the actual data. If you
> >>> want to get (missing) chunks for recovery, you'd do
> >>>
> >>>   minimum_to_decode(...); // see what we need
> >>>   <fetch those chunks from other nodes>
> >>>   decode(...); // reconstruct original buffer
> >>>   encode(...); // encode missing chunks from original data
> >>>
> >>> sage
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >> All that is necessary for the triumph of evil is that good people do nothing.
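For reference, the M=2, K=1 example from the thread (data A and B, parity Z) can be written out as a small toy. This is only a sketch under stated assumptions: the parity is taken to be a plain XOR of the two data chunks, std::string stands in for buffer, chunk ids are restricted to 1..3, and the free functions below are hypothetical illustrations, not the ErasureCodeInterface from the draft.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for the thread's map<int, buffer>; std::string replaces buffer.
using Chunks = std::map<int, std::string>;

// XOR two equally sized chunks byte by byte.
inline std::string xor_chunks(const std::string &a, const std::string &b) {
  if (a.size() != b.size())
    throw std::invalid_argument("chunks must have the same size");
  std::string out(a.size(), '\0');
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = static_cast<char>(a[i] ^ b[i]);
  return out;
}

// Hypothetical encode(): chunks 1 and 2 carry data, chunk 3 is their XOR.
inline Chunks encode(const std::string &a, const std::string &b) {
  return Chunks{{1, a}, {2, b}, {3, xor_chunks(a, b)}};
}

// Chunks to fetch so that decode() can produce want_to_read.
// Assumes chunk ids are limited to 1, 2 and 3.
inline std::set<int> minimum_to_decode(const std::set<int> &want_to_read,
                                       const std::set<int> &available) {
  std::set<int> minimum;
  for (int c : want_to_read) {
    if (available.count(c)) {
      minimum.insert(c);  // present: read it directly
      continue;
    }
    // Missing: a single XOR parity needs both other chunks to rebuild it.
    for (int other = 1; other <= 3; ++other) {
      if (other == c) continue;
      if (!available.count(other))
        throw std::runtime_error("not enough chunks to decode");
      minimum.insert(other);
    }
  }
  return minimum;
}

// Return the content of want_to_read, rebuilding missing chunks if needed.
inline Chunks decode(const std::set<int> &want_to_read, const Chunks &chunks) {
  Chunks out;
  for (int c : want_to_read) {
    auto it = chunks.find(c);
    if (it != chunks.end()) {
      out[c] = it->second;  // chunk is available, no repair required
      continue;
    }
    // XOR of the two other chunks yields the missing one (data or parity).
    std::string rebuilt;
    bool first = true;
    for (int other = 1; other <= 3; ++other) {
      if (other == c) continue;
      auto o = chunks.find(other);
      if (o == chunks.end())
        throw std::runtime_error("cannot rebuild missing chunk");
      rebuilt = first ? o->second : xor_chunks(rebuilt, o->second);
      first = false;
    }
    out[c] = rebuilt;
  }
  return out;
}
```

With chunks produced by encode("A", "B"), calling decode({2}, ...) with only chunks 1 and 3 available gives back { 2 => "B" }, and the same call shape rebuilds the parity chunk 3 from chunks 1 and 2. In this toy a chunks -> chunks repair and a read are the same operation; the (offset, len) variant discussed in the thread would sit on top of it.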