Scott, et al.

Here is an interesting paper from the USENIX HotStorage conference which provides local codes without additional capacity overhead. Check it out (abstract with links to paper and slides):

https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage

Cheers,
pvs

> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>
> Hi Scott,
>
> Point taken.
>
> I was thinking about Loic's decode description, where k+m chunks are requested and the data is decoded once k of them have been received. But he was referring to full stripe reads where all the memory is allocated.
>
> Degraded reads and block repair are a different matter.
>
> pvs
>
> > On Jul 3, 2013, at 4:53 AM, Scott Atchley wrote:
> >
> > On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz <PVonStamwitz@xxxxxxxxxxxxxx> wrote:
> >
> > > Scott,
> > >
> > > You make a good point comparing (5/3) RS with Xorbas, but a small nit:
> > >
> > > "The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas."
> > >
> > > Maybe not. You would probably issue I/O to all of the remaining 7 blocks to cover the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detects multiple local errors.)
> >
> > Why would you request more than five? If one cannot be read, request another.
> >
> > Also, I am not sure that you want to request all five at once, since that will lead to spikes in network traffic and require memory for all five blocks. You will need at least two buffers: request the first two and start decoding. You may want a third buffer to overlap the decoding of the current block with the communication for the next block. It may be that the decode time is less than the communication time and, in that case, you will want to request all of the blocks at once.
> >
> > > What I like about Xorbas is that it is an extension of an (x,y) RS code. You can start with traditional RS. If degraded reads and block repairs are causing a problem, you can add the LRC. If capacity is an issue, you can take it out.
> >
> > I like it too, and Microsoft has something similar with Pyramid codes. That said, my example using traditional RS can provide more fault-tolerance on average given the same amount of storage overhead.
> >
> > > Best,
> > > Paul
> > >
> > > On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
> > >> I think we should be able to cover most cases by adding an interface like:
> > >>
> > >> set<int> minimum_to_read(const set<int> &want_to_read, const set<int> &available_chunks);
> > >>
> > >> which returns the smallest set required to read/rebuild the chunks in want_to_read given the chunks in available_chunks. Alternately, we might include a "cost" for reading each chunk, like:
> > >>
> > >> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, const map<int, int> &available);
> > >>
> > >> which returns the minimum-cost set required to read the specified chunks given a mapping of available chunks to costs. The costs might allow us to consider the difference between reading local chunks vs. remote chunks. This should be sufficient to cover the read case (especially the degraded read case) and the repair case.
> > >> -Sam
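To make that interface concrete, here is a rough, hypothetical sketch of how minimum_to_read_with_cost might look for a plain (k, m) Reed-Solomon layout. This is not Ceph code: k is passed explicitly rather than taken from an encoder context, and the greedy cheapest-first choice is only one possible policy. An LRC plugin would replace the "any k chunks" rule with its local-group logic.

#include <algorithm>
#include <map>
#include <set>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical sketch, not Ceph code: for a plain (k, m) Reed-Solomon
// layout, any k available chunks are enough to rebuild the rest.  Take
// the wanted chunks that are still available first, then pad out to k
// with the cheapest remaining chunks.
std::set<int> minimum_to_read_with_cost(int k,
                                        const std::set<int> &want_to_read,
                                        const std::map<int, int> &available) {
  std::set<int> to_read;

  // If everything we want is directly readable, no decode is needed.
  bool all_present = true;
  for (int chunk : want_to_read) {
    if (available.count(chunk))
      to_read.insert(chunk);
    else
      all_present = false;
  }
  if (all_present)
    return to_read;

  // Otherwise we must decode: gather k chunks in total, cheapest first.
  std::vector<std::pair<int, int> > by_cost;  // (cost, chunk id)
  for (const auto &p : available)
    if (!to_read.count(p.first))
      by_cost.push_back(std::make_pair(p.second, p.first));
  std::sort(by_cost.begin(), by_cost.end());

  for (size_t i = 0; i < by_cost.size() && (int)to_read.size() < k; ++i)
    to_read.insert(by_cost[i].second);

  if ((int)to_read.size() < k)
    throw std::runtime_error("not enough chunks available to decode");
  return to_read;
}

The cost map is what would let a local-code plugin prefer chunks in the same local group over remote or global-stripe chunks, which is exactly the degraded-read difference discussed above.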
> > >>
> > >> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
> > >>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx> wrote:
> > >>>
> > >>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Today Sam pointed out that the API for LRC (see the Xorbas Hadoop Project page, Locally Repairable Codes (LRC), http://smahesh.com/HadoopUSC/ for instance) would need to be different from the one initially proposed:
> > >>>>
> > >>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
> > >>>>
> > >>>> While Plank's video focused on the processor requirements for encoding/decoding, this video focuses on the network and disk I/O requirements.
> > >>>>
> > >>>>> context(k, m, reed-solomon|...) => context* c
> > >>>>> encode(context* c, void* data) => void* chunks[k+m]
> > >>>>> decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
> > >>>>> repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt
> > >>>>>
> > >>>>> The decode function must allow for partial reads:
> > >>>>>
> > >>>>> decode(context* c, int offset, int length, void* chunk[k+m], int* indices_of_erased_chunks, int* missing_chunks) => void* data
> > >>>>>
> > >>>>> If there are not enough chunks to recover the desired data range [offset, offset+length), the function returns NULL and sets missing_chunks to the list of chunks that must be retrieved in order to be able to read the desired data.
> > >>>>>
> > >>>>> If decode is called to read just one chunk and it is missing, Reed-Solomon would return an error and ask for all of the other chunks in order to repair. If the underlying library implements LRC, it would ask for only a subset of the chunks.
> > >>>>>
> > >>>>> An implementation allowing only full reads and using jerasure (which does not do LRC) requires that offset is always zero and length is the size of the object, and it returns a copy of indices_of_erased_chunks if there are not enough chunks to rebuild the missing ones.
> > >>>>>
> > >>>>> Comments are welcome :-)
> > >>>>
> > >>>> I have loosely followed this discussion and I have not looked closely at the proposed API nor at the jerasure interface. My apologies if this has already been addressed.
> > >>>>
> > >>>> It is not clear to me from the proposed API above (ignoring the partial read) what it would do. Was the original intent to encode the entire file using k+m blocks regardless of the file size and of the rados object size?
> > >>>>
> > >>>> If so, how will you map rados objects to the logical k+m objects and vice versa?
> > >>>>
> > >>>> If not, then the initial API needed an offset and length (either logical or rados object).
> > >>>>
> > >>>> I would assume that you would want to operate on rados-sized objects. Given a fixed k+m, you may then have more than one set of k+m objects per file (ignoring the LRC "local" parity blocks). For example, if the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas video), then a 20 MB file would need two sets of encoding blocks: the first for objects 1-10 and the second for objects 11-20.
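To make the arithmetic in that example concrete, here is a small, purely illustrative sketch of the mapping from a logical file offset to an encoding group (1 MB objects, k = 10, objects counted from 0). It does not reflect any actual rados interface or layout.

#include <cstdint>
#include <cstdio>

// Hypothetical illustration only: with 1 MB rados objects and k = 10
// data objects per encoding group, a logical file offset lands in one
// group of k data objects plus that group's parity objects.
struct ChunkLocation {
  uint64_t group;       // which set of k+m encoded objects
  uint64_t data_index;  // 0..k-1 within that group
  uint64_t offset;      // byte offset within the object
};

static ChunkLocation locate(uint64_t file_offset,
                            uint64_t object_size = 1 << 20 /* 1 MB */,
                            uint64_t k = 10) {
  uint64_t object_no = file_offset / object_size;
  ChunkLocation loc = { object_no / k, object_no % k,
                        file_offset % object_size };
  return loc;
}

int main() {
  // A 20 MB file spans objects 0..19, i.e. two encoding groups:
  // group 0 covers objects 0-9, group 1 covers objects 10-19
  // (the message's objects 1-10 and 11-20, counted from 1).
  ChunkLocation loc = locate(15ull << 20);  // byte offset 15 MB
  std::printf("group=%llu index=%llu offset=%llu\n",
              (unsigned long long)loc.group,
              (unsigned long long)loc.data_index,
              (unsigned long long)loc.offset);
  return 0;
}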
> > >>>> Perhaps this is what the context above is for. If so, it should carry the logical offset and the rados object size, no?
> > >>>>
> > >>>> I see value in the Xorbas concept and I wonder if the jerasure library can be modified to generate the local parity blocks such that they can be used to generate the global parity blocks. That would be a question for Jim Plank.
> > >>>
> > >>> The benefit of the Xorbas concept is reduced network and disk I/O on failures, while keeping traditional RS's advantage over 3x replication: higher fault-tolerance with less space.
> > >>>
> > >>> You can do almost the same thing with jerasure without modifying it at all. Using the values from the Xorbas video, they have k data blocks, m global parity blocks, and 2 local parity blocks (each generated from k/2 data blocks) for a total of k+m+2 blocks on disk that can tolerate any m failures. In their example, k = 10 and m = 4, so they store 16 blocks for each 10 data blocks.
> > >>>
> > >>> If you use traditional RS encoding via jerasure with the same amount of storage (16 blocks for each 10 data blocks), you could encode 3 parity blocks for each 5 data blocks. This also consumes 16 blocks for each 10 data blocks, and the fault-tolerance would vary from 3 to 6 failures depending on how the failures fall between the two groups of 5 data + 3 parity blocks, which is higher than the static 4 failures for the Xorbas code. The I/O to recover from a single failure is 5 blocks for both schemes, so it is as efficient as Xorbas. On average it provides more fault-tolerance, though it can be less (four failures within one group of 5 data + 3 parity blocks), but that worst case is the same as 3x replication.
> > >>>
> > >>> Scott
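As a sanity check on those numbers, the following small, illustrative program enumerates how f failures can split across the two RS(5,3) groups. It confirms the variable 3-6 failure tolerance of the two-group layout versus the fixed 4 of the (10,4)+2 Xorbas layout; it is not tied to any real implementation.

#include <cstdio>

// Hypothetical back-of-the-envelope check: two independent RS(5,3)
// groups (8 blocks each, 16 blocks total for 10 data blocks).  A
// failure pattern is recoverable iff neither group loses more than 3
// of its blocks.
int main() {
  const int parity = 3;     // losses a single group can absorb
  const int per_group = 8;  // 5 data + 3 parity blocks per group

  for (int f = 1; f <= 8; ++f) {
    bool always = true, sometimes = false;
    for (int in_first = 0; in_first <= f; ++in_first) {
      int in_second = f - in_first;
      if (in_first > per_group || in_second > per_group)
        continue;  // impossible split of f failures
      bool ok = in_first <= parity && in_second <= parity;
      always = always && ok;
      sometimes = sometimes || ok;
    }
    std::printf("%d failure(s): %s\n", f,
                always ? "always recoverable"
                       : (sometimes ? "recoverable only if spread across groups"
                                    : "never recoverable"));
  }
  return 0;
}

Running it reports 1-3 failures as always recoverable, 4-6 as recoverable only when spread across the two groups, and 7-8 as never recoverable, which matches the "variable from 3-6" claim above.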