RE: Erasure coding library API

Scott, et al.

Here is an interesting paper from the USENIX HotStorage conference that presents local codes without additional capacity overhead.

Check it out. (abstract with links to paper and slides)
https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage

Cheers,
pvs

> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
> 
> Hi Scott,
> 
> Point taken.
> 
> I was thinking about Loic's decode description where k+m was requested and
> data was decoded when k blocks were received. But he was referring to full
> stripe reads where all the memory is allocated.
> 
> Degraded reads and block repair are a different matter.
> 
> pvs
> 
> > On Jul 3, 2013, at 4:53 AM, Scott Atchley wrote:
> >
> > On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
> > <PVonStamwitz@xxxxxxxxxxxxxx> wrote:
> >
> > > Scott,
> > >
> > > You make a good point comparing (5/3) RS with Xorbas, but a small nit:
> > >
> > > "The I/O to recover from a single failure for both schemes is 5 blocks
> > > so it is as efficient as Xorbas."
> > >
> > > Maybe not. You would probably issue I/O to all the remaining 7 blocks to
> > > cover for the possibility of double errors. The time to reconstruct would
> > > be about the same, but there could be more disk and network I/O. (LRC will
> > > need to issue I/O to the rest of the global stripe if it detected multiple
> > > local errors.)
> >
> > Why would you request more than five? If one cannot be read, request
> > another.
> >
> > Also, I am not sure that you want to request five at once since it will
> > lead to spikes in network traffic and require memory for all five blocks.
> > You will need at least two buffers. Request the first two and start the
> > decoding. You may want a third buffer to overlap the decoding of the
> > current block with the communication for the next block. It may be that
> > the decode time is less than the communication and, in that case, you will
> > want to request all of the blocks at once.
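
As a minimal sketch of the overlapped fetch/decode idea described above: one chunk request stays in flight while the previously received chunk is folded into the decode. fetch_chunk() and decode_with_chunk() are illustrative stand-ins, not part of jerasure or the proposed API.

    #include <cstddef>
    #include <future>
    #include <vector>

    struct Chunk { std::vector<char> data; };

    // Stand-ins for a network read and an incremental decode step (sketch only).
    Chunk fetch_chunk(int index) { return Chunk{std::vector<char>(4096, char(index))}; }
    void decode_with_chunk(const Chunk &c, char *out) { (void)c; (void)out; }

    void read_and_decode(const std::vector<int> &chunks_to_read, char *out) {
        // Start the first request, then keep exactly one request in flight.
        auto next = std::async(std::launch::async, fetch_chunk, chunks_to_read[0]);
        for (std::size_t i = 0; i < chunks_to_read.size(); ++i) {
            Chunk current = next.get();
            if (i + 1 < chunks_to_read.size())
                next = std::async(std::launch::async, fetch_chunk, chunks_to_read[i + 1]);
            decode_with_chunk(current, out);   // overlaps with the fetch above
        }
    }
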
> >
> > > What I like about Xorbas is that it is an extension of a (x,y) RS. You
> > > can start with traditional RS. If degraded reads and block repairs are
> > > causing a problem, you can add the LRC. If capacity is an issue, you can
> > > take it out.
> >
> > I like it too and Microsoft has something similar with Pyramid codes. That
> > said, my example using traditional RS can provide more fault-tolerance on
> > average given the same amount of storage overhead.
> >
> > >
> > > Best,
> > > Paul
> > >
> > > On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
> > >> I think we should be able to cover most cases by adding an interface
> > >> like:
> > >>
> > >> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
> > >> &available_chunks);
> > >>
> > >> which returns the smallest set required to read/rebuild the chunks in
> > >> want_to_read given the chunks in available_chunks.  Alternately, we
> > >> might include a "cost" for reading each chunk like
> > >>
> > >> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, const
> > >> map<int, int> &available)
> > >>
> > >> which returns the minimum cost set required to read the specified
> > >> chunks given a mapping of available chunks to costs.  The costs might
> > >> allow us to consider the difference between reading local chunks vs
> > >> remote chunks. This should be sufficient to cover the read case (esp
> > >> the degraded read case) and the repair case.
> > >> -Sam
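
To illustrate, a rough sketch of what these two calls might look like for a plain Reed-Solomon plugin, where any k available chunks suffice. The extra k parameter and the greedy cost-based selection are assumptions for this sketch, not part of Sam's proposal.

    #include <algorithm>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using std::map; using std::set;

    // Plain RS: if everything wanted is present, read just that; otherwise
    // any k available chunks allow a full decode.
    set<int> minimum_to_read(const set<int> &want_to_read,
                             const set<int> &available_chunks, unsigned k) {
        if (std::includes(available_chunks.begin(), available_chunks.end(),
                          want_to_read.begin(), want_to_read.end()))
            return want_to_read;
        set<int> result;
        for (int c : available_chunks) {
            if (result.size() >= k) break;
            result.insert(c);
        }
        return result;   // caller must check result.size() == k
    }

    // Cost-aware variant: take the k cheapest available chunks.
    set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
                                       const map<int, int> &available, unsigned k) {
        std::vector<std::pair<int, int>> by_cost;        // (cost, chunk)
        for (const auto &p : available)
            by_cost.push_back({p.second, p.first});
        std::sort(by_cost.begin(), by_cost.end());
        set<int> result;
        for (const auto &p : by_cost) {
            if (result.size() >= k) break;
            result.insert(p.second);
        }
        (void)want_to_read;  // an LRC plugin would use this to pick a smaller local set
        return result;
    }
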
> > >>
> > >> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx>
> > >> wrote:
> > >>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx>
> > >>> wrote:
> > >>>
> > >>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
> > >>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/ for
> > >>>>> instance ) would need to be different from the one initially proposed:
> > >>>>
> > >>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
> > >>>>
> > >>>> While Plank's video focused on the processor requirements for
> > >>>> encoding/decoding, this video focuses on the network and disk I/O
> > >>>> requirements.
> > >>>>
> > >>>>>  context(k, m, reed-solomon|...) => context* c
> > >>>>>  encode(context* c, void* data) => void* chunks[k+m]
> > >>>>>  decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks)
> > >>>>>    => void* data // erased chunks are not used
> > >>>>>  repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks)
> > >>>>>    => void* chunks[k+m] // erased chunks are rebuilt
> > >>>>>
> > >>>>> The decode function must allow for partial read:
> > >>>>>
> > >>>>>  decode(context* c, int offset, int length, void* chunk[k+m],
> > >>>>>    int* indices_of_erased_chunks, int* missing_chunks) => void* data
> > >>>>>
> > >>>>> If there are not enough chunks to recover the desired data range
> > >>>>> [offset, offset+length) the function returns NULL and sets
> > >>>>> missing_chunks to the list of chunks that must be retrieved in order
> > >>>>> to be able to read the desired data.
> > >>>>>
> > >>>>> If decode is called to read just 1 chunk and it is missing,
> > >>>>> reed-solomon would return an error and ask for all other chunks to
> > >>>>> repair. If the underlying library implements LRC, it would ask for a
> > >>>>> subset of the chunks.
> > >>>>>
> > >>>>> An implementation allowing only full reads and using jerasure ( which
> > >>>>> does not do LRC ) requires that offset is always zero, length is the
> > >>>>> size of the object and returns a copy of indices_of_erased_chunks if
> > >>>>> there are not enough chunks to rebuild the missing ones.
> > >>>>>
> > >>>>> Comments are welcome :-)
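
To make the intended call sequence concrete, a rough usage sketch of the degraded partial read implied above: try to decode, and if the library reports missing chunks, fetch them and retry. Only the decode() shape comes from the proposal; the -1-terminated index lists, fetch_chunk(), and the fixed-size missing array are assumptions.

    struct context;
    extern void *decode(context *c, int offset, int length, void *chunks[],
                        int *indices_of_erased_chunks, int *missing_chunks);
    extern void *fetch_chunk(int index);          // stand-in for a RADOS read

    void *degraded_read(context *c, int offset, int length,
                        void *chunks[], int *erased, int total_chunks) {
        int missing[64];                          // assumed >= k+m+1, -1 terminated
        void *data = decode(c, offset, length, chunks, erased, missing);
        if (data == nullptr) {
            // Not enough chunks for this range: fetch what the library asked for.
            // With plain RS this is every surviving chunk; with LRC it can be a
            // smaller local subset.
            for (int i = 0; i < total_chunks && missing[i] >= 0; ++i)
                chunks[missing[i]] = fetch_chunk(missing[i]);
            // (a real caller would also drop the fetched indices from 'erased')
            data = decode(c, offset, length, chunks, erased, missing);
        }
        return data;
    }
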
> > >>>>
> > >>>> I have loosely followed this discussion and I have not looked closely
> > >>>> at the proposed API nor at the jerasure interface. My apologies if this
> > >>>> has already been addressed.
> > >>>>
> > >>>> It is not clear to me from the above proposed API (ignoring the partial
> > >>>> read) what it would do. Was the original intent to encode the entire
> > >>>> file using k+m blocks regardless of the file size and of the rados
> > >>>> object size?
> > >>>>
> > >>>> If so, how will you map rados objects to the logical k+m objects and
> > >>>> vice versa?
> > >>>>
> > >>>> If not, then the initial API needed an offset and length (either
> > >>>> logical or rados object).
> > >>>>
> > >>>> I would assume that you would want to operate on rados sized objects.
> > >>>> Given a fixed k+m, then you may have more than one set of k+m objects
> > >>>> per file. This is ignoring the LRC "local" parity blocks. For example,
> > >>>> if the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas
> > >>>> video), then for a 20 MB file one would need two sets of encoding
> > >>>> blocks. The first for objects 1-10 and the second for objects 11-20.
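
As a tiny illustration of that mapping with the numbers above (1 MB objects, k = 10); the function name and return type are just for this sketch:

    #include <cstdint>

    struct StripeLocation {
        std::uint64_t set;               // which group of k data objects (0-based)
        std::uint64_t object_in_set;     // 0 .. k-1
        std::uint64_t offset_in_object;
    };

    StripeLocation locate(std::uint64_t logical_offset,
                          std::uint64_t object_size = 1 << 20,
                          std::uint64_t k = 10) {
        std::uint64_t object = logical_offset / object_size;   // 0-based data object
        return { object / k, object % k, logical_offset % object_size };
    }

    // e.g. locate(15 * (1ULL << 20)) -> set 1, object 5: the 16th megabyte of a
    // 20 MB file lands in the second encoding set ("objects 11-20" above).
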
> > >>>>
> > >>>> Perhaps, this is what the context is above. If so, it should have the
> > >>>> logical offset and rados object size, no?
> > >>>>
> > >>>> I see value in the Xorbas concept and I wonder if the jerasure library
> > >>>> can be modified to generate the local parity blocks such that they can
> > >>>> be used to generate the global parity blocks. That would be a question
> > >>>> for Jim Plank.
> > >>>
> > >>> The benefit of the Xorbas concept is reduced network and disk I/O on
> > >>> failures, while keeping traditional RS's advantage over 3x replication:
> > >>> higher fault-tolerance with less space.
> > >>>
> > >>> You can do almost the same thing with jerasure without modifying it at
> > >>> all. Using the values from the Xorbas video, they have k data blocks, m
> > >>> global parity blocks, and 2 local parity blocks (each generated from k/2
> > >>> data blocks) for a total of k+m+2 blocks on disk that can tolerate any m
> > >>> failures. In their example, k = 10 and m = 4. They store 16 blocks for
> > >>> each 10 data blocks.
> > >>>
> > >>> If you use traditional RS encoding via jerasure and use the same amount
> > >>> of storage (16 blocks for each 10 data blocks), you could encode 3
> > >>> parity blocks for each 5 data blocks. This would also consume 16 blocks
> > >>> for each 10 data blocks, and the fault-tolerance would vary from 3 to 6
> > >>> failures depending on how the failures fell between the two groups of 5
> > >>> data + 3 parity blocks, which is higher than the static 4 failures for
> > >>> the Xorbas code. The I/O to recover from a single failure for both
> > >>> schemes is 5 blocks so it is as efficient as Xorbas. On average, it
> > >>> provides more fault-tolerance, but it can be less (four failures in one
> > >>> group of 5 data + 3 parity blocks); that worst case is the same as 3x
> > >>> replication.
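
As a toy check of that comparison: a failure pattern is survivable by the two-group layout exactly when neither group of 5 data + 3 parity blocks loses more than 3 members, so the guaranteed tolerance is 3 and the best case is 6, versus a flat 4 for the Xorbas layout. The block numbering here is an assumption for the sketch.

    #include <set>

    // Blocks 0-7 form group A (5 data + 3 parity), blocks 8-15 form group B.
    bool two_group_rs_recoverable(const std::set<int> &failed_blocks) {
        int lost_a = 0, lost_b = 0;
        for (int b : failed_blocks)
            (b < 8 ? lost_a : lost_b)++;
        return lost_a <= 3 && lost_b <= 3;   // at most 3 failures per RS(5,3) group
    }
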
> > >>>
> > >>> Scott
