Scott,

You make a good point comparing (5/3) RS with Xorbas, but a small nit: "The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas." Maybe not. You would probably issue I/O to all of the remaining 7 blocks to cover for the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detects multiple local errors.)

What I like about Xorbas is that it is an extension of an (x,y) RS code. You can start with traditional RS. If degraded reads and block repairs are causing a problem, you can add the LRC. If capacity is an issue, you can take it out.

Best,

Paul

On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
> I think we should be able to cover most cases by adding an interface like:
>
>     set<int> minimum_to_read(const set<int> &want_to_read,
>                              const set<int> &available_chunks);
>
> which returns the smallest set required to read/rebuild the chunks in
> want_to_read given the chunks in available_chunks. Alternatively, we
> might include a "cost" for reading each chunk, like
>
>     set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>                                        const map<int, int> &available);
>
> which returns the minimum-cost set required to read the specified chunks
> given a mapping of available chunks to costs. The costs might allow us
> to consider the difference between reading local chunks vs remote
> chunks. This should be sufficient to cover the read case (especially
> the degraded read case) and the repair case.
> -Sam
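For illustration, a minimal sketch of how a plugin with local parity groups might implement Sam's first signature. The lrc_layout structure, its fields, and the greedy global fallback are assumptions made for this sketch, not actual Ceph code:

    #include <set>
    #include <vector>

    // Assumed layout: each local group holds its data chunks plus one
    // local parity chunk, so a single erasure inside a group can be
    // repaired from the group's other members; anything worse falls
    // back to a global decode, for which any k chunks suffice.
    struct lrc_layout {
      int k;                              // chunks needed for a global decode
      std::vector<std::set<int>> groups;  // local groups (data + local parity)
    };

    std::set<int> minimum_to_read(const lrc_layout &l,
                                  const std::set<int> &want_to_read,
                                  const std::set<int> &available) {
      std::set<int> need;
      for (int c : want_to_read) {
        if (available.count(c)) {           // chunk survives: just read it
          need.insert(c);
          continue;
        }
        bool local_ok = false;
        for (const auto &grp : l.groups) {  // try cheap local repair first
          if (!grp.count(c))
            continue;
          std::set<int> survivors;
          for (int s : grp)
            if (s != c && available.count(s))
              survivors.insert(s);
          if (survivors.size() == grp.size() - 1) {  // only c is missing
            need.insert(survivors.begin(), survivors.end());
            local_ok = true;
          }
          break;                            // c belongs to exactly one group
        }
        if (!local_ok) {                    // global decode: any k available
          int taken = 0;                    // (fewer than k available means
          for (int s : available) {         //  the read fails upstream anyway)
            if (taken++ == l.k)
              break;
            need.insert(s);
          }
        }
      }
      return need;
    }

Note the greedy fallback makes no attempt to minimize cost; that is what Sam's second signature, with its per-chunk cost map, would allow.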
> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
> > On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx> wrote:
> >
> >> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> >>
> >>> Hi,
> >>>
> >>> Today Sam pointed out that the API for LRC (see the Xorbas Hadoop
> >>> Project Page, Locally Repairable Codes (LRC),
> >>> http://smahesh.com/HadoopUSC/ for instance) would need to be
> >>> different from the one initially proposed:
> >>
> >> An interesting video. Not as entertaining as Jim Plank's video. ;-)
> >>
> >> While Plank's video focused on the processor requirements for
> >> encoding/decoding, this one focuses on the network and disk I/O
> >> requirements.
> >>
> >>>     context(k, m, reed-solomon|...) => context* c
> >>>     encode(context* c, void* data) => void* chunks[k+m]
> >>>     decode(context* c, void* chunks[k+m],
> >>>            int* indices_of_erased_chunks) => void* data
> >>>       // erased chunks are not used
> >>>     repair(context* c, void* chunks[k+m],
> >>>            int* indices_of_erased_chunks) => void* chunks[k+m]
> >>>       // erased chunks are rebuilt
> >>>
> >>> The decode function must allow for partial reads:
> >>>
> >>>     decode(context* c, int offset, int length, void* chunks[k+m],
> >>>            int* indices_of_erased_chunks, int* missing_chunks)
> >>>       => void* data
> >>>
> >>> If there are not enough chunks to recover the desired data range
> >>> [offset, offset+length), the function returns NULL and sets
> >>> missing_chunks to the list of chunks that must be retrieved in order
> >>> to read the desired data.
> >>>
> >>> If decode is called to read just one chunk and it is missing,
> >>> Reed-Solomon would return an error and ask for all the other chunks
> >>> in order to repair. If the underlying library implements LRC, it
> >>> would ask for only a subset of the chunks.
> >>>
> >>> An implementation allowing only full reads and using jerasure (which
> >>> does not do LRC) requires that offset is always zero and length is
> >>> the size of the object, and returns a copy of
> >>> indices_of_erased_chunks if there are not enough chunks to rebuild
> >>> the missing ones.
> >>>
> >>> Comments are welcome :-)
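Rendered as a plain C-style header, the proposal might look like the sketch below. Every name and type is a placeholder chosen for this discussion, not an existing Ceph or jerasure interface:

    #include <cstddef>

    struct context;  // opaque; one per (k, m, technique) configuration

    context *context_create(int k, int m, const char *technique);

    // Split one object of the given size into k data chunks plus m
    // coding chunks.
    int encode(context *c, const void *data, size_t size,
               void *chunks[/* k+m */]);

    // Full-object decode; the slots listed in erased are ignored.
    // Returns NULL if too many chunks are missing to reconstruct.
    void *decode(context *c, void *chunks[/* k+m */],
                 const int *erased, int n_erased);

    // Partial read of [offset, offset+length). On failure, returns NULL
    // and fills missing_chunks with the chunks to fetch before retrying:
    // a plain Reed-Solomon backend would list all remaining chunks, an
    // LRC backend only the relevant local group.
    void *decode_partial(context *c, size_t offset, size_t length,
                         void *chunks[/* k+m */], const int *erased,
                         int n_erased, int *missing_chunks, int *n_missing);

    // Rebuild the chunks listed in erased, in place.
    int repair(context *c, void *chunks[/* k+m */],
               const int *erased, int n_erased);

The caller's retry loop falls out of the decode_partial contract: attempt the decode, and if it returns NULL, fetch the listed missing chunks and try again.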
> >> I have loosely followed this discussion, and I have not looked
> >> closely at the proposed API nor at the jerasure interface. My
> >> apologies if this has already been addressed.
> >>
> >> It is not clear to me from the proposed API above (ignoring the
> >> partial read) what it would do. Was the original intent to encode the
> >> entire file using k+m blocks regardless of the file size and of the
> >> rados object size?
> >>
> >> If so, how will you map rados objects to the logical k+m objects and
> >> vice versa?
> >>
> >> If not, then the initial API needs an offset and length (either
> >> logical or per rados object).
> >>
> >> I would assume that you would want to operate on rados-sized objects.
> >> Given a fixed k+m, you may then have more than one set of k+m objects
> >> per file (ignoring the LRC "local" parity blocks). For example, if
> >> the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas
> >> video), then a 20 MB file would need two sets of encoding blocks: the
> >> first for objects 1-10 and the second for objects 11-20.
> >>
> >> Perhaps this is what the context above is for. If so, it should have
> >> the logical offset and rados object size, no?
> >>
> >> I see value in the Xorbas concept, and I wonder if the jerasure
> >> library can be modified to generate the local parity blocks such that
> >> they can be used to generate the global parity blocks. That would be
> >> a question for Jim Plank.
> >
> > The benefit of the Xorbas concept is reduced network and disk I/O on
> > failures, while keeping traditional RS's higher fault-tolerance than
> > 3x replication and its lower space overhead.
> >
> > You can do almost the same thing with jerasure without modifying it at
> > all. Using the values from the Xorbas video, they have k data blocks,
> > m global parity blocks, and 2 local parity blocks (each generated from
> > k/2 data blocks), for a total of k+m+2 blocks on disk that can
> > tolerate any m failures. In their example, k = 10 and m = 4, so they
> > store 16 blocks for each 10 data blocks.
> >
> > If you used traditional RS encoding via jerasure with the same amount
> > of storage (16 blocks for each 10 data blocks), you could encode 3
> > parity blocks for each 5 data blocks. The fault-tolerance would vary
> > from 3 to 6 failures depending on how the failures fell between the
> > two groups of 5 data + 3 parity blocks, whereas the Xorbas code
> > tolerates a static 4 failures. The I/O to recover from a single
> > failure for both schemes is 5 blocks so it is as efficient as Xorbas.
> > On average it provides more fault-tolerance, but it can be less (four
> > failures within one group of 5 data + 3 parity blocks); that worst
> > case is the same as 3x replication.
> >
> > Scott--
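To make the 3-to-6 range in the comparison above concrete, here is a small brute-force check of the two-group (5 data + 3 parity) layout. It is plain counting, independent of any erasure-code library:

    #include <iostream>

    // Two independent (5 data + 3 parity) RS groups of 8 blocks each.
    // A failure pattern is recoverable iff neither group loses more
    // than 3 of its 8 blocks. For each total failure count f, report
    // whether every placement of f failures survives, only some do,
    // or none do.
    int main() {
      const int group_size = 8, parity = 3;
      for (int f = 1; f <= 2 * parity + 1; ++f) {
        bool all_ok = true, any_ok = false;
        for (int a = 0; a <= f; ++a) {     // a failures in group 1,
          int b = f - a;                   // f - a in group 2
          if (a > group_size || b > group_size)
            continue;
          bool ok = (a <= parity) && (b <= parity);
          all_ok = all_ok && ok;
          any_ok = any_ok || ok;
        }
        std::cout << f << " failures: "
                  << (all_ok ? "always survives"
                             : any_ok ? "survives only if spread out"
                                      : "always loses data")
                  << "\n";
      }
      return 0;
    }

It prints that 1 to 3 failures always survive, 4 to 6 survive only when spread across the two groups, and 7 always lose data, which matches the variable 3-to-6 tolerance (and the 4-failure worst case) described above.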