Re: Erasure coding library API

"Atchley, Scott" <atchleyes@xxxxxxxx> · Fri, 5 Jul 2013 08:13:52 -0400

Loic,

Erasure codes take what ever you give them. You need to verify the chunk before using it. Perhaps storing the checksum in the metadata/context that describes the parity object?

Scott

On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:

> Hi,
> 
> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question : what happens when a chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard the corrupted chunk ? If that's the case I think it slightly changes what the API should be.
> 
> Cheers
> 
> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>> Scott, et al.
>> 
>> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
>> 
>> Check it out. (abstract with links to paper and slides)
>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>> 
>> Cheers,
>> pvs
>> 
>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>> 
>>> Hi Scott,
>>> 
>>> Point taken.
>>> 
>>> I was thinking about Loic's decode description where k+m was requested and
>>> data was decoded when k blocks were received. But he was referring to full
>>> stripe reads where all the memory is allocated.
>>> 
>>> Degraded reads and block repair are a different matter.
>>> 
>>> pvs
>>> 
>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote:
>>>> 
>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
>>>> <PVonStamwitz@xxxxxxxxxxxxxx> wrote:
>>>> 
>>>>> Scott,
>>>>> 
>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
>>>>> 
>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks
>>>> so it is as efficient as Xorbas."
>>>>> 
>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks
>>> to
>>>> cover for the possibility of double errors. The time to reconstruct
>>> would
>>>> be about the same, but there could be more disk and network I/O. (LRC
>>> will
>>>> need to issue I/O to the rest of the global stripe if it detected
>>> multiple
>>>> local errors.)
>>>> 
>>>> Why would you request more than five? If one cannot be read, request
>>>> another.
>>>> 
>>>> Also, I am not sure that you want to request five at once since it will
>>>> lead to spikes in network traffic and require memory for all five blocks.
>>>> You will need at least two buffers. Request the first two and start the
>>>> decoding. You may want a third buffer to overlap the decoding of the
>>>> current block with the communication for the next block. It may be that
>>>> the decode time is less than the communication and, in that case, you
>>> will
>>>> want to request all of the blocks at once.
>>>> 
>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You
>>>> can start with traditional RS. If degraded reads and repaired blocks are
>>>> causing a problem, you can add the LRC. If capacity is an issue, you can
>>>> take it out.
>>>> 
>>>> I like it too and Microsoft has something similar with Pyramid codes.
>>> That
>>>> said, my example using traditional RS can provide more fault-tolerance
>>> on
>>>> average given the same amount of storage overhead.
>>>> 
>>>>> 
>>>>> Best,
>>>>> Paul
>>>>> 
>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>>> I think we should be able to cover most cases by adding an interface
>>>> like:
>>>>>> 
>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
>>>>>> &available_chunks);
>>>>>> 
>>>>>> which returns the smallest set required to read/rebuild the chunks in
>>>>>> want_to_read given the chunks in available_chunks.  Alternately, we
>>>> might
>>>>>> include a "cost" for reading each chunk like
>>>>>> 
>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>>> const
>>>>>> map<int, int> &available)
>>>>>> 
>>>>>> which returns the minimum cost set required to read the specified
>>>> chunks
>>>>>> given a mapping of available chunks to costs.  The costs might allow
>>> us
>>>> to
>>>>>> consider the difference between reading local chunks vs remote chunks.
>>>>>> This should be sufficient to cover the read case (esp the degraded
>>> read
>>>>>> case) and the repair case.
>>>>>> -Sam
>>>>>> 
>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx>
>>>>>> wrote:
>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
>>> for
>>>>>> instance ) would need to be different from the one initialy proposed:
>>>>>>>> 
>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>> 
>>>>>>>> While Plank's focused on the processor requirements for
>>>>>> encoding/decoding, this video focuses on the network and disk I/O
>>>>>> requirements.
>>>>>>>> 
>>>>>>>>> context(k, m, reed-solomon|...) => context* c
>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>> decode(context* c, void* chunk[k+m], int*
>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not
>>>> used
>>>>>>>>> repair(context* c, void* chunk[k+m], int*
>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks
>>> are
>>>>>>>>> rebuilt
>>>>>>>>> 
>>>>>>>>> The decode function must allow for partial read:
>>>>>>>>> 
>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int*
>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>> 
>>>>>>>>> If there are not enough chunks to recover the desired data range
>>>>>> [offset, offset+length) the function returns NULL and sets
>>>> missing_chunks
>>>>>> to the list of chunks that must be retrieved in order to be able to
>>>> read
>>>>>> the desired data.
>>>>>>>>> 
>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed-
>>>>>> solomon would return on error and ask for all other chunks to repair.
>>>> If
>>>>>> the underlying library implements LRC, it would ask for a subset of
>>> the
>>>>>> chunks.
>>>>>>>>> 
>>>>>>>>> An implementation allowing only full reads and using jerasure
>>>> ( which
>>>>>> does not do LRC ) requires that offset is always zero, length is the
>>>> size
>>>>>> of the object and returns a copy of indices_of_erased_chunks if there
>>>> are
>>>>>> not enough chunks to rebuild the missing ones.
>>>>>>>>> 
>>>>>>>>> Comments are welcome :-)
>>>>>>>> 
>>>>>>>> I have loosely followed this discussion and I have not looked
>>> closely
>>>>>> at the proposed API nor at the jerasure interface. My apologies if
>>> this
>>>>>> has already been addressed.
>>>>>>>> 
>>>>>>>> It is not clear to me from the above proposed API (ignoring the
>>>> partial
>>>>>> read) what it would do. Was the original intent to encode the entire
>>>> file
>>>>>> using k+m blocks irregardless of the file size and of the rados
>>> object
>>>>>> size?
>>>>>>>> 
>>>>>>>> If so, how will you map rados objects to the logical k+m objects
>>> and
>>>>>> vice versa?
>>>>>>>> 
>>>>>>>> If not, then the initial API needed an offset and length (either
>>>>>> logical or rados object).
>>>>>>>> 
>>>>>>>> I would assume that you would want to operate on rados sized
>>> objects.
>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects
>>>> per
>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if
>>>> the
>>>>>> rados object size if 1 MB and k = 10 and m = 4 (as in the Xorbas
>>> video),
>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The
>>>>>> first for objects 1-10 and the second for objects 11-20.
>>>>>>>> 
>>>>>>>> Perhaps, this is what the context is above. If so, it should have
>>> the
>>>>>> logical offset and rados object size, no?
>>>>>>>> 
>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure
>>>> library
>>>>>> can be modified to generate the local parity blocks such that they
>>> can
>>>> be
>>>>>> used to generate the global parity blocks. That would be a question
>>> for
>>>>>> Jim Plank.
>>>>>>> 
>>>>>>> The benefits of the Xorbas concept is reduced network and disk I/O
>>> for
>>>>>> failures while maintaining traditional RS's higher fault-tolerance
>>> than
>>>> 3x
>>>>>> replication while using less space.
>>>>>>> 
>>>>>>> You can do almost the same thing with jerasure without modifying it
>>> at
>>>>>> all. Using the values from the Xorbas video, they have k data blocks,
>>> m
>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2
>>>> data
>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m
>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks
>>> for
>>>>>> each 10 data blocks.
>>>>>>> 
>>>>>>> If you use traditional RS encoding via jerasure and used the same
>>>> amount
>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3
>>>> parity
>>>>>> blocks for each 5 data blocks. This would consume 16 data blocks for
>>>> each
>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6
>>>> failures
>>>>>> depending on how the failures fell between the two groups of 5 blocks
>>>>>> which is higher than the static 4 failures for the Xorbas code. The
>>> I/O
>>>> to
>>>>>> recover from a single failure for both schemes is 5 blocks so it is
>>> as
>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance,
>>> but
>>>> it
>>>>>> can be less (four failures from one group of 5 data + 3 parity
>>> blocks),
>>>>>> but that worst case is the same as 3x replication.
>>>>>>> 
>>>>>>> Scott--
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>>> devel"
>>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>>> info
>>>> at
>>>>>> http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html