On Jul 5, 2013, at 10:06 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:

> On 05/07/2013 14:13, Atchley, Scott wrote:
>> Loic,
>>
>> Erasure codes take whatever you give them. You need to verify the chunk before using it. Perhaps store the checksum in the metadata/context that describes the parity object?
>
> Hi Scott,
>
> Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are ok, it will return incorrectly decoded content?

Unfortunately, yes.

> I'm curious to know the answer, but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its files. I can't remember the last time I saw that happen ;-)

Bit flips happen. Not often, but it is possible. Here is an article from 2011: http://www.linux-mag.com/id/8794/
Also search for "bit rot" and "bit error rate".

Scott

>
> Cheers
>
>>
>> Scott
>>
>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>
>>> Hi,
>>>
>>> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question: what happens when a chunk is corrupted? I.e., if AB is coded with 2+1 into A + B (data) + Z (parity) and Z is replaced with Q, would Reed-Solomon ignore/discard the corrupted chunk? If that's the case, I think it slightly changes what the API should be.
>>>
>>> Cheers
>>>
>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>>>> Scott, et al.
>>>>
>>>> Here is an interesting paper from the USENIX HotStorage conference which provides local codes without additional capacity overhead.
>>>>
>>>> Check it out. (abstract with links to paper and slides)
>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>>>>
>>>> Cheers,
>>>> pvs
>>>>
>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>>>>
>>>>> Hi Scott,
>>>>>
>>>>> Point taken.
>>>>>
>>>>> I was thinking about Loic's decode description where k+m was requested and data was decoded when k blocks were received. But he was referring to full stripe reads where all the memory is allocated.
>>>>>
>>>>> Degraded reads and block repair are a different matter.
>>>>>
>>>>> pvs
>>>>>
>>>>>> On Jul 3, 2013, at 4:53 AM, Scott Atchley wrote:
>>>>>>
>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz <PVonStamwitz@xxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> Scott,
>>>>>>>
>>>>>>> You make a good point comparing (5,3) RS with Xorbas, but a small nit:
>>>>>>>
>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas."
>>>>>>>
>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks to cover the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detects multiple local errors.)
>>>>>>
>>>>>> Why would you request more than five? If one cannot be read, request another.
>>>>>>
>>>>>> Also, I am not sure that you want to request five at once since it will lead to spikes in network traffic and require memory for all five blocks. You will need at least two buffers. Request the first two and start the decoding. You may want a third buffer to overlap the decoding of the current block with the communication for the next block. It may be that the decode time is less than the communication and, in that case, you will want to request all of the blocks at once.
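[A minimal sketch of the double-buffered recovery Scott describes, under stated assumptions: recovering one erased Reed-Solomon chunk is a linear combination of k surviving chunks over GF(2^8); the coefficients would come from the inverted generator matrix (not derived here), and fetch_chunk is a hypothetical stand-in for the network read. This illustrates overlapping fetch and decode, not any actual Ceph or jerasure code.]

    #include <cstdint>
    #include <future>
    #include <vector>

    // Multiply in GF(2^8), reduction polynomial x^8+x^4+x^3+x^2+1 (0x11d).
    static uint8_t gf256_mul(uint8_t a, uint8_t b) {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;
            uint8_t hi = a & 0x80;
            a = (uint8_t)(a << 1);
            if (hi) a ^= 0x1d;
            b >>= 1;
        }
        return p;
    }

    // Hypothetical network read; returns dummy bytes so the sketch compiles.
    static std::vector<uint8_t> fetch_chunk(int id, size_t len) {
        return std::vector<uint8_t>(len, (uint8_t)id);
    }

    // out = sum_i coeff[i] * chunk[ids[i]] over GF(2^8). While chunk i is
    // folded into the result, the fetch of chunk i+1 is already in flight,
    // so only two chunk buffers are live at any time.
    std::vector<uint8_t> recover_chunk(const std::vector<int> &ids,
                                       const std::vector<uint8_t> &coeff,
                                       size_t len) {
        std::vector<uint8_t> out(len, 0);
        auto next = std::async(std::launch::async, fetch_chunk, ids[0], len);
        for (size_t i = 0; i < ids.size(); ++i) {
            std::vector<uint8_t> cur = next.get();        // wait for chunk i
            if (i + 1 < ids.size())                       // overlap: start i+1
                next = std::async(std::launch::async, fetch_chunk, ids[i + 1], len);
            for (size_t j = 0; j < len; ++j)              // fold chunk i in
                out[j] ^= gf256_mul(coeff[i], cur[j]);
        }
        return out;
    }

[If decode time dominates, the same loop degrades gracefully; if communication dominates, as Scott notes next, issuing all k fetches up front would be the better policy.]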
>>>>>>> What I like about Xorbas is that it is an extension of an (x,y) RS. You can start with traditional RS. If degraded reads and repaired blocks are causing a problem, you can add the LRC. If capacity is an issue, you can take it out.
>>>>>>
>>>>>> I like it too, and Microsoft has something similar with Pyramid codes. That said, my example using traditional RS can provide more fault-tolerance on average given the same amount of storage overhead.
>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Paul
>>>>>>>
>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>>>>> I think we should be able to cover most cases by adding an interface like:
>>>>>>>>
>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int> &available_chunks);
>>>>>>>>
>>>>>>>> which returns the smallest set required to read/rebuild the chunks in want_to_read given the chunks in available_chunks. Alternatively, we might include a "cost" for reading each chunk, like
>>>>>>>>
>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, const map<int, int> &available)
>>>>>>>>
>>>>>>>> which returns the minimum-cost set required to read the specified chunks given a mapping of available chunks to costs. The costs might allow us to consider the difference between reading local chunks vs remote chunks. This should be sufficient to cover the read case (esp. the degraded read case) and the repair case.
>>>>>>>> -Sam
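[A sketch of how a plain Reed-Solomon backend might implement Sam's proposed cost-aware call: for RS any k chunks suffice to decode, so the cheapest k available chunks win; an LRC backend would instead consult its local groups. The names follow Sam's proposal, K is an assumed data-chunk count, and nothing here is actual Ceph code.]

    #include <algorithm>
    #include <map>
    #include <set>
    #include <stdexcept>
    #include <vector>

    static const size_t K = 10;  // assumed number of data chunks

    std::set<int> minimum_to_read_with_cost(const std::set<int> &want_to_read,
                                            const std::map<int, int> &available) {
        // Wanted chunks that survived can be read directly.
        std::set<int> result;
        for (int c : want_to_read)
            if (available.count(c))
                result.insert(c);
        if (result.size() == want_to_read.size())
            return result;  // nothing erased, no decode needed

        // Some wanted chunk is erased: RS decode needs k chunks in total,
        // so top up the set with the cheapest remaining available chunks.
        std::vector<std::pair<int, int>> by_cost;  // (cost, chunk id)
        for (const auto &p : available)
            by_cost.emplace_back(p.second, p.first);
        std::sort(by_cost.begin(), by_cost.end());
        for (const auto &p : by_cost) {
            if (result.size() >= K) break;
            result.insert(p.second);
        }
        if (result.size() < K)
            throw std::runtime_error("fewer than k chunks available");
        return result;
    }

[The cost map is what would let the caller prefer local over remote chunks: assign local chunks a lower cost and the sort does the rest.]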
>>>>>>>>
>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx> wrote:
>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Today Sam pointed out that the API for LRC (Xorbas Hadoop Project Page, Locally Repairable Codes (LRC), http://smahesh.com/HadoopUSC/ for instance) would need to be different from the one initially proposed:
>>>>>>>>>>
>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>>>>
>>>>>>>>>> While Plank's video focused on the processor requirements for encoding/decoding, this video focuses on the network and disk I/O requirements.
>>>>>>>>>>
>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c
>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>>>> decode(context* c, void* chunks[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
>>>>>>>>>>> repair(context* c, void* chunks[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt
>>>>>>>>>>>
>>>>>>>>>>> The decode function must allow for partial reads:
>>>>>>>>>>>
>>>>>>>>>>> decode(context* c, int offset, int length, void* chunks[k+m], int* indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>>>>
>>>>>>>>>>> If there are not enough chunks to recover the desired data range [offset, offset+length), the function returns NULL and sets missing_chunks to the list of chunks that must be retrieved in order to be able to read the desired data.
>>>>>>>>>>>
>>>>>>>>>>> If decode is called to read just one chunk and it is missing, Reed-Solomon would return an error and ask for all other chunks to repair. If the underlying library implements LRC, it would ask for a subset of the chunks.
>>>>>>>>>>>
>>>>>>>>>>> An implementation allowing only full reads and using jerasure (which does not do LRC) requires that offset is always zero and length is the size of the object, and returns a copy of indices_of_erased_chunks if there are not enough chunks to rebuild the missing ones.
>>>>>>>>>>>
>>>>>>>>>>> Comments are welcome :-)
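[A sketch of that full-read-only variant layered on jerasure's matrix API; the function names and signatures (reed_sol_vandermonde_coding_matrix, jerasure_matrix_decode with a -1-terminated erasures list) are assumed from the jerasure 1.2 release and should be checked against the installed headers. When more than m chunks are gone, it hands the erased list back through missing_chunks, mirroring the proposed semantics; allocation and error handling are elided.]

    #include <vector>
    extern "C" {
    #include "jerasure.h"
    #include "reed_sol.h"
    }

    struct Context {
        int k, m, w;   // w: word size in bits; 8 keeps chunk sizes flexible
        int *matrix;   // m x k Vandermonde coding matrix
    };

    Context *make_context(int k, int m) {
        Context *c = new Context{k, m, 8, nullptr};
        c->matrix = reed_sol_vandermonde_coding_matrix(k, m, c->w);
        return c;
    }

    // Full-object decode: offset must be 0 and length the object size, as
    // described above. Returns 0 on success; if more than m chunks are
    // erased, copies the erased indices into missing_chunks and returns -1.
    int decode_full(Context *c, char **data, char **coding, int chunk_size,
                    const std::vector<int> &erased,
                    std::vector<int> &missing_chunks) {
        if ((int)erased.size() > c->m) {
            missing_chunks = erased;  // cannot rebuild; caller must fetch these
            return -1;
        }
        std::vector<int> erasures(erased);
        erasures.push_back(-1);       // jerasure expects a -1 terminator
        return jerasure_matrix_decode(c->k, c->m, c->w, c->matrix,
                                      1 /* row_k_ones */, erasures.data(),
                                      data, coding, chunk_size);
    }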
>>>>>>>>>>
>>>>>>>>>> I have loosely followed this discussion and I have not looked closely at the proposed API nor at the jerasure interface. My apologies if this has already been addressed.
>>>>>>>>>>
>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the partial read) what it would do. Was the original intent to encode the entire file using k+m blocks regardless of the file size and of the rados object size?
>>>>>>>>>>
>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects and vice versa?
>>>>>>>>>>
>>>>>>>>>> If not, then the initial API needed an offset and length (either logical or rados object).
>>>>>>>>>>
>>>>>>>>>> I would assume that you would want to operate on rados-sized objects. Given a fixed k+m, you may then have more than one set of k+m objects per file. This is ignoring the LRC "local" parity blocks. For example, if the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas video), then for a 20 MB file one would need two sets of encoding blocks: the first for objects 1-10 and the second for objects 11-20.
>>>>>>>>>>
>>>>>>>>>> Perhaps this is what the context is above. If so, it should have the logical offset and rados object size, no?
>>>>>>>>>>
>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure library can be modified to generate the local parity blocks such that they can be used to generate the global parity blocks. That would be a question for Jim Plank.
>>>>>>>>>
>>>>>>>>> The benefit of the Xorbas concept is reduced network and disk I/O during failures, while maintaining traditional RS's fault-tolerance, which is higher than 3x replication's at less space.
>>>>>>>>>
>>>>>>>>> You can do almost the same thing with jerasure without modifying it at all. Using the values from the Xorbas video, they have k data blocks, m global parity blocks, and 2 local parity blocks (each generated from k/2 data blocks) for a total of k+m+2 blocks on disk that can tolerate any m failures. In their example, k = 10 and m = 4. They store 16 blocks for each 10 data blocks.
>>>>>>>>>
>>>>>>>>> If you use traditional RS encoding via jerasure and used the same amount of storage (16 blocks for each 10 data blocks), you could encode 3 parity blocks for each 5 data blocks. This would consume 16 blocks for each 10 data blocks, and the fault-tolerance would vary from 3 to 6 failures depending on how the failures fell between the two groups of 5 data + 3 parity blocks, which is higher than the static 4 failures for the Xorbas code. The I/O to recover from a single failure for both schemes is 5 blocks, so it is as efficient as Xorbas. On average, it provides more fault-tolerance, but it can be less (four failures from one group of 5 data + 3 parity blocks), though that worst case is the same as 3x replication.
>>>>>>>>>
>>>>>>>>> Scott
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
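[A standalone enumeration backing Scott's "vary from 3 to 6 failures" arithmetic above, under the stated model: two independent (5 data + 3 parity) RS groups of 8 blocks each, where a failure pattern is survivable iff neither group loses more than 3 blocks. Plain C++ using the GCC/Clang popcount builtin; no Ceph or jerasure dependency.]

    #include <cstdio>

    int main() {
        // Blocks 0-7 form group one, 8-15 group two; each RS(5,3) group
        // survives any 3 erasures in it but loses data at 4.
        for (int f = 1; f <= 7; ++f) {
            long total = 0, ok = 0;
            for (unsigned mask = 0; mask < (1u << 16); ++mask) {
                if (__builtin_popcount(mask) != f) continue;
                ++total;
                int g1 = __builtin_popcount(mask & 0x00ffu);
                int g2 = __builtin_popcount(mask & 0xff00u);
                if (g1 <= 3 && g2 <= 3) ++ok;
            }
            std::printf("%d failures: %ld of %ld patterns survivable (%.1f%%)\n",
                        f, ok, total, 100.0 * ok / total);
        }
        return 0;
    }

[Running this shows every 3-failure pattern survives, about 92% of 4-failure patterns do, and a 3/3 split even survives 6 failures, which is the sense in which the two-group scheme beats Xorbas on average while guaranteeing only 3 in the worst case.]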