On Jul 5, 2013, at 12:34 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote: > Hi Scott, > > That's unfortunate indeed and I now better understand why it is necessary to add a signature to the chunks. I added the following to: > > https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst > > Bit flips happen. Not often, but it is possible. Here is an article from 2011 also search for "bit rot" and "bit error rate". To detect corrupted chunks, a signature (SHA1 for instance) should be added as an attribute of the file containing the chunk so that deep scrubbing can check that the chunk is valid by rehashing the content of the chunk and compare it with the signature. Note, this can be local to the OSD to avoid sending an invalid chunk of the network. > While writing a more detailed version of the API taking your comments and Sam's comment in account, I realized that I don't know if it is possible to recover a parity chunk. If a data chunk is missing, it is enough to recover with a decode operation and write it back ( assuming a systematic code is used ). Would it be possible to do something similar with parity chunks ? Or would we need to re-encode all the parity chunks to write just one of them ? I assume the later but ... I'm not sure ;-) Yes, you can recover a parity chunk just as you would a data chunk. I have not used the jerasure library, so I do not know what it requires. > > Cheers > > On 05/07/2013 17:02, Atchley, Scott wrote: >> On Jul 5, 2013, at 10:06 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote: >> >>> On 05/07/2013 14:13, Atchley, Scott wrote:> Loic, >>>> >>>> Erasure codes take what ever you give them. You need to verify the chunk before using it. Perhaps storing the checksum in the metadata/context that describes the parity object? >>> >>> Hi Scott, >>> >>> Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are ok it will return an incorrectly decoded content ? >> >> Unfortunately, yes. >> >>> I'm curious to know the answer but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its file. I can't remember the last time I saw that happen ;-) >> >> Bit flips happen. Not often, but it is possible. Here is an article from 2011: >> >> http://www.linux-mag.com/id/8794/ >> >> also search for "bit rot" and "bit error rate". >> >> Scott >> >>> >>> Cheers >>> >>>> >>>> Scott >>>> >>>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote: >>>> >>>>> Hi, >>>>> >>>>> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question : what happens when a chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard the corrupted chunk ? If that's the case I think it slightly changes what the API should be. >>>>> >>>>> Cheers >>>>> >>>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote: >>>>>> Scott, et al. >>>>>> >>>>>> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead. >>>>>> >>>>>> Check it out. (abstract with links to paper and slides) >>>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage >>>>>> >>>>>> Cheers, >>>>>> pvs >>>>>> >>>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote: >>>>>>> >>>>>>> Hi Scott, >>>>>>> >>>>>>> Point taken. >>>>>>> >>>>>>> I was thinking about Loic's decode description where k+m was requested and >>>>>>> data was decoded when k blocks were received. But he was referring to full >>>>>>> stripe reads where all the memory is allocated. >>>>>>> >>>>>>> Degraded reads and block repair are a different matter. >>>>>>> >>>>>>> pvs >>>>>>> >>>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote: >>>>>>>> >>>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz >>>>>>>> <PVonStamwitz@xxxxxxxxxxxxxx> wrote: >>>>>>>> >>>>>>>>> Scott, >>>>>>>>> >>>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit: >>>>>>>>> >>>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks >>>>>>>> so it is as efficient as Xorbas." >>>>>>>>> >>>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks >>>>>>> to >>>>>>>> cover for the possibility of double errors. The time to reconstruct >>>>>>> would >>>>>>>> be about the same, but there could be more disk and network I/O. (LRC >>>>>>> will >>>>>>>> need to issue I/O to the rest of the global stripe if it detected >>>>>>> multiple >>>>>>>> local errors.) >>>>>>>> >>>>>>>> Why would you request more than five? If one cannot be read, request >>>>>>>> another. >>>>>>>> >>>>>>>> Also, I am not sure that you want to request five at once since it will >>>>>>>> lead to spikes in network traffic and require memory for all five blocks. >>>>>>>> You will need at least two buffers. Request the first two and start the >>>>>>>> decoding. You may want a third buffer to overlap the decoding of the >>>>>>>> current block with the communication for the next block. It may be that >>>>>>>> the decode time is less than the communication and, in that case, you >>>>>>> will >>>>>>>> want to request all of the blocks at once. >>>>>>>> >>>>>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You >>>>>>>> can start with traditional RS. If degraded reads and repaired blocks are >>>>>>>> causing a problem, you can add the LRC. If capacity is an issue, you can >>>>>>>> take it out. >>>>>>>> >>>>>>>> I like it too and Microsoft has something similar with Pyramid codes. >>>>>>> That >>>>>>>> said, my example using traditional RS can provide more fault-tolerance >>>>>>> on >>>>>>>> average given the same amount of storage overhead. >>>>>>>> >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Paul >>>>>>>>> >>>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote: >>>>>>>>>> I think we should be able to cover most cases by adding an interface >>>>>>>> like: >>>>>>>>>> >>>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int> >>>>>>>>>> &available_chunks); >>>>>>>>>> >>>>>>>>>> which returns the smallest set required to read/rebuild the chunks in >>>>>>>>>> want_to_read given the chunks in available_chunks. Alternately, we >>>>>>>> might >>>>>>>>>> include a "cost" for reading each chunk like >>>>>>>>>> >>>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, >>>>>>> const >>>>>>>>>> map<int, int> &available) >>>>>>>>>> >>>>>>>>>> which returns the minimum cost set required to read the specified >>>>>>>> chunks >>>>>>>>>> given a mapping of available chunks to costs. The costs might allow >>>>>>> us >>>>>>>> to >>>>>>>>>> consider the difference between reading local chunks vs remote chunks. >>>>>>>>>> This should be sufficient to cover the read case (esp the degraded >>>>>>> read >>>>>>>>>> case) and the repair case. >>>>>>>>>> -Sam >>>>>>>>>> >>>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx> >>>>>>>>>> wrote: >>>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project >>>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/ >>>>>>> for >>>>>>>>>> instance ) would need to be different from the one initialy proposed: >>>>>>>>>>>> >>>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-) >>>>>>>>>>>> >>>>>>>>>>>> While Plank's focused on the processor requirements for >>>>>>>>>> encoding/decoding, this video focuses on the network and disk I/O >>>>>>>>>> requirements. >>>>>>>>>>>> >>>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c >>>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m] >>>>>>>>>>>>> decode(context* c, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not >>>>>>>> used >>>>>>>>>>>>> repair(context* c, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks >>>>>>> are >>>>>>>>>>>>> rebuilt >>>>>>>>>>>>> >>>>>>>>>>>>> The decode function must allow for partial read: >>>>>>>>>>>>> >>>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data >>>>>>>>>>>>> >>>>>>>>>>>>> If there are not enough chunks to recover the desired data range >>>>>>>>>> [offset, offset+length) the function returns NULL and sets >>>>>>>> missing_chunks >>>>>>>>>> to the list of chunks that must be retrieved in order to be able to >>>>>>>> read >>>>>>>>>> the desired data. >>>>>>>>>>>>> >>>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed- >>>>>>>>>> solomon would return on error and ask for all other chunks to repair. >>>>>>>> If >>>>>>>>>> the underlying library implements LRC, it would ask for a subset of >>>>>>> the >>>>>>>>>> chunks. >>>>>>>>>>>>> >>>>>>>>>>>>> An implementation allowing only full reads and using jerasure >>>>>>>> ( which >>>>>>>>>> does not do LRC ) requires that offset is always zero, length is the >>>>>>>> size >>>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if there >>>>>>>> are >>>>>>>>>> not enough chunks to rebuild the missing ones. >>>>>>>>>>>>> >>>>>>>>>>>>> Comments are welcome :-) >>>>>>>>>>>> >>>>>>>>>>>> I have loosely followed this discussion and I have not looked >>>>>>> closely >>>>>>>>>> at the proposed API nor at the jerasure interface. My apologies if >>>>>>> this >>>>>>>>>> has already been addressed. >>>>>>>>>>>> >>>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the >>>>>>>> partial >>>>>>>>>> read) what it would do. Was the original intent to encode the entire >>>>>>>> file >>>>>>>>>> using k+m blocks irregardless of the file size and of the rados >>>>>>> object >>>>>>>>>> size? >>>>>>>>>>>> >>>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects >>>>>>> and >>>>>>>>>> vice versa? >>>>>>>>>>>> >>>>>>>>>>>> If not, then the initial API needed an offset and length (either >>>>>>>>>> logical or rados object). >>>>>>>>>>>> >>>>>>>>>>>> I would assume that you would want to operate on rados sized >>>>>>> objects. >>>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects >>>>>>>> per >>>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if >>>>>>>> the >>>>>>>>>> rados object size if 1 MB and k = 10 and m = 4 (as in the Xorbas >>>>>>> video), >>>>>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The >>>>>>>>>> first for objects 1-10 and the second for objects 11-20. >>>>>>>>>>>> >>>>>>>>>>>> Perhaps, this is what the context is above. If so, it should have >>>>>>> the >>>>>>>>>> logical offset and rados object size, no? >>>>>>>>>>>> >>>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure >>>>>>>> library >>>>>>>>>> can be modified to generate the local parity blocks such that they >>>>>>> can >>>>>>>> be >>>>>>>>>> used to generate the global parity blocks. That would be a question >>>>>>> for >>>>>>>>>> Jim Plank. >>>>>>>>>>> >>>>>>>>>>> The benefits of the Xorbas concept is reduced network and disk I/O >>>>>>> for >>>>>>>>>> failures while maintaining traditional RS's higher fault-tolerance >>>>>>> than >>>>>>>> 3x >>>>>>>>>> replication while using less space. >>>>>>>>>>> >>>>>>>>>>> You can do almost the same thing with jerasure without modifying it >>>>>>> at >>>>>>>>>> all. Using the values from the Xorbas video, they have k data blocks, >>>>>>> m >>>>>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2 >>>>>>>> data >>>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m >>>>>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks >>>>>>> for >>>>>>>>>> each 10 data blocks. >>>>>>>>>>> >>>>>>>>>>> If you use traditional RS encoding via jerasure and used the same >>>>>>>> amount >>>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3 >>>>>>>> parity >>>>>>>>>> blocks for each 5 data blocks. This would consume 16 data blocks for >>>>>>>> each >>>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6 >>>>>>>> failures >>>>>>>>>> depending on how the failures fell between the two groups of 5 blocks >>>>>>>>>> which is higher than the static 4 failures for the Xorbas code. The >>>>>>> I/O >>>>>>>> to >>>>>>>>>> recover from a single failure for both schemes is 5 blocks so it is >>>>>>> as >>>>>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance, >>>>>>> but >>>>>>>> it >>>>>>>>>> can be less (four failures from one group of 5 data + 3 parity >>>>>>> blocks), >>>>>>>>>> but that worst case is the same as 3x replication. >>>>>>>>>>> >>>>>>>>>>> Scott-- >>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph- >>>>>>> devel" >>>>>>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo >>>>>>>>>>> info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>> -- >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>>>> in >>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo >>>>>>> info >>>>>>>> at >>>>>>>>>> http://vger.kernel.org/majordomo-info.html >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> >>>>> >>>>> -- >>>>> Loïc Dachary, Artisan Logiciel Libre >>>>> All that is necessary for the triumph of evil is that good people do nothing. >>>>> >>>> >>> >>> -- >>> Loïc Dachary, Artisan Logiciel Libre >>> All that is necessary for the triumph of evil is that good people do nothing. >>> >> > > -- > Loïc Dachary, Artisan Logiciel Libre > All that is necessary for the triumph of evil is that good people do nothing. > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html