On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz <PVonStamwitz@xxxxxxxxxxxxxx> wrote: > Scott, > > You make a good point comparing (5/3) RS with Xorbas, but a small nit: > > "The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas." > > Maybe not. You would probably issue I/O to all the remaining 7 blocks to cover for the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detected multiple local errors.) Why would you request more than five? If one cannot be read, request another. Also, I am not sure that you want to request five at once since it will lead to spikes in network traffic and require memory for all five blocks. You will need at least two buffers. Request the first two and start the decoding. You may want a third buffer to overlap the decoding of the current block with the communication for the next block. It may be that the decode time is less than the communication and, in that case, you will want to request all of the blocks at once. > What I like about Xorbas is that it is an extension of a (x,y) RS. You can start with traditional RS. If degraded reads and repaired blocks are causing a problem, you can add the LRC. If capacity is an issue, you can take it out. I like it too and Microsoft has something similar with Pyramid codes. That said, my example using traditional RS can provide more fault-tolerance on average given the same amount of storage overhead. > > Best, > Paul > > On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote: >> I think we should be able to cover most cases by adding an interface like: >> >> set<int> minimum_to_read(const set<int> &want_to_read, const set<int> >> &available_chunks); >> >> which returns the smallest set required to read/rebuild the chunks in >> want_to_read given the chunks in available_chunks. Alternately, we might >> include a "cost" for reading each chunk like >> >> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, const >> map<int, int> &available) >> >> which returns the minimum cost set required to read the specified chunks >> given a mapping of available chunks to costs. The costs might allow us to >> consider the difference between reading local chunks vs remote chunks. >> This should be sufficient to cover the read case (esp the degraded read >> case) and the repair case. >> -Sam >> >> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@xxxxxxxx> >> wrote: >>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@xxxxxxxx> >> wrote: >>> >>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote: >>>> >>>>> Hi, >>>>> >>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project >> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/ for >> instance ) would need to be different from the one initialy proposed: >>>> >>>> An interesting video. Not as entertaining as Jim Plank's video. ;-) >>>> >>>> While Plank's focused on the processor requirements for >> encoding/decoding, this video focuses on the network and disk I/O >> requirements. >>>> >>>>> context(k, m, reed-solomon|...) => context* c >>>>> encode(context* c, void* data) => void* chunks[k+m] >>>>> decode(context* c, void* chunk[k+m], int* >>>>> indices_of_erased_chunks) => void* data // erased chunks are not used >>>>> repair(context* c, void* chunk[k+m], int* >>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are >>>>> rebuilt >>>>> >>>>> The decode function must allow for partial read: >>>>> >>>>> decode(context* c, int offset, int length, void* chunk[k+m], int* >>>>> indices_of_erased_chunks, int* missing_chunks) => void* data >>>>> >>>>> If there are not enough chunks to recover the desired data range >> [offset, offset+length) the function returns NULL and sets missing_chunks >> to the list of chunks that must be retrieved in order to be able to read >> the desired data. >>>>> >>>>> If decode is called to read just 1 chunk and it is missing, reed- >> solomon would return on error and ask for all other chunks to repair. If >> the underlying library implements LRC, it would ask for a subset of the >> chunks. >>>>> >>>>> An implementation allowing only full reads and using jerasure ( which >> does not do LRC ) requires that offset is always zero, length is the size >> of the object and returns a copy of indices_of_erased_chunks if there are >> not enough chunks to rebuild the missing ones. >>>>> >>>>> Comments are welcome :-) >>>> >>>> I have loosely followed this discussion and I have not looked closely >> at the proposed API nor at the jerasure interface. My apologies if this >> has already been addressed. >>>> >>>> It is not clear to me from the above proposed API (ignoring the partial >> read) what it would do. Was the original intent to encode the entire file >> using k+m blocks irregardless of the file size and of the rados object >> size? >>>> >>>> If so, how will you map rados objects to the logical k+m objects and >> vice versa? >>>> >>>> If not, then the initial API needed an offset and length (either >> logical or rados object). >>>> >>>> I would assume that you would want to operate on rados sized objects. >> Given a fixed k+m, then you may have more than one set of k+m objects per >> file. This is ignoring the LRC "local" parity blocks. For example, if the >> rados object size if 1 MB and k = 10 and m = 4 (as in the Xorbas video), >> then for a 20 MB file one would need two sets of encoding blocks. The >> first for objects 1-10 and the second for objects 11-20. >>>> >>>> Perhaps, this is what the context is above. If so, it should have the >> logical offset and rados object size, no? >>>> >>>> I see value in the Xorbas concept and I wonder if the jerasure library >> can be modified to generate the local parity blocks such that they can be >> used to generate the global parity blocks. That would be a question for >> Jim Plank. >>> >>> The benefits of the Xorbas concept is reduced network and disk I/O for >> failures while maintaining traditional RS's higher fault-tolerance than 3x >> replication while using less space. >>> >>> You can do almost the same thing with jerasure without modifying it at >> all. Using the values from the Xorbas video, they have k data blocks, m >> global parity blocks, and 2 local parity blocks (generated from k/2 data >> blocks) for a total of k+m+2 blocks on disk that can tolerate any m >> failures. In their example, k = 10 and m = 4. They store 16 blocks for >> each 10 data blocks. >>> >>> If you use traditional RS encoding via jerasure and used the same amount >> of storage (16 blocks for each 10 data blocks), you could encode 3 parity >> blocks for each 5 data blocks. This would consume 16 data blocks for each >> 10 data blocks and the fault-tolerance would be variable from 3-6 failures >> depending on how the failures fell between the two groups of 5 blocks >> which is higher than the static 4 failures for the Xorbas code. The I/O to >> recover from a single failure for both schemes is 5 blocks so it is as >> efficient as Xorbas. On average, it provides more fault-tolerance, but it >> can be less (four failures from one group of 5 data + 3 parity blocks), >> but that worst case is the same as 3x replication. >>> >>> Scott-- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo >>> info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at >> http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html