On Fri, 24 Aug 2012, Stephen Perkins wrote:
> Hi all,
>
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
>
> Currently, you can configure RADOS to use copy-based storage to store
> redundant copies of a file (I like 3 redundant copies, so I will use that as
> an example). So... each file is stored in three locations on independent
> hardware. The redundancy has a cost of 3x the storage.
>
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).
>
> I'd like to see where it may be possible to insert a "shim" in the storage
> path such that I can take the file to be stored and apply some erasure
> coding to it. The file then becomes multiple fragments that are handed off
> to RADOS.
>
> The shim would also have to take read requests, fetch some small subset of
> the fragments, and recombine them.
>
> Basically, what I am asking is: where would be the best place to start
> looking at adding this:
> https://tahoe-lafs.org/trac/tahoe-lafs#
>
> (just the erasure-coded part).
>
> Here is the real rationale: extreme availability at only 1.3x or 1.6x
> redundancy:
>
> http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114
>
> Thoughts appreciated,

The good news is that CRUSH has a mode that is intended for erasure/parity
coding, and there are fields reserved in many ceph structures to support this
type of thing.

The bad news is that in order to make it work, it needs to live inside of
rados, not on top of it. The reason is that you need to separate the
fragments across devices/failure domains/etc., which happens at the PG level;
users of librados have no control over that (objects are randomly hashed into
PGs, and then PGs are mapped to devices). And in order to implement it
properly, a lot of code shuffling and wading through OSD internals will be
necessary.
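[For illustration, the split-into-fragments/recombine idea in the quoted
message can be sketched with the simplest possible erasure code: k data
fragments plus one XOR parity fragment, so any single lost fragment can be
rebuilt from the remaining k. This is a toy sketch, not Tahoe-LAFS's or
Ceph's actual coding; with k = 3 it gives 4/3 ~ 1.33x overhead, in the
neighborhood of the 1.3x figure mentioned above.]

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=3):
    """Split data into k equal fragments (zero-padded) plus one XOR parity."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * frag_len, b"\x00")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = xor_bytes(parity, f)
    return frags + [parity]

def decode(frags, lost_index, orig_len):
    """Rebuild the fragment at lost_index from the others, then reassemble."""
    rebuilt = None
    for i, f in enumerate(frags):
        if i == lost_index:
            continue
        rebuilt = f if rebuilt is None else xor_bytes(rebuilt, f)
    data_frags = list(frags[:-1])  # drop the parity fragment
    if lost_index < len(data_frags):
        data_frags[lost_index] = rebuilt
    return b"".join(data_frags)[:orig_len]

data = b"hello rados erasure coding"
frags = encode(data)
# Simulate losing fragment 1 and rebuilding it from the survivors.
assert decode(frags, 1, len(data)) == data
```

[A real deployment would use a Reed-Solomon-style (k, m) code so that any m
fragments can be lost, but the placement problem Sage describes below is the
same either way: each fragment must land in a distinct failure domain.]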
There are some basic abstractions in place, but they are largely ignored and
will need to be shifted around, because replication has been the only
implementation for some time now.

I think the only way to layer this on top of rados and still align your
fragments with failure domains would be to create N different pools with
distinct devices, and store one fragment in each pool...

sage
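[The N-pools layering idea above can be sketched as a toy in-memory
simulation. The `FakePool` class and helper names here are hypothetical
stand-ins for real librados pool handles; the point is only the shape of the
layering: one fragment per pool, so each pool's CRUSH placement keeps
fragments on disjoint devices.]

```python
class FakePool:
    """Stand-in for one rados pool whose CRUSH rule maps to a
    distinct set of devices (a distinct failure domain)."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def write_full(self, oid, data):
        self.objects[oid] = data

    def read(self, oid):
        return self.objects[oid]

def store_fragments(pools, oid, fragments):
    # One fragment per pool: the same object name in each pool, so the
    # shim can find all fragments of an object by its name alone.
    for pool, frag in zip(pools, fragments):
        pool.write_full(oid, frag)

def fetch_fragments(pools, oid):
    return [pool.read(oid) for pool in pools]

# Four pools for a hypothetical 3-data + 1-parity layout.
pools = [FakePool("frag-pool-%d" % i) for i in range(4)]
frags = [b"f0", b"f1", b"f2", b"parity"]
store_fragments(pools, "myobject", frags)
assert fetch_fragments(pools, "myobject") == frags
```

[The obvious cost of this workaround is operational: N pools to manage, N
CRUSH rules to keep disjoint, and no way for the OSDs themselves to repair a
lost fragment, which is why doing it inside rados is the better long-term
answer.]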