Re: Best insertion point for storage shim

On Aug 24, 2012, at 11:49 AM, Stephen Perkins wrote:

> Hi all,
> 
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
> 
> Currently, you can configure RADOS to use copy-based storage to store
> redundant copies of a file (I like 3 redundant copies, so I will use that as
> an example).  So... each file is stored in three locations on independent
> hardware.  The redundancy has a cost of 3x the storage.
> 
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).
> 
> I'd like to see where it may be possible to insert a "shim" into the storage
> path such that I can take the file to be stored and apply some erasure
> coding to it, so that the file becomes multiple fragments that are handed
> off to RADOS.
> 
> The shim would also have to intercept read requests, fetch some small
> subset of the fragments, and recombine them.

This sounds more like a modification to the POSIX file system interface than to the RADOS object store, which knows nothing of files.

> Basically... what I am asking is...  where would be the best place to start
> looking at adding this:
> 	https://tahoe-lafs.org/trac/tahoe-lafs#
> 	
> (just the erasure coded part).
> 
> Here is the real rationale: extreme availability at only 1.3x or 1.6x
> redundancy:
> 
> 	http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114

The "extreme" reliability is a bit oversold. I worked on a project a decade ago that stored blocks of files over servers scattered around the globe. Each block was checksummed and optionally encrypted (they were not our servers, so we did not assume that we could trust the admins). To handle reliability, we implemented both replication (copies) and error coding (Reed-Solomon based erasure coding). There is a trade-off between the two.

Copies are nice since they require no extra computation, and the copying can be handled between servers so that the client only has to store once (which is what the Ceph file system does). Copies also allow you to load-balance reads over more servers (Ceph does not do this explicitly, but since the copies are placed pseudo-randomly they _should_ provide load balancing on average). With good CRUSH rules, copies also give better fault tolerance (e.g. if a rack goes down, pull from a copy on another rack). With N copies you can tolerate N-1 failures, and your total usable storage is 1/Nth of the raw capacity.
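To make that trade-off concrete, here is a tiny sketch of the arithmetic above (plain Python, purely illustrative, not Ceph code):

def replication(n_copies):
    # N total copies: worst case you can lose all but one copy,
    # and raw capacity is divided by the number of copies.
    return 1.0 / n_copies, n_copies - 1

usable, survives = replication(3)   # 3x replication as in the example
print(usable, survives)             # ~0.33 usable, survives 2 failures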

Erasure coding lets you tolerate a greater number of failures at the expense of computation and memory usage. You break a file into blocks (as mentioned in the video), and for each set of M data blocks (the coding set size) you create N coding blocks. In the video example, 1.3 corresponds to one coding block per three data blocks (M=3, N=1): you can lose any one of the four blocks and still recompute the original data from any M of the M+N blocks. A level of 1.6 is simply two coding blocks per three data blocks (M=3, N=2), which survives losing any two blocks. Three coding blocks per three data blocks (not mentioned in the video) survives three failures at the cost of 1/2 the raw capacity, which is clearly a win over simple 2-copy replication: the same overhead, but replication survives only one failure.
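For the simplest case (N=1) the coding block is just the XOR of the data blocks, which makes for a small self-contained illustration (plain Python sketch with a hypothetical 3+1 layout; real systems use Reed-Solomon so that N can be larger than 1):

# M=3 data blocks plus N=1 XOR parity block: survives any single loss.
data = [b"block-A.", b"block-B.", b"block-C."]   # equal-sized blocks

def xor_blocks(blocks):
    # XOR equal-sized blocks together, byte by byte.
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

parity = xor_blocks(data)        # the single coding block
stored = data + [parity]         # 4 pieces go to 4 independent devices

# Lose data block 1: XOR the three survivors to rebuild it.
rebuilt = xor_blocks([stored[0], stored[2], stored[3]])
assert rebuilt == data[1]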

The downside is that calculating the erasure code is not cheap, and it requires extra memory for the coding blocks until the set is complete. It is best to implement the coding at the client, since the client has all the data; the servers do not, and would have to copy the data over to whichever server performs the computation. It is possible to pipeline the storing of blocks and hopefully mask the cost, but it adds to the CPU requirements for normal usage (not to mention when handling failures). Also, if you need to read a block that is not available, you are no longer reading one block (e.g. 4 MB) but enough of the coding set to decode (M blocks of 4 MB each), which increases the network traffic M times.
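A quick back-of-the-envelope for that read penalty, using the 4 MB block size and the 1.3 (M=3) layout from above:

BLOCK_MB = 4
M = 3                                 # data blocks per coding set

healthy_read = BLOCK_MB               # fetch just the block you want
degraded_read = M * BLOCK_MB          # fetch enough of the set to decode
print(healthy_read, degraded_read)    # 4 MB vs 12 MB: an Mx traffic hit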

Erasure coding is no magic bullet. It has its uses, but it is complicated and increases computing resource requirements.

Scott


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

