Re: Best insertion point for storage shim

On Fri, 24 Aug 2012, Stephen Perkins wrote:
> Hi all,
> 
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
> 
> Currently, you can configure RADOS to use copy based storage to store
> redundant copies of a file (I like 3 redundant copies so I will use that as
> an example).  So... each file is stored in three locations on independent
> hardware.   The redundancy has a cost of 3x the storage.
> 
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).
> 
> I'd like to see where it may be possible to insert a "shim" in the storage
> such that I can take the file to be stored and apply some erasure coding to
> it. Therefore, the file now becomes multiple files that are handed off to
> RADOS.  
> 
> The shim would also have to take read file requests and read some small
> portion of the fragments and recombine.
> 
> Basically... what I am asking is...  where would be the best place to start
> looking at adding this:
> 	https://tahoe-lafs.org/trac/tahoe-lafs#
> 	
> (just the erasure coded part).
> 
> Here is the real rationale.  Extreme availability at only 1.3 or 1.6 times
> redundancy:
> 
> 	http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114
> 
> Thoughts appreciated,

The good news is that CRUSH has a mode that is intended for erasure/parity 
coding, and there are fields reserved in many Ceph structures to support 
this type of thing.  The bad news is that in order to make it work it 
needs to live inside of RADOS, not on top of it.  The reason is that you 
need to separate the fragments across devices/failure domains/etc, which 
happens at the PG level; users of librados have no control over that 
(objects are pseudo-randomly hashed into PGs, and then PGs are mapped to 
devices).
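
To make the two-step placement concrete, here is a toy sketch in Python.
It is not the real Ceph hash or the real CRUSH algorithm: stable_hash,
object_to_pg, pg_to_devices and the host/device names are all made up for
illustration, and the "pick one device on each of the top-ranked hosts"
rule only stands in for a CRUSH rule that separates failure domains.

```python
import hashlib
from typing import Dict, List


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not the hash Ceph uses)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(name: str, pg_num: int) -> int:
    """Step 1: the object name alone decides the PG."""
    return stable_hash(name) % pg_num


def pg_to_devices(pg: int, hosts: Dict[str, List[str]], width: int) -> List[str]:
    """Step 2: map a PG to `width` devices, one per distinct host.

    Hosts are ranked by a per-PG hash; the first `width` hosts each
    contribute one device. This is a simplification of what CRUSH does.
    """
    ranked = sorted(hosts, key=lambda host: stable_hash(f"{pg}:{host}"))
    chosen = []
    for host in ranked[:width]:
        devices = hosts[host]
        chosen.append(devices[stable_hash(f"{pg}:{host}:dev") % len(devices)])
    return chosen


# Hypothetical cluster: four hosts with two devices each.
HOSTS = {
    "host-a": ["osd.0", "osd.1"],
    "host-b": ["osd.2", "osd.3"],
    "host-c": ["osd.4", "osd.5"],
    "host-d": ["osd.6", "osd.7"],
}

if __name__ == "__main__":
    # A client-side shim can only choose names. Each fragment name hashes
    # to its own PG, and nothing stops two of those PGs from sharing a device.
    for frag in ("file.frag0", "file.frag1", "file.frag2"):
        pg = object_to_pg(frag, pg_num=64)
        print(frag, "-> pg", pg, "->", pg_to_devices(pg, HOSTS, width=1))
```

The separation across hosts is enforced inside pg_to_devices, i.e. per PG.
A shim sitting above this only gets to pick the object name, so it cannot
guarantee that the fragments of one file end up in different failure
domains.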

And in order to implement it properly, a lot of code shuffling and wading 
through OSD internals will be necessary.  There are some basic 
abstractions in place, but they are largely ignored and need to be shifted 
around because replication has been the only implementation for some time 
now.

I think the only way to layer this on top of RADOS and align your 
fragments with failure domains would be to create N different pools with 
distinct devices, and store one fragment in each pool...
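
A rough sketch of what that layering could look like, with plain Python
dicts standing in for the N pools (in a real setup each would be a
librados handle on a pool restricted to its own set of devices). To keep
it short it uses single XOR parity, k data fragments plus one parity
fragment, rather than a real erasure code such as Reed-Solomon; encode,
decode, put and get are hypothetical names, not any existing API.

```python
from functools import reduce
from typing import Dict, List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments and append one XOR parity fragment."""
    # Prefix the original length so zero padding can be stripped on read.
    payload = len(data).to_bytes(8, "big") + data
    size = -(-len(payload) // k)  # ceiling division
    payload = payload.ljust(size * k, b"\0")
    chunks = [payload[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def decode(fragments: List[Optional[bytes]], k: int) -> bytes:
    """Rebuild the data from k+1 fragments, at most one of which is None."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one fragment")
    fragments = list(fragments)
    if missing:
        present = [frag for frag in fragments if frag is not None]
        # XOR of all surviving fragments equals the lost one.
        fragments[missing[0]] = reduce(xor_bytes, present)
    payload = b"".join(fragments[:k])
    length = int.from_bytes(payload[:8], "big")
    return payload[8:8 + length]


def put(pools: List[Dict[str, bytes]], name: str, data: bytes) -> None:
    """Store fragment i of the object in pool i (last pool holds parity)."""
    for pool, frag in zip(pools, encode(data, len(pools) - 1)):
        pool[name] = frag


def get(pools: List[Dict[str, bytes]], name: str) -> bytes:
    """Read whatever fragments are available and recombine them."""
    return decode([pool.get(name) for pool in pools], len(pools) - 1)


if __name__ == "__main__":
    pools = [dict() for _ in range(4)]   # 3 data pools + 1 parity pool
    put(pools, "myfile", b"some file contents")
    del pools[1]["myfile"]               # simulate losing one pool's copy
    assert get(pools, "myfile") == b"some file contents"
```

With 3 data pools and 1 parity pool the space overhead is 4/3, about 1.33x,
but this toy scheme survives the loss of only one fragment. Tolerating more
failures at a similar overhead needs a proper k-of-n code, and every read
and write has to touch several pools.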

sage