Hi all,

Excellent points all. Many thanks for the clarification.

*Tommi* - Yep... I wrote file but had object in mind. However... now that you bring up the distinction... I may actually mean file (see below). I don't know Zooko personally, but will definitely pass it on if I meet him!

*Scott* - Agreed. As to the performance... I am also in agreement. My thoughts were to make the operation lazy. By this, I mean that items that are not due to change much (think archive items) could be converted from N copies to an erasure-coded equivalent. The lazy piece of that would help reduce the processing overhead. The encoded items could also be decoded "on demand" back to N copies if they are accessed over a given threshold. This is not exactly tiered storage... but it has many of the same characteristics (a rough sketch of what I mean is below).

Sage... it may be that your second approach of having N storage pools and writing to them is the best approach. The reasoning is that I'm not sure RADOS would have any idea of "which" objects are candidates for lazy erasure coding. If it is done closer to the POSIX level, then files and directories that have not been accessed recently could become candidates for the coding.

My personal desire is to have this available for archiving large file-based datasets in an economical fashion. The files would be "generated" by commercial file-archiving software (with the source data contained on other systems) and would be stored on a Ceph cluster via either CephFS or an RBD device with a standard file system on it. Then, because of domain-specific knowledge about the data (i.e. it is archive data), I would know that much of it will probably never be touched again. IMHO, that is good candidate data for erasure coding.

One approach would be to have a standard CephFS mount configured with RADOS keeping N copies of the data. The second would be a new RScephFS (Reed-Solomon encoded) mount point (possibly using Sage's many-storage-pools approach). Then... using available tiering software, files could be "moved" from one mount point to the other based on some criteria. Pointers on the original mount point would make this basically invisible. If a file is accessed too many times, it can be "moved" back to the CephFS mount point. Would this require 2 clusters because of the need to have RADOS keep N copies on one and 1 copy on the other?
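To make the "lazy" piece concrete, here is a very rough sketch of the policy I have in mind. It is illustration only: the Item fields, the thresholds, and the demote()/promote() helpers are all made up, not an existing Ceph or RADOS interface. The point is just that the conversion happens in a low-priority background pass, never on the write path.

# Rough sketch of the "lazy" demote/promote policy -- illustration only,
# not an existing Ceph/RADOS interface.  demote() and promote() stand in
# for whatever mechanism would actually rewrite the data.

import time
from dataclasses import dataclass

COLD_AGE_SECS = 90 * 24 * 3600    # untouched for ~90 days -> candidate for coding
HOT_READS_PER_DAY = 10.0          # read this often -> decode back to N copies

@dataclass
class Item:
    name: str
    replicated: bool      # True = N full copies, False = erasure-coded fragments
    last_access: float    # unix timestamp of the last read
    reads_per_day: float  # recent access rate

def demote(item):
    """Placeholder: re-encode the item as erasure-coded fragments (~1.3x)."""
    item.replicated = False

def promote(item):
    """Placeholder: decode the fragments and keep N full copies again (3x)."""
    item.replicated = True

def lazy_scrub(items, now=None):
    """Low-priority background pass; converts items in either direction."""
    now = time.time() if now is None else now
    for item in items:
        if item.replicated and now - item.last_access > COLD_AGE_SECS:
            demote(item)      # archive data that has gone cold
        elif not item.replicated and item.reads_per_day > HOT_READS_PER_DAY:
            promote(item)     # coded data that turned out to be hot again

Whether a pass like that belongs inside RADOS or up at the file-system/tiering level is exactly the question above.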
I appreciate the discussion... it is helping me fashion what I'm really interested in...

- Steve

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxx]
Sent: Friday, August 24, 2012 11:43 AM
To: Stephen Perkins
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Best insertion point for storage shim

On Fri, 24 Aug 2012, Stephen Perkins wrote:
> Hi all,
>
> I'd like to get feedback from folks as to where the best place would
> be to insert a "shim" into the RADOS object storage.
>
> Currently, you can configure RADOS to use copy-based storage to store
> redundant copies of a file (I like 3 redundant copies, so I will use
> that as an example). So... each file is stored in three locations on
> independent hardware. The redundancy has a cost of 3x the storage.
>
> I would assume that it is "possible" to configure RADOS to store only
> 1 copy of a file (bear with me here).
>
> I'd like to see where it may be possible to insert a "shim" in the
> storage such that I can take the file to be stored and apply some
> erasure coding to it. Therefore, the file now becomes multiple files
> that are handed off to RADOS.
>
> The shim would also have to take file read requests, read some small
> portion of the fragments, and recombine them.
>
> Basically... what I am asking is... where would be the best place to
> start looking at adding this:
> https://tahoe-lafs.org/trac/tahoe-lafs#
>
> (just the erasure coded part).
>
> Here is the real rationale. Extreme availability at only 1.3 or 1.6
> times redundancy:
>
> http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114
>
> Thoughts appreciated,

The good news is that CRUSH has a mode that is intended for erasure/parity coding, and there are fields reserved in many ceph structures to support this type of thing.

The bad news is that in order to make it work, it needs to live inside of rados, not on top of it. The reason is that you need to separate the fragments across devices/failure domains/etc., which happens at the PG level; users of librados have no control over that (objects are randomly hashed into PGs, and then PGs are mapped to devices).

And in order to implement it properly, a lot of code shuffling and wading through OSD internals will be necessary. There are some basic abstractions in place, but they are largely ignored and need to be shifted around, because replication has been the only implementation for some time now.

I think the only way to layer this on top of rados and align your fragments with failure domains would be to create N different pools with distinct devices, and store one fragment in each pool...

sage
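To picture that closing "N pools, one fragment each" suggestion, here is a toy sketch of what such a shim could do. It is illustration only: in-memory dicts stand in for the N distinct pools (real code would write each fragment through librados, e.g. something like ioctx.write_full()), and a single XOR parity fragment stands in for a proper Reed-Solomon code such as zfec, the library behind Tahoe-LAFS. The shape is: stripe an object into K fragments plus parity, put one fragment in each pool, and on read rebuild a missing fragment if one pool/failure domain is down.

# Toy "one fragment per pool" shim.  Dicts stand in for real pools and a
# single XOR parity fragment stands in for real Reed-Solomon check blocks.

K = 4                                   # data fragments
POOLS = [dict() for _ in range(K + 1)]  # K data pools + 1 parity pool (stand-ins)

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def put(name: str, data: bytes) -> None:
    """Split an object into K fragments plus parity, one fragment per pool."""
    blob = len(data).to_bytes(8, "big") + data        # self-describing length
    frag_len = -(-len(blob) // K)                     # ceiling division
    blob = blob.ljust(frag_len * K, b"\0")
    frags = [blob[i * frag_len:(i + 1) * frag_len] for i in range(K)]
    parity = frags[0]
    for f in frags[1:]:
        parity = _xor(parity, f)
    for pool, frag in zip(POOLS, frags + [parity]):
        pool[name] = frag                             # a real shim would write to pool i here

def get(name: str) -> bytes:
    """Reassemble the object, tolerating the loss of any one pool."""
    frags = [pool.get(name) for pool in POOLS[:K]]
    if frags.count(None) == 1:                        # one data pool down: rebuild from parity
        missing = frags.index(None)
        rebuilt = POOLS[K][name]
        for i, f in enumerate(frags):
            if i != missing:
                rebuilt = _xor(rebuilt, f)
        frags[missing] = rebuilt
    blob = b"".join(frags)
    size = int.from_bytes(blob[:8], "big")
    return blob[8:8 + size]

With K = 4 that is 1.25x overhead and survives the loss of any one pool; a real k-of-n code (for example 10-of-13 is 1.3x, 10-of-16 is 1.6x) tolerates several lost fragments at the kind of overhead discussed above.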