This looks interesting. Is there a repo we can look at with this implementation? Do you have any initial performance evaluations?

  -Joao

On 12/14/2016 10:57 AM, myoungwon oh wrote:
Hi Sage and Ceph developers,

I am a system software engineer working at SK Telecom. We are developing data deduplication in Ceph, based on discussions in the Ceph community and our own research. The current status is that we have implemented a prototype, and we would like to share our results with the community and get your feedback.

1. Motivation

We studied which deduplication structure is suitable for Ceph, a shared-nothing distributed storage system. We concluded that the most reasonable approach is to reuse the hash-based data distribution that the shared-nothing architecture already provides, combined with tiering. The reasons are as follows:

1- No change to the current structure is required, and no large additions are needed.
2- There is no need to develop and configure a separate deduplication metadata server.
3- Both EC and replication can be selected and used. Since deduplication is performed in the cache tier, the existing features of the layer below can be used as-is.
4- Data placement, load balancing and recovery do not need special consideration. Because the existing structure is used, they are handled as in stock Ceph; only the metadata part needs to be handled.

2. Design concept

The main design concept is double CRUSH + cache tier (http://imgur.com/9F3jQA6).

"double CRUSH"
Hash the incoming data and assign the hash value as a new OID, so CRUSH is applied once for the original OID and once for the fingerprint OID. For example: OID : DATA --> (hash) --> FP_OID : DATA (a rough sketch of this mapping follows section 3 below).

"cache tier"
Redirection is required for the new OID (the hash value). If we use the cache tier as currently implemented, the request can be received at the cache tier first and then redirected to the actual storage tier. Inline processing is currently possible using the proxy mode of the cache tier, and inline + post-processing is possible using the writeback mode of the cache tier.

3. Design detail (http://imgur.com/qUQ5e44)

"I/O flow (inline mode)" (see the second sketch below)
Write: write request -> fingerprint calculation -> search the lookup table (OID <-> fingerprint (FP) OID mapping) -> (if a mapping already exists) decrease the reference count of the old FP_OID through setattr via the Objecter -> increase the reference count through setxattr on the new FP_OID -> update the lookup table -> complete
Read: read request -> search the lookup table -> (if the data exists) request the data with the mapped FP_OID

"I/O flow (inline + post-processing)"
Write: a flush event occurs -> request COPY_FROM, dividing the object into fixed-size chunks -> update the lookup table
Read: promote_object requests a COPY_GET, with a chunk as the unit.

"chunking"
Currently 4K, 128K, 256K, and 512K are available as fixed chunk sizes.

"Lookup table"
The lookup table manages the OID-to-fingerprint mapping and the status (number of chunks, state) of each deduplicated object. The matching of OID, offset and fingerprint for each request is processed through this table.

"metadata recovery & replication"
Data recovery is handled by the existing Ceph structure, so additional implementation is needed only for deduplication metadata management. The only newly added data structure is the lookup table, and its reliability is secured by the existing replication structure: when the lookup table is updated, the update is also sent to the other OSDs that hold the same replicated table, so that their entries stay in sync.

"metadata cache"
The metadata & data cache is one of the important structures that determine deduplication performance. It is currently implemented with a simple LRU and LevelDB, but we are planning to develop improved algorithms and data structures.
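To make the "double CRUSH" mapping and the fixed-size chunking concrete, here is a minimal sketch in Python. It only illustrates how a chunk's contents can be hashed into a fingerprint OID; the SHA-1 fingerprint, the 512K default, and the helper names (split_chunks, fingerprint_oid) are illustrative assumptions, not the prototype's actual code.

import hashlib

CHUNK_SIZE = 512 * 1024  # one of the supported fixed chunk sizes (4K/128K/256K/512K)

def split_chunks(data, chunk_size=CHUNK_SIZE):
    """Split an object's data into fixed-size chunks (the last one may be short)."""
    return [data[off:off + chunk_size] for off in range(0, len(data), chunk_size)]

def fingerprint_oid(chunk):
    """Derive the new object name (FP_OID) from the chunk contents.
    Because the hash value itself becomes the OID, CRUSH is effectively applied
    twice: once for the original OID (cache tier) and once for the FP_OID
    (storage tier)."""
    return hashlib.sha1(chunk).hexdigest()

# OID : DATA --> (hash) --> FP_OID : DATA
data = b"example object contents" * 100000      # roughly 2.3 MB of sample data
for i, chunk in enumerate(split_chunks(data)):
    print("chunk %d -> FP_OID %s" % (i, fingerprint_oid(chunk)))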
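Along the same lines, here is a hedged sketch of the inline write/read flow from section 3, with the lookup table, reference counts and storage tier modelled as plain in-memory dictionaries. In the prototype these would correspond to the LevelDB-backed lookup table and the xattr-based reference counts updated via the Objecter; the class and method names here are invented purely for illustration.

import hashlib

class DedupTier:
    """Toy model of the inline dedup flow:
    lookup_table : OID -> FP_OID  (per object here; per chunk/offset in the design)
    refcount     : FP_OID -> number of OIDs referencing that chunk
    storage      : FP_OID -> chunk data (stands in for the storage tier)"""

    def __init__(self):
        self.lookup_table = {}
        self.refcount = {}
        self.storage = {}

    def write(self, oid, data):
        # write request -> fingerprint calculation
        fp_oid = hashlib.sha1(data).hexdigest()
        old_fp = self.lookup_table.get(oid)        # search lookup table
        if old_fp == fp_oid:
            return                                 # same contents already mapped
        if old_fp is not None:
            # decrease the old FP_OID's reference count (setattr via the Objecter)
            self.refcount[old_fp] -= 1
            if self.refcount[old_fp] == 0:
                del self.refcount[old_fp]
                del self.storage[old_fp]
        # increase the reference count of the new FP_OID (setxattr in the prototype)
        if fp_oid not in self.storage:
            self.storage[fp_oid] = data            # only unique chunks consume space
        self.refcount[fp_oid] = self.refcount.get(fp_oid, 0) + 1
        self.lookup_table[oid] = fp_oid            # update lookup table -> complete

    def read(self, oid):
        # read request -> search lookup table -> request data with the mapped FP_OID
        return self.storage[self.lookup_table[oid]]

tier = DedupTier()
tier.write("rbd_obj_1", b"A" * 4096)
tier.write("rbd_obj_2", b"A" * 4096)   # duplicate data: refcount becomes 2, nothing new stored
assert tier.read("rbd_obj_2") == b"A" * 4096
print("%d unique chunk(s) stored for 2 objects" % len(tier.storage))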
write) "Experiment 1 (http://imgur.com/Fdq7mWr)” X axis represents the number of stored OS images, and the Y axis represents the total storage capacity as the X axis value increases. (Images are based on Cent OS 7.0) In addition to being able to remove the same block from the same image (if the OS image number is 1), can see only an additional 50MB is stored if stores similar image. "Experiment 2 (http://imgur.com/wHGVNMq)" The actual data rate, is the data stored in the ceph cluster / data stored by the client. In case of Dedup ratio 20, the actual storage ratio is slightly higher because of ceph and dedup metadata, but in the remaining cases we can see it is close to ideal value. In performance perspective, we can see the performance become half, because of additional computations of Proxy structure and deduplication.(current implementation (inline) must redirect write message to storage tier. dedupe is done at storage tier. This can be fixed later) 5. Weak point Fragmentation issue occurs because we can't determine the chunk placement. Deduplication always has a fragmentation issue, but it can be worse because we can't involve in data placement. However, since we use flash device as main target, the performance of flash can be increased because chunks are striped. If not, it would be okay with using a relatively large chunk size(256K, 512K) Thanks. Regards -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html