Please see my previous mail (experiment 2). We will make a public repo after the source code is reworked based on community feedback. Thanks.

2016-12-15 20:07 GMT+09:00 Joao Eduardo Luis <joao@xxxxxxx>:
> This looks interesting.
>
> Is there a repo we can look at with this implementation?
>
> Do you have any initial performance evaluations?
>
> -Joao
>
>
> On 12/14/2016 10:57 AM, myoungwon oh wrote:
>>
>> Hi Sage, Ceph developers,
>>
>> I am a system software engineer working at SK Telecom.
>> We are developing data deduplication in Ceph, based on discussions in the
>> Ceph community and our own research. The current status is as below: we
>> have implemented a prototype, and we would like to share our results with
>> the Ceph community and get your feedback.
>>
>> 1. Motivation
>> We studied which deduplication structure is suitable for Ceph, a
>> distributed storage system with a shared-nothing architecture. As a
>> result, the most reasonable method seems to be to reuse the hash-based
>> data distribution that the shared-nothing architecture already provides,
>> together with tiering. The reasons are as follows:
>> 1- There is no change to the current structure, and no large additional
>> changes are required.
>> 2- There is no need to develop and configure a separate deduplication
>> metadata server.
>> 3- EC and replication can both be selected and used. That is, since
>> deduplication is performed in the cache tier, the existing features of
>> the layer below can be used as they are.
>> 4- There is no need to consider data placement, load balancing, or
>> recovery. Because the existing structure is used, these are handled as in
>> original Ceph, so only the metadata part needs to be handled.
>>
>> 2. Design concept
>> The main design concept is double CRUSH + cache tier.
>> (http://imgur.com/9F3jQA6)
>>
>> "double CRUSH"
>> Hash the incoming data and use that hash value (the fingerprint) as a new
>> OID. For example: OID : DATA --> (hash) --> FP_OID : DATA
>>
>> "cache tier"
>> Redirection is required for the new OID (hash value). If we use the cache
>> tier as currently implemented, we can receive the request at the cache
>> tier first and then redirect it to the actual storage tier. Inline
>> processing is possible using the proxy mode of the cache tier, and
>> inline + post-processing is possible using the writeback mode.
>>
>> 3. Design detail
>> (http://imgur.com/qUQ5e44)
>>
>> "I/O flow (inline mode)"
>> Write: write request -> fingerprint calculation -> search the lookup
>> table (OID <-> fingerprint (FP) OID mapping) -> (if a mapping exists)
>> decrease the reference count of the old FP_OID through setxattr via the
>> Objecter -> increase the reference count of the new FP_OID through
>> setxattr -> update the lookup table -> complete
>>
>> Read: read request -> search the lookup table -> (if a mapping exists)
>> request the data with the mapped FP_OID
>>
>> "I/O flow (inline + post-processing)"
>> Write: a flush event occurs -> issue COPY_FROM for each fixed-size chunk
>> of the object -> update the lookup table
>>
>> Read: promote_object requests a COPY_GET per chunk.
>>
>> "chunking"
>> Currently, fixed chunk sizes of 4K, 128K, 256K, and 512K are available.
>>
>> "Lookup table"
>> The lookup table manages the OID <-> fingerprint mapping and the status
>> (number of chunks, state) of each deduped object. The matching of OID,
>> offset, and fingerprint for each request is resolved through this table.
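
(Interjecting here to make the inline flow, chunking, and lookup table above more concrete for this thread: below is a rough, self-contained sketch in plain Python. The in-memory dicts and function names are simplified stand-ins, not the actual cache-tier / Objecter / xattr implementation described above.)

import hashlib

CHUNK_SIZE = 512 * 1024  # one of the supported fixed chunk sizes (4K/128K/256K/512K)

lookup_table = {}   # (oid, chunk_index) -> fp_oid   (stand-in for the lookup table)
ref_count = {}      # fp_oid -> reference count      (stand-in for xattr refcounts)
chunk_store = {}    # fp_oid -> chunk bytes          (stand-in for the storage tier)

def fingerprint_oid(chunk):
    # "double CRUSH": the chunk's hash becomes its new OID (FP_OID),
    # so CRUSH then places the chunk by its content.
    return hashlib.sha256(chunk).hexdigest()

def write(oid, data):
    # inline mode: fingerprint -> search lookup table -> adjust refcounts -> update table
    for index, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp_oid = fingerprint_oid(chunk)
        old_fp = lookup_table.get((oid, index))
        if old_fp == fp_oid:
            continue                          # same content, nothing to do
        if old_fp is not None:
            ref_count[old_fp] -= 1            # drop the reference to the old chunk
            if ref_count[old_fp] == 0:
                del chunk_store[old_fp]
                del ref_count[old_fp]
        if fp_oid not in chunk_store:
            chunk_store[fp_oid] = chunk       # first copy of this content
        ref_count[fp_oid] = ref_count.get(fp_oid, 0) + 1
        lookup_table[(oid, index)] = fp_oid

def read(oid):
    # read: look up the FP_OID of each chunk and fetch it
    chunks = []
    index = 0
    while (oid, index) in lookup_table:
        chunks.append(chunk_store[lookup_table[(oid, index)]])
        index += 1
    return b"".join(chunks)

# e.g. write("obj.A", data); write("obj.B", data) -> identical chunks are stored only once

In the actual prototype the reference counts are updated through setxattr on the FP_OID objects via the Objecter, as described in the I/O flow above.
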
>> "metadata recovery & replication"
>> Data recovery is done by the existing Ceph structure, so additional
>> implementation is needed only for deduplication metadata management. The
>> only newly added data structure is the lookup table. We can secure its
>> reliability by using the existing replication structure (when the lookup
>> table is updated, the update is also sent to the other OSDs holding the
>> replicated table, so that their entries stay in sync).
>>
>> "metadata cache"
>> The metadata & data cache is one of the important structures that
>> determine deduplication performance. Currently it is implemented with a
>> simple LRU and LevelDB, but we are planning to develop improved
>> algorithms and data structures.
>>
>> 4. Prototype evaluation (inline mode, 512K chunk, RBD, seq. write)
>>
>> "Experiment 1 (http://imgur.com/Fdq7mWr)"
>> The X axis represents the number of stored OS images, and the Y axis
>> represents the total storage capacity as the X-axis value increases.
>> (Images are based on CentOS 7.0.) In addition to removing duplicate
>> blocks within a single image (when the OS image number is 1), you can see
>> that only about 50MB of additional data is stored when a similar image is
>> added.
>>
>> "Experiment 2 (http://imgur.com/wHGVNMq)"
>> The actual storage ratio is the data stored in the Ceph cluster divided
>> by the data written by the client. For dedup ratio 20, the actual storage
>> ratio is slightly higher than ideal because of Ceph and dedup metadata,
>> but in the remaining cases it is close to the ideal value.
>> From a performance perspective, throughput drops to about half because of
>> the additional computation of the proxy structure and deduplication. (The
>> current inline implementation must redirect the write message to the
>> storage tier, where dedup is performed; this can be improved later.)
>>
>> 5. Weak point
>> A fragmentation issue occurs because we cannot determine chunk placement.
>> Deduplication always has a fragmentation issue, but it can be worse here
>> because we cannot influence data placement. However, since we use flash
>> devices as the main target, performance can even increase because chunks
>> are striped across them. If not, a relatively large chunk size (256K,
>> 512K) should be acceptable.
>>
>> Thanks.
>> Regards
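
One more note on the "metadata cache" part quoted above: at the moment it is just a simple LRU in front of LevelDB for the dedup metadata (e.g. lookup-table entries). Purely to illustrate the idea for this thread (plain Python, an in-memory dict instead of LevelDB, illustrative names, not the actual code):

from collections import OrderedDict

class MetadataCache:
    """Minimal LRU front-end for dedup metadata entries (illustrative only).
    In the prototype the backing store is LevelDB; here it is a plain dict."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.cache = OrderedDict()       # key -> fp_oid, kept in LRU order
        self.backing = backing_store     # stand-in for LevelDB

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        value = self.backing.get(key)    # miss: read through to the backing store
        if value is not None:
            self._insert(key, value)
        return value

    def put(self, key, value):
        self.backing[key] = value        # write through to the backing store
        self._insert(key, value)

    def _insert(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry

# e.g. cache = MetadataCache(capacity=100000, backing_store={})

The improved algorithms and data structures mentioned above would replace this simple LRU policy.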