Hi Marcel,

On Fri, 1 Apr 2016, Marcel Lauhoff wrote:
> Hi Ceph,
>
> deduplication has been discussed on the list a couple of times.
> Over the next months I'll be working on a prototype.
>
> In short: Use a content-addressed storage pool backed by a pool
> acting as storage and distributed fingerprint index.
>
> Two pools: (1) pool that does the content addressing, (2) storage /
> index pool.
>
> OSDs in the first pool readdress and chunk/reassemble objects.
> They then store the new objects/chunks in a second pool.

I think this is the right architecture for dedup in Ceph, and matches
the ideas we've been kicking around.

> The first pool uses a new PG backend ("CAS Backend"),
> while the second can use replication or erasure coding.
>
> The CAS backend computes fingerprints for incoming objects and
> stores the fingerprint <-> original object name mapping.
> It then forwards the data to a storage pool, addressing the objects
> by fingerprint (the content-defined name).
>
> The storage pool therefore serves as a distributed fingerprint index.
> CRUSH selects the responsible OSDs. The OSDs know their objects.
>
> Deduplication happens when two objects/chunks have the same
> fingerprint.

This is a little different, though. The plan so far has been to match
this up with the next stage of tiering. We'll add the ability for an
object to be a 'redirect' and store a bit of metadata indicating where
to look next. That might be as simple as "go look in this cold RADOS
pool over there," or a URL into another storage system (e.g., a tape
archive), or... a complicated mapping of bytes to CAS chunks in
another RADOS pool.

The original thought was that this would just be a regular
ReplicatedPG, not a new pool type. I haven't thought about what we'd
gain by having a new pool type. One thing we get by using the existing
pool is that we're not forced to do the demotion/dedup immediately--we
can just store the object normally, and dedup it later when we decide
it's cold.
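For reference, the content-addressed write path described above (fingerprint
the incoming object, record the name <-> fingerprint mapping, store the data
under its fingerprint so duplicate content collapses to a single chunk) can be
sketched as a toy model. All class and method names here are illustrative, not
actual Ceph/librados APIs; a real CAS backend would live inside the OSD:

```python
import hashlib


class CASPool:
    """Toy content-addressed store.  Two dicts stand in for the two
    pools: one holds chunks keyed by fingerprint (the distributed
    fingerprint index), one holds the name -> fingerprint mapping."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> data   (storage/index pool)
        self.recipes = {}   # object name -> fingerprint

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Content-defined name: a hash of the object's bytes.
        return hashlib.sha256(data).hexdigest()

    def write(self, name: str, data: bytes) -> str:
        fp = self.fingerprint(data)
        # Dedup happens here: if a chunk with this fingerprint already
        # exists, nothing new is stored -- only the mapping is recorded.
        self.chunks.setdefault(fp, data)
        self.recipes[name] = fp
        return fp

    def read(self, name: str) -> bytes:
        # Reassembly: resolve the name to a fingerprint, then fetch.
        return self.chunks[self.recipes[name]]


pool = CASPool()
a = pool.write("rbd_data.1", b"hello world")
b = pool.write("rbd_data.2", b"hello world")   # duplicate content
assert a == b and len(pool.chunks) == 1        # stored only once
```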
For the CAS pool, the idea would be to use the refcount class, or
something like it, so that you'd say "write object $hash" and if the
object already exists it'd increment the ref count. Similarly, when
you delete the logical object, you do a refcount 'put' on each chunk,
and the chunk would only go away when the last ref did too. (In
practice we need to be careful to avoid leaked refs in the case of
failures; this would probably be done by having a 'deduping' and
'deleting' state on the logical object and named references.)

> My current milestones:
>  - Develop CAS backend, fingerprinting, recipe store
>  - Support a limited set of operations (like EC does)
>  - Support RBD (with/without cache) and evaluate
>  - Add chunking, garbage collection, ...
>
> Currently I'm adding a new PG backend into the OSD code base. I'll
> push the code to my github clone as soon as it does "something" :)

This would be a good thing to discuss during the Ceph Developer
Monthly call next Wednesday:

	http://tracker.ceph.com/projects/ceph/wiki/Planning
	http://tracker.ceph.com/projects/ceph/wiki/CDM_06-APR-2016

sage
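P.S. The chunk lifecycle described above for the CAS pool ("write object
$hash" creates the chunk or bumps its refcount; deleting a logical object
'put's each chunk, which disappears only when the last reference does) can be
sketched as a toy model. Names are illustrative and this is not the actual
rados refcount object class:

```python
class RefcountedCAS:
    """Toy refcounted chunk store for a CAS pool."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> (data, refcount)

    def get(self, fp: str, data: bytes) -> None:
        # "write object $hash": create the chunk, or increment its
        # refcount if it already exists (that's the dedup hit).
        if fp in self.chunks:
            stored, refs = self.chunks[fp]
            self.chunks[fp] = (stored, refs + 1)
        else:
            self.chunks[fp] = (data, 1)

    def put(self, fp: str) -> None:
        # Deleting a logical object does a 'put' on each of its chunks;
        # the chunk itself only goes away with the last reference.
        data, refs = self.chunks[fp]
        if refs == 1:
            del self.chunks[fp]
        else:
            self.chunks[fp] = (data, refs - 1)


store = RefcountedCAS()
store.get("abc123", b"chunk")
store.get("abc123", b"chunk")   # second logical object, same chunk
store.put("abc123")             # first delete: chunk survives
assert "abc123" in store.chunks
store.put("abc123")             # last ref dropped: chunk removed
assert "abc123" not in store.chunks
```

Avoiding leaked refs on failure, as noted above, would additionally require
journaling a 'deduping'/'deleting' state per logical object so interrupted
get/put sequences can be replayed or rolled back.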