Hi Ceph,

deduplication has been discussed on the list a couple of times. Over the next months I'll be working on a prototype.

In short: Use a content-addressed storage pool backed by a second pool that acts as storage and distributed fingerprint index.

Two pools:
(1) a pool that does the content addressing,
(2) a storage / index pool.

OSDs in the first pool readdress and chunk/reassemble objects. They then store the new objects/chunks in the second pool. The first pool uses a new PG backend ("CAS Backend"), while the second can use replication or erasure coding.

The CAS backend computes fingerprints for incoming objects and stores the fingerprint <-> original object name mapping. It then forwards the data to the storage pool, addressing the objects by fingerprint (the content-defined name).

The storage pool therefore serves as a distributed fingerprint index: CRUSH selects the responsible OSDs, and the OSDs know which objects they hold. Deduplication happens when two objects/chunks have the same fingerprint.

My current milestones:
- Develop CAS backend, fingerprinting, recipes store
- Support limited set of operations (like EC does)
- Support RBD (with/without Cache) and evaluate
- Add Chunking, Garbage Collection, ..

Currently I'm adding a new PG backend to the OSD code base. I'll push the code to my GitHub clone as soon as it does "something" :)

~irq0

-- 
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx
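
To illustrate the write path described above, here is a minimal sketch in Python. It is not the prototype code. The class names (StoragePool, CasPool), the use of SHA-256 as the fingerprint function, and the in-memory dicts standing in for the two pools are all assumptions made for illustration. Chunking, CRUSH placement and garbage collection are left out.

```python
import hashlib


class StoragePool:
    """Hypothetical stand-in for the storage / index pool.

    Objects are addressed by fingerprint, so looking up a fingerprint
    doubles as the index lookup.
    """

    def __init__(self):
        self.objects = {}  # fingerprint -> data

    def put(self, fingerprint, data):
        # Return True if an object with this fingerprint already exists,
        # i.e. the write was deduplicated and no new data was stored.
        if fingerprint in self.objects:
            return True
        self.objects[fingerprint] = data
        return False

    def get(self, fingerprint):
        return self.objects[fingerprint]


class CasPool:
    """Hypothetical stand-in for the pool using the CAS backend."""

    def __init__(self, storage):
        self.storage = storage
        self.names = {}  # original object name -> fingerprint

    def write(self, name, data):
        # Fingerprint function is an assumption; the post does not name one.
        fingerprint = hashlib.sha256(data).hexdigest()
        # Remember which fingerprint the original object name maps to.
        self.names[name] = fingerprint
        # Forward the data, addressed by its content-defined name.
        return self.storage.put(fingerprint, data)

    def read(self, name):
        return self.storage.get(self.names[name])


storage = StoragePool()
cas = CasPool(storage)

first_was_dup = cas.write("vm1.img", b"same bytes")
second_was_dup = cas.write("vm2.img", b"same bytes")
cas.write("vm3.img", b"other bytes")

# Three object names were written, but identical content is stored once.
print(len(cas.names), "names,", len(storage.objects), "stored objects")
```

Two names with identical content end up pointing at one stored object, which is the deduplication case from the description. A real implementation would also need reference counting or a similar scheme before anything can be deleted, which is where the garbage collection milestone comes in.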