Started developing a deduplication feature

Marcel Lauhoff <lauhoff@xxxxxxxxxxxx> · Fri, 1 Apr 2016 19:25:57 +0200

Hi Ceph,

deduplication has been discussed on the list a couple of times.
Over the next few months I'll be working on a prototype.

In short: a content-addressed pool, backed by a second pool that
acts as both storage and distributed fingerprint index.

Two pools: (1) pool that does the content addressing, (2) storage / index pool.

OSDs in the first pool re-address and chunk/reassemble objects.
They then store the new objects/chunks in the second pool.
The first pool uses a new PG backend ("CAS Backend"),
while the second can use replication or erasure coding.

The CAS backend computes fingerprints for incoming objects and
stores the fingerprint <-> original object name mapping.
It then forwards the data to the storage pool, addressing the objects by
fingerprint (the content-defined name).
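
A rough sketch of that write path, in Python rather than OSD C++, only to
illustrate the idea. Everything here is an assumption made for the example:
the class and method names are invented, SHA-256 stands in for a fingerprint
function that has not been chosen yet, and plain dicts stand in for the
mapping store and the storage pool.

```python
import hashlib


class CasPoolSketch:
    """Toy model of the two-pool design; not Ceph code."""

    def __init__(self):
        # original object name -> fingerprint (kept by the CAS backend)
        self.name_to_fp = {}
        # fingerprint -> data (stands in for the storage / index pool)
        self.storage = {}

    def write(self, name, data):
        # Fingerprint the incoming object (SHA-256 is an assumption).
        fp = hashlib.sha256(data).hexdigest()
        # Remember which fingerprint the original name refers to.
        self.name_to_fp[name] = fp
        # Forward the data, addressed by fingerprint; an already
        # present fingerprint is not stored a second time.
        self.storage.setdefault(fp, data)
        return fp

    def read(self, name):
        # Resolve the name to its fingerprint, then fetch by fingerprint.
        return self.storage[self.name_to_fp[name]]


pool = CasPoolSketch()
pool.write("vm-image-a", b"same bytes")
pool.write("vm-image-b", b"same bytes")
print(len(pool.name_to_fp), "names,", len(pool.storage), "stored object(s)")
```

Two names end up pointing at one stored object, which is the whole point of
addressing by content.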

The storage pool therefore serves as a distributed fingerprint index.
CRUSH selects the responsible OSDs, and each OSD knows which objects it holds.
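
To illustrate why placement by fingerprint doubles as an index lookup, here
is a small sketch. It does NOT implement CRUSH; it swaps in rendezvous
(highest-random-weight) hashing as a simple deterministic stand-in, and the
function name, OSD names and replica count are made up for the example.

```python
import hashlib


def pick_osds(fingerprint, osds, replicas=2):
    """Deterministically pick `replicas` OSDs for a fingerprint.

    Stand-in for CRUSH: rendezvous hashing, not the real algorithm.
    """
    def score(osd):
        # Each (osd, fingerprint) pair gets a pseudo-random score.
        return hashlib.sha256(f"{osd}:{fingerprint}".encode()).digest()

    # The highest-scoring OSDs are responsible for this fingerprint.
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
fp = hashlib.sha256(b"some chunk").hexdigest()
# The same fingerprint always maps to the same OSDs, so asking
# "is this fingerprint already stored?" only involves those OSDs.
print(pick_osds(fp, osds))
```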

Deduplication happens when two objects/chunks have the same
fingerprint.
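
At chunk level this could look roughly like the sketch below. The chunking
scheme is still open, so fixed-size chunks are an assumption here, as are
SHA-256, the tiny chunk size and all function names. A "recipe" in this
sketch is simply the ordered list of chunk fingerprints for an object.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, so the example data stays readable


def chunk(data, size=CHUNK_SIZE):
    # Fixed-size chunking (an assumption; the real scheme is undecided).
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_chunked(name, data, recipes, storage):
    fingerprints = []
    for piece in chunk(data):
        fp = hashlib.sha256(piece).hexdigest()
        # Identical chunks share a fingerprint and are stored only once.
        storage.setdefault(fp, piece)
        fingerprints.append(fp)
    # The recipe records how to rebuild the object from its chunks.
    recipes[name] = fingerprints


def reassemble(name, recipes, storage):
    return b"".join(storage[fp] for fp in recipes[name])


recipes, storage = {}, {}
store_chunked("obj-a", b"AAAABBBBAAAA", recipes, storage)
store_chunked("obj-b", b"BBBBCCCC", recipes, storage)
total_chunks = sum(len(r) for r in recipes.values())
print(total_chunks, "chunks written,", len(storage), "chunks stored")
```

Five chunks are written but only three distinct ones are stored, and both
objects can still be rebuilt from their recipes.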

My current milestones:
- Develop CAS backend, fingerprinting, recipe store
- Support a limited set of operations (like EC does)
- Support RBD (with/without cache) and evaluate
- Add chunking, garbage collection, ...

Currently I'm adding a new PG backend to the OSD code base. I'll
push the code to my GitHub clone as soon as it does "something" :)

~irq0

--
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx