Started developing a deduplication feature

Hi Ceph,

Deduplication has been discussed on the list a couple of times.
Over the next few months I'll be working on a prototype.

In short: use a content-addressed storage pool backed by a second pool
that acts as both storage and distributed fingerprint index.



Two pools: (1) a pool that does the content addressing, (2) a storage/index pool.

OSDs in the first pool readdress and chunk/reassemble objects.
They then store the new objects/chunks in a second pool.
The first pool uses a new PG backend ("CAS Backend"),
while the second can use replication or erasure coding.

The CAS backend computes fingerprints for incoming objects and
stores the fingerprint <-> original object name mapping.
It then forwards the data to a storage pool, addressing the objects by
fingerprint (the content defined name).
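
The write path described above could be sketched roughly like this. It is
only an illustration in Python, not the planned OSD code: the class names
StoragePool and CasBackend, SHA-256 as the fingerprint function, and the
in-memory dicts are all assumptions made for the example.

```python
import hashlib


class StoragePool:
    """Hypothetical stand-in for the second pool: objects keyed by fingerprint."""

    def __init__(self):
        self.objects = {}

    def put(self, fingerprint, data):
        # Writing identical content twice lands on the same key, so it is kept once.
        self.objects.setdefault(fingerprint, data)

    def get(self, fingerprint):
        return self.objects[fingerprint]


class CasBackend:
    """Hypothetical stand-in for the CAS backend of the first pool."""

    def __init__(self, storage_pool):
        self.storage_pool = storage_pool
        # Only the name -> fingerprint direction of the mapping is kept here.
        self.name_to_fingerprint = {}

    def write(self, name, data):
        # SHA-256 is an assumed fingerprint function, not a design decision.
        fingerprint = hashlib.sha256(data).hexdigest()
        self.name_to_fingerprint[name] = fingerprint
        # Forward the data, addressed by its content-defined name.
        self.storage_pool.put(fingerprint, data)
        return fingerprint

    def read(self, name):
        return self.storage_pool.get(self.name_to_fingerprint[name])
```

Writing the same bytes under two different names would leave two entries in
the name mapping but only one object in the storage pool.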

The storage pool therefore serves as a distributed fingerprint index.
CRUSH selects the responsible OSDs. The OSDs know their objects.
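
To illustrate why no separate index lookup is needed, here is a small Python
sketch of deterministic placement. It does NOT implement CRUSH; it uses
rendezvous (highest-random-weight) hashing as a much simpler stand-in, and
the function names, the OSD ids, and the replica count are assumptions.

```python
import hashlib


def score(fingerprint, osd_id):
    """Pseudo-random but repeatable weight for a (fingerprint, OSD) pair."""
    digest = hashlib.sha256(f"{fingerprint}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def responsible_osds(fingerprint, osd_ids, replicas=2):
    """Pick `replicas` OSDs from the fingerprint alone (stand-in for CRUSH)."""
    ranked = sorted(osd_ids, key=lambda osd: score(fingerprint, osd), reverse=True)
    return ranked[:replicas]
```

Anyone who computes the same fingerprint arrives at the same OSDs, so asking
"does this content already exist?" amounts to asking those OSDs whether they
hold an object with that name.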

Deduplication happens when two objects/chunks have the same
fingerprint.
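
A minimal Python sketch of chunk-level deduplication follows. It assumes
fixed-size chunking, SHA-256 fingerprints, and a plain dict as the chunk
store; the names store_object and load_object and the tiny CHUNK_SIZE are
made up for the example and say nothing about the eventual chunking scheme.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny, for illustration only


def store_object(data, chunk_store):
    """Split data into fixed-size chunks, store each by fingerprint, return the recipe."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # A chunk whose fingerprint is already present is not stored again.
        chunk_store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return recipe


def load_object(recipe, chunk_store):
    """Reassemble an object from its recipe (the ordered list of fingerprints)."""
    return b"".join(chunk_store[fingerprint] for fingerprint in recipe)
```

Storing b"AAAABBBBCCCC" and then b"AAAABBBBDDDD" this way would leave four
chunks in the store rather than six, because the first two chunks of both
objects share fingerprints.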

My current milestones:
- Develop CAS backend, fingerprinting, recipe store
- Support limited set of operations (like EC does)
- Support RBD (with/without Cache) and evaluate
- Add chunking, garbage collection, ...

Currently I'm adding a new PG backend to the OSD code base. I'll
push the code to my GitHub clone as soon as it does "something" :)

~irq0

--
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx