Erasure encoding as a storage backend

Hi,

Here is an updated description of the "Erasure encoding as a storage backend" proposed implementation that will be discussed during the ceph summit ( http://wiki.ceph.com/01Planning/Developer_Summit#Schedule ). The "strip" and "stripe" terms are illustrated at http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Proposed_model . 

I am well aware of the shortcomings of this proposal and it would be great to get feedback before the ceph summit to address the most prominent issues.

Cheers

http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

	* PG and ReplicatedPG are reworked so that PG can be used as a base class for ErasureEncodedPG
		* Tests are written for ReplicatedPG to cover 100% of the LOC and most of the expected functionalities.
		* Code is reworked in PG and ReplicatedPG: code that is not specific to replication moves from ReplicatedPG to PG, and code that is not generic enough to be useful to ErasureEncodedPG moves from PG to ReplicatedPG.
	* To isolate Ceph from the actual library being used ( zfec, fecpp, ... ), a wrapper around the erasure encoding library is implemented. Each block is encoded into k data blocks and m parity blocks
		* encode(void* data, k, m) => void* data[k], void* parity[m]
		* decode(void* data[k], void* parity[m]) => void* data
		* repair(void* data[k], void* parity[m], indices_of_damaged_blocks[]) => void* data
	* The ErasureEncodedPG configuration is set to encode each object into k data objects and m parity objects.
		* It uses the parity ('INDEP') crush mode so that placement is intelligent. The indep placement avoids moving a shard between ranks: if osd.1 fails, a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something similar), so the shards on 2,3,4 won't need to be copied around.
		* The ErasureEncodedPG uses k + m OSDs, numbered D0 .. Dk-1 and C0 .. Cm-1
		* Each object is a strip
		* Each stripe has a fixed size of B bytes
	* ErasureEncodedPG implementation
		* Write offset, length
			* read the stripes containing offset, length
			* for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
			* modify the bufferlist with the write request
			* encode(void* data, k, m) => void* data[k], void* parity[m]
			* write data[0] to D0, data[1] to D1 ... data[k-1] to Dk-1 and parity[0] to C0 ... parity[m-1] to Cm-1
		* Read offset, length
			* read the stripes containing offset, length
			* for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
		* Object attributes
			* duplicate the object attributes on each OSD
		* Scrubbing
			* for each object, read each stripe and write back if a repair was necessary
		* Repair
			* when an OSD is decommissioned and another OSD replaces it, for each object contained in an ErasureEncodedPG using this OSD, read the object, repair each strip and write back the strip that resides on the new OSD
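
To make the encode / decode wrapper above more concrete, here is a minimal sketch in Python (chosen for brevity; the real wrapper would be C++ on top of zfec, fecpp or similar). It is not the proposed implementation: it uses a single XOR parity block, so it only works for m = 1 and tolerates the loss of one data block. The `length` argument to decode and the use of None to mark a lost block are assumptions of this sketch, not part of the signatures listed above.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data, k, m=1):
    """Split data into k data blocks and m parity blocks.

    Toy code: only m == 1 (a single XOR parity block) is supported.
    """
    if m != 1:
        raise ValueError("this XOR sketch only supports m == 1")
    block_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_size * k, b"\0")  # zero-pad to a multiple of k
    data_blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    return data_blocks, [xor_blocks(data_blocks)]


def decode(data_blocks, parity_blocks, length):
    """Rebuild the original bytes.

    At most one entry of data_blocks may be None (lost); it is rebuilt
    by XOR-ing the surviving data blocks with the parity block.
    length (an assumption of this sketch) strips the zero padding.
    """
    missing = [i for i, block in enumerate(data_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity block can rebuild only one block")
    blocks = list(data_blocks)
    if missing:
        survivors = [b for b in blocks if b is not None] + [parity_blocks[0]]
        blocks[missing[0]] = xor_blocks(survivors)
    return b"".join(blocks)[:length]
```

For example, encoding a 20 byte buffer with k = 4 gives four 5 byte data blocks and one parity block; setting any one data block to None and calling decode with the original length returns the original buffer.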

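The "Write offset, length" steps can be sketched as follows, assuming stripes of a fixed size B are indexed from 0 and cover consecutive byte ranges of the object. The names `stripes_for_range` and `write`, and the dict standing in for the OSDs, are hypothetical; the decode and encode calls around each stripe are left out so that only the read, modify, write back arithmetic is shown.

```python
def stripes_for_range(offset, length, stripe_size):
    """Indices of the stripes that contain bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return range(first, last + 1)


def write(stripes, offset, payload, stripe_size):
    """Read-modify-write of payload at offset.

    stripes is a dict mapping stripe index -> decoded stripe bytes;
    it stands in for the decode / encode round trip to the OSDs.
    Returns the list of stripe indices that were rewritten.
    """
    touched = stripes_for_range(offset, len(payload), stripe_size)
    if not touched:
        return []
    # Read: concatenate the touched stripes (absent stripes read as zeros).
    buf = bytearray()
    for index in touched:
        buf += stripes.get(index, bytes(stripe_size))
    # Modify: apply the write request at its position inside the buffer.
    start = offset - touched[0] * stripe_size
    buf[start:start + len(payload)] = payload
    # Write back: split the buffer into stripes again.
    for n, index in enumerate(touched):
        stripes[index] = bytes(buf[n * stripe_size:(n + 1) * stripe_size])
    return list(touched)
```

With B = 4, a 4 byte write at offset 6 touches stripes 1 and 2, which illustrates why a small write that straddles a stripe boundary costs two full stripe reads and two full stripe writes.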

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


