Re: Erasure coding implementation : high level description

Loic Dachary <loic@xxxxxxxxxxx> · Fri, 05 Jul 2013 13:56:11 +0200

Hi Sage,

I wrote down my understanding of what you suggest in the "Interrupted append" part of

https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst

Did I miss something?

Cheers

On 02/07/2013 05:52, Sage Weil wrote:
> We had a chat about this this afternoon and another idea came up: support 
> only object append in full-stripe writes.  The primary would log the write 
> (as per usual), along with the write offset and length.  Each shard 
> processes its piece and extends the file. If there is a failure and the pg 
> log gets rolled back, we simply truncate off the incompletely 
> written/committed stripe from each shard.
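
To make the rollback-by-truncate idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration only: the class names `Shard` and `StripedObject`, the `fail_after` knob that simulates an interrupted write, and the single XOR parity shard standing in for a real erasure code. It is not how the OSD code is or will be structured.

```python
class Shard:
    """One shard's file, modelled as an in-memory byte buffer."""

    def __init__(self):
        self.data = bytearray()


class StripedObject:
    """Append-only object striped over k data shards plus one XOR parity shard."""

    def __init__(self, k, chunk_size):
        self.k = k
        self.chunk_size = chunk_size
        self.shards = [Shard() for _ in range(k + 1)]
        self.log = []   # (offset, length) of each logged append
        self.size = 0   # logical size, counting fully written stripes only

    def append(self, stripe, fail_after=None):
        """Append one full stripe; fail_after=N simulates a crash after N shards."""
        if len(stripe) != self.k * self.chunk_size:
            raise ValueError("only full-stripe appends are supported")
        # Log the write with its offset and length before touching the shards.
        self.log.append((self.size, len(stripe)))
        chunks = [stripe[i * self.chunk_size:(i + 1) * self.chunk_size]
                  for i in range(self.k)]
        parity = bytes(self.chunk_size)
        for chunk in chunks:
            parity = bytes(a ^ b for a, b in zip(parity, chunk))
        for index, chunk in enumerate(chunks + [parity]):
            if fail_after is not None and index >= fail_after:
                return False  # interrupted: shards now have different lengths
            self.shards[index].data.extend(chunk)
        self.size += len(stripe)
        return True

    def rollback(self):
        """Undo the last logged append by truncating every shard."""
        offset, _length = self.log.pop()
        shard_length = offset // self.k  # each shard holds 1/k of the data
        for shard in self.shards:
            del shard.data[shard_length:]
        self.size = offset
```

After an interrupted `append`, calling `rollback()` brings every shard back to the same length, whether or not that shard had already received its piece of the stripe.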
> 
> This is more limited (no overwrites yet, append-only) but captures the 
> most important use-cases, it's super simple, and it's efficient.  It's 
> also simple enough that I don't think it commits us in any particular 
> direction if/when we later want to do per-stripe overwrites.
> 
> One thing it does bring up, though, is how the stripe size is determined.  
> I suggest that it is specified by the writer on object creation (since the 
> writer is responsible for writing in stripe-aligned chunks) and is 
> recorded as immutable per-object metadata.  Maybe there is a per-pool 
> property to inform clients, but that is mostly just policy...
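
As a rough illustration of "immutable per-object metadata", the sketch below freezes the stripe size at creation and rejects appends that are not whole stripes. The names `ObjectInfo` and `stripes_in_append` are hypothetical and do not correspond to existing Ceph structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectInfo:
    """Per-object metadata; frozen so stripe_size cannot change after creation."""
    name: str
    stripe_size: int


def stripes_in_append(info, length):
    """Return how many full stripes an append covers, or raise if misaligned."""
    if length <= 0 or length % info.stripe_size != 0:
        raise ValueError("append length must be a positive multiple of stripe_size")
    return length // info.stripe_size
```

The writer picks `stripe_size` once when creating the object. A per-pool default would only be a hint used to choose that value.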
> 
> sage
> 
> 
> 
> 
> On Mon, 1 Jul 2013, Loic Dachary wrote:
> 
>> For the record,
>>
>> Sam suggested today that the chunks of a stripe ( an object if we limit ourselves to full writes ) be written without deleting the chunks from a previous version of the object. For instance :
>>
>> object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks "ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
>> object A1 is updated to "ABCDEF123" => version 2 of the object is written as chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 respectively.
>>
>> At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and "123" ( chunk 3 object A1 version 2 ).
>>
>> When the PG receives an update of last_complete ( which should happen when the PG becomes active ) it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overwrites the previous one. The trimming could be done together with pg_log trimming, but that would waste more disk space: the log size is 3000 by default, meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.
>>
>> The object name does not currently contain the version number and this would need to be changed to avoid name clashes.
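
The versioned-chunk idea can be sketched as follows. `ChunkStore` is a made-up name, and the sketch adds one assumption that the text above does not state: an old chunk is only discarded once a newer version of the same object exists at or below last_complete, so an object that was never rewritten is not lost.

```python
class ChunkStore:
    """Chunks held by one OSD, keyed by (object name, version) to avoid clashes."""

    def __init__(self):
        self.chunks = {}

    def write(self, name, version, chunk):
        # A new version is stored next to the old one instead of replacing it.
        self.chunks[(name, version)] = chunk

    def trim(self, last_complete):
        """Delete chunks superseded by a version at or below last_complete."""
        newest = {}
        for name, version in self.chunks:
            if version <= last_complete:
                newest[name] = max(version, newest.get(name, version))
        stale = sorted(key for key in self.chunks
                       if key[1] < newest.get(key[0], key[1]))
        for key in stale:
            del self.chunks[key]
        return stale
```

With the OSD3 example above, writing "GHI" as version 1 and "123" as version 2 leaves both on disk until `trim(2)` is called, at which point only version 1 is removed.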
>>
>> Cheers
>>
>> On 29/06/2013 18:56, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningful example is probably the logic associated with 
>>>
>>> struct AccessMode {
>>>
>>> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
>>>
>>> I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
>>>
>>> The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
>>>
>>> Cheers
>>>
>>> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> Paraphrasing what you suggested today : 
>>>>
>>>> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object, or for part of a given object if it exceeds the maximum size of a stripe ) will differ from what we currently have for replicated objects. The object is consistent when all chunks ( or at least K of the K+M chunks ) are committed to disk. It may make sense to start writing all the chunks in parallel and, when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. This avoids ending up with some chunks belonging to one version of the object and other chunks belonging to another version, a state in which we can't repair any of them.
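
A toy version of "write all chunks in parallel, then switch" might look like this. `FakeOSD`, `write_stripe` and the `required` parameter are assumptions made for the sketch, and the pg_log is reduced to a plain list. The real acknowledgment path in the OSD is more involved.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeOSD:
    """Stand-in for an OSD: stores chunks keyed by (object name, version)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.chunks = {}

    def write(self, name, version, chunk):
        if self.fail:
            raise OSError("simulated write failure")
        self.chunks[(name, version)] = chunk


def write_stripe(osds, name, version, chunks, required, pg_log):
    """Write one chunk per OSD in parallel; log the switch only if enough ack."""

    def write_one(pair):
        osd, chunk = pair
        try:
            osd.write(name, version, chunk)
            return True
        except OSError:
            return False

    with ThreadPoolExecutor(max_workers=len(osds)) as pool:
        acks = list(pool.map(write_one, zip(osds, chunks)))
    if sum(acks) < required:
        return False  # not enough chunks committed: do not switch versions
    pg_log.append((name, version))  # "now switch to this new version"
    return True
```

Passing `required=len(osds)` demands every chunk, while `required=K` accepts the stripe as soon as it is recoverable. When the call returns False, chunks of the abandoned version may still sit on some OSDs and would need to be cleaned up separately.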
>>>>
>>>> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>>>
>>>> Cheers
>>>>
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
