For the record, Sam suggested today that the chunks of a stripe (an object, if we limit ourselves to full writes) be written without deleting the chunks from a previous version of the object.

For instance, object A1 contains "ABCDEFGHI": version 1 of the object is written as chunks "ABC", "DEF", "GHI" and parity "XYZ" on OSD1, OSD2, OSD3 and OSD4 respectively. Object A1 is then updated to "ABCDEF123": version 2 of the object is written as chunks "ABC", "DEF", "123" and parity "KLM" on OSD1, OSD2, OSD3 and OSD4 respectively. At some point OSD3 contains both "GHI" (chunk 3 of object A1 version 1) and "123" (chunk 3 of object A1 version 2).

When the PG receives an update of last_complete (which should happen when the PG becomes active), it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overwrites the previous one.

The trimming could be done together with pg_log trimming, but that would waste more disk space: the log size is 3000 by default, meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.

The object name does not currently contain the version number and this would need to be changed to avoid name clashes.

Cheers

On 29/06/2013 18:56, Loic Dachary wrote:
> Hi Sage,
>
> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me.
> A meaningful example is probably the logic associated with
>
> struct AccessMode {
>
> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
>
> I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk (i.e. probably in relation with the "ondisk" acknowledgment) before removing the previous version of the same stripe from all OSDs supporting it.
>
> The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
>
> Cheers
>
> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>> Hi Sage,
>>
>> Paraphrasing what you suggested today:
>>
>> The logic for writing a stripe (i.e. all the chunks created by the erasure encoding function for a given object, or for part of a given object if it exceeds the maximum size of a stripe) for a single object is going to be different from what we currently have for replicated objects. The object is consistent when all chunks (or at least K out of K+M) are committed to disk. It may make sense to start writing all the chunks in parallel and, when they are acknowledged, send a pg_log event that says: now switch to this new version of the object. This is to avoid ending up with some chunks belonging to one version of the object and other chunks belonging to another version, with no way to repair any of them.
>>
>> I will try to sketch the path for implementing the erasure coding (including the above) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>
>> Cheers
>>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
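As a postscript, the versioned-chunk idea from the top of this message can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not Ceph code: `ChunkStore`, `write` and `trim` are hypothetical names, chunks are keyed by (object name, version) to reflect the remark that the version would have to become part of the object name, and the rule of dropping a chunk only once a newer version at or below last_complete exists is one possible reading of the trimming step.

```python
class ChunkStore:
    """Hypothetical model of the chunks one OSD holds (not Ceph code)."""

    def __init__(self):
        # (object name, version) -> chunk content; the version is part of
        # the key so that two versions of the same chunk do not clash.
        self.chunks = {}

    def write(self, name, version, data):
        # A new version is stored next to the old one; nothing is overwritten.
        self.chunks[(name, version)] = data

    def trim(self, last_complete):
        # Assumption of this sketch: a chunk is dropped only when a newer
        # version of the same object exists at or below last_complete,
        # i.e. the version that replaces it is known to be complete.
        for name, version in list(self.chunks):
            superseded = any(
                other_name == name and version < other_version <= last_complete
                for other_name, other_version in self.chunks
            )
            if superseded:
                del self.chunks[(name, version)]


# OSD3 from the example: chunk 3 of object A1, versions 1 and 2.
osd3 = ChunkStore()
osd3.write("A1", 1, "GHI")
osd3.write("A1", 2, "123")
both_versions_present = len(osd3.chunks) == 2  # "GHI" and "123" coexist

# The PG learns that last_complete is 2, so version 1 can be discarded.
osd3.trim(last_complete=2)
```

After the trim only ("A1", 2) remains on the modelled OSD3. Calling trim with last_complete=1 instead would remove nothing, since version 2 is not yet known to be complete.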