Re: object versioning

Sage Weil <sweil@xxxxxxxxxx> · Thu, 14 Aug 2014 11:51:15 -0700 (PDT)

On Wed, 13 Aug 2014, Yehuda Sadeh wrote:
> One of the next features that we're working on is the long due object
> versioning. This basically allows keeping old versions of objects
> inside buckets, even if the user has removed or overwritten them. Any
> object instance is immutable, and an object can then be fetched by the
> version (instance) id of that object.
> When removing the object without specifying a version, a new deletion
> marker is created. It is, however, possible to remove a specific
> object version, and in this case the version is not accessible
> anymore. What complicates things is that if the current object's
> version (the one that is accessed when accessing the object without
> specifying a version) is removed, then the object will point at
> its previous version. Permissions are set on the object version level.
> 
> Another requirement is the ability to list all objects and versions of
> the objects. This means that when listing objects we either need to
> list only the current objects, or both the current objects and their
> respective versions.
> One thing to note is that object versioning needs to be switched on
> for the bucket for the feature to be activated, and once it's switched
> on it can only be suspended. This means that newly created objects
> will not be versioned, but old versions will still be accessible.
> 
> Let's sum up the functionality:
>  - ability to list objects and versions

Is this actually two things?

 1- The regular bucket list will include all versions of all objects.
 2- A new operation will list all versions of a given object.

Or would you just specify the prefix to be the object name and do the 
bucket list to get all versions of object foo?

>  - ability to read specific object version
>  - ability to remove a specific object version (*)
>  - object creation / overwrite creates a new object version, object
> points at new instance
>  - object removal does not remove object instance, creates a deletion marker
>  - (*) removal of the current object version rolls back object to
> point at previous object version
>  - permissions affect the object version and can be set on the versions

Throwing in a couple of goals here too:

 - a GET can still be serviced by going directly to librados objects, 
without consulting an index (and breaking read-side bucket scalability)
 - a bucket listing is still reasonably efficient (normally performed by 
consulting the index object only).

> Now, considering this functionality, it seems that we need to deal
> with 3 different entities:
>  - bucket index
>  - object instances (versions)
>  - object logical head (olh)
> 
> The first two can be mapped nicely into the already existing
> structures. The existing bucket index will be extended to keep the
> list of versions, and our current rgw objects will be used to handle
> the object instances, as they serve the same function.

I think there is one difference, though: before the head would be 
addressable by the object name, whereas here it is object name + 
tag/version... right?  So that the heads don't collide with other 
object versions.

> One of the options that we can consider for the object logical head is
> also to use a regular object that will just have a copy of the
> appropriate instance manifest. It doesn't seem that this will function
> as needed, as it doesn't satisfy the last requirement (permissions are
> set at the version level). What we do need to have is some sort of a
> soft link that will be used to point at the appropriate object
> instance.
> 
> We had internal discussions on how to make everything work together.
> There are a few things that we need to be careful about. We need to
> make sure that the bucket index listing reflects the status of the
> actual objects. When the olh points at a specific version, we
> shouldn't show a different view when listing the objects. This gets
> even more complicated when removing an object version that requires
> olh change, as we have 3 different entities that we need to sync. Note
> that rados does not have multi-object transactions (for now), and we
> traditionally avoided locking for rgw object operations.

(those 3 entities being the index, the object version, and the olh 
pointer)

> The current scheme is that we update the bucket index using a 2 phase
> commit, and it follows up on the objects state. So when adding /
> removing an object, we first tell the bucket index to 'prepare' for
> the operation, then do the operation, and eventually we let the bucket
> index know about the completion. For ordering we rely on the pg
> versioning system that gives us insight into the timeline, so that
> when two concurrent operations happen on the same object the bucket
> index can figure out who won and who is dead.
> This system as it is doesn't really work with versioning as we have
> both the olh, and the object instances. This is one of the solutions
> that we came up with:
> 
>  - The bucket index will be the source of the truth
>  - The bucket index will serve as an operational log for olh operations
> 
> The bucket index will index every object instance in reverse order
> (from new to old). The bucket index will keep entries for deletion
> markers.
> The bucket index will also keep an operations journal for olh
> modifications. Each operation in this journal will have an id that
> will be increased monotonically, and that will be tied into current
> olh version. The olh will be modified using idempotent operations that
> will be subject to having its current version smaller than the
> operation id.
> The journal will be used for keeping order, and the entries in the
> journal will serve as a blueprint that the gateways will need to
> follow when applying changes. In order to ensure that operations that
> needed to be complete were done, we'll mark the olh before going to
> the bucket index, so that if the gateway died before completing the
> operation, next time we try to access the object we'll know that we
> need to go to the bucket index and complete the operation.
> 
> Things will then work like this:

I take it there is also a:

* object read

1. look at olh
2. if marked as pending-modify,
   a. check index for current head version, and use that value
   b. if pending-modify is super old and no matching index entry exists, 
      remove marker
   c. if index entry does exist, send async op to roll-forward the olh
3. read referenced object version
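
A rough Python sketch of that read path, just to pin down the decision 
logic. Everything here is a stand-in: the olh and the index are plain 
dicts, and STALE_AFTER, resolve_read and the action names are made-up 
illustrations, not rgw code.

```python
import time

STALE_AFTER = 300.0  # seconds; hypothetical expiry for an abandoned mark


def resolve_read(olh, index, now=None):
    """Decide which version a read should use and what cleanup to queue.

    olh   -- {'head_version': str, 'pending': {tag: time_marked}}
    index -- {'head_version': str, 'journal': {tag: version}}
    Returns (version_to_read, follow_up_actions).
    """
    now = time.time() if now is None else now
    pending = olh.get("pending", {})

    # Step 1: no pending-modify marks, so the olh can be trusted as-is
    # and the index is never consulted.
    if not pending:
        return olh["head_version"], []

    # Step 2a: a mark is present, so take the head version from the index.
    version = index["head_version"]

    follow_up = []
    for tag, marked_at in pending.items():
        if tag in index["journal"]:
            # Step 2c: the index knows about this op; finish it asynchronously.
            follow_up.append(("roll_forward", tag))
        elif now - marked_at > STALE_AFTER:
            # Step 2b: old mark with no matching index entry; drop it.
            follow_up.append(("remove_marker", tag))

    # Step 3: the caller reads the object instance named by `version`.
    return version, follow_up
```

A young mark with no matching index entry is left alone on purpose, 
since the writer may simply not have reached the index yet.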

...and the 'roll-forward' on the olh would be something like

 cmpxattr pending-modify-$tag == 1
 cmpxattr olh_version == previous v
 setxattr olh_version = new v
 setxattr head_version = whatever
 rmxattr pending-modify-$tag

This has the side-effect that a hot object will briefly pummel the index.  
That is probably fine...
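
To make the guard semantics concrete, here is a minimal Python sketch of 
the roll-forward against an in-memory dict standing in for the olh's 
xattrs. It is not librados; roll_forward and GuardFailed are hypothetical 
names. The point is only that both compares are checked before anything 
is written, which is what a single compound op on one object gives you.

```python
class GuardFailed(Exception):
    """Raised when one of the cmpxattr-style guards does not hold."""


def roll_forward(xattrs, tag, prev_version, new_version, head_version):
    """Apply the five-step olh update to `xattrs`, all-or-nothing."""
    marker = f"pending-modify-{tag}"

    # The two cmpxattr guards: check everything before changing anything.
    if xattrs.get(marker) != 1:
        raise GuardFailed(f"{marker} is not set")
    if xattrs.get("olh_version") != prev_version:
        raise GuardFailed("olh_version is not the expected previous version")

    # The two setxattr steps and the rmxattr.
    xattrs["olh_version"] = new_version
    xattrs["head_version"] = head_version
    del xattrs[marker]


olh = {"pending-modify-abc": 1, "olh_version": 7, "head_version": "v7"}
roll_forward(olh, "abc", prev_version=7, new_version=8, head_version="v8")

# A replay of the same op (say, from a second gateway) fails the guard
# and leaves the olh untouched.
try:
    roll_forward(olh, "abc", prev_version=7, new_version=8, head_version="v8")
    replay_rejected = False
except GuardFailed:
    replay_rejected = True
```

Because the marker is removed in the same step that bumps olh_version, a 
late duplicate cannot move the olh backwards.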

> * object creation
> 
> 1. Create object instance

 is there a step 0 so that a failed rgw gets garbage collected?

> 2. Mark olh that it's about to be modified

 setxattr pending-modify-$tag=1

> 3. Update bucket index about new object instance

 omap_setkeys journal_$object_$olhversion_$tag = pending ?

> 4. Read bucket index object op journal
> 
> Note that the journal should have at this point an entry that says
> 'point olh to specific object version, subject to olh is at version
> X'.
> 
> 5. Apply journal ops

same as roll-forward event above?  unmark olh in the same op:

 cmpxattr pending-modify-$tag == 1
 cmpxattr olh_version == $olh_version_old
 setxattr olh_version = $olh_version_new
 setxattr head_version = whatever
 rmxattr pending-modify-$tag

> 6. Trim journal, unmark olh

Just trim the journal.

 call rgw.trim_journal($object, $olh_version_new)

...which can remove all prior journal entries too, since the olh is now at 
that version (or something higher).
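
A small Python sketch of what such a trim might do, with a dict standing 
in for the index object's omap. Keys are modelled as (object, olh_version, 
tag) tuples rather than the flattened journal_$object_$olhversion_$tag 
string, purely to sidestep key-encoding questions; trim_journal here is 
an illustration, not an existing rgw class method.

```python
def trim_journal(omap, obj, olh_version_new):
    """Drop every journal entry for `obj` at or below `olh_version_new`.

    omap -- dict keyed by (object, olh_version, tag), value = entry state
    Returns the number of entries removed.
    """
    # Collect first, then delete, so the dict is not mutated while iterating.
    doomed = [
        key for key in omap
        if key[0] == obj and key[1] <= olh_version_new
    ]
    for key in doomed:
        del omap[key]
    return len(doomed)


index_omap = {
    ("foo", 3, "a"): "pending",
    ("foo", 4, "b"): "pending",
    ("foo", 5, "c"): "pending",
    ("bar", 2, "d"): "pending",
}
removed = trim_journal(index_omap, "foo", 4)
```

Entries for a newer olh version, or for other objects, are left in place, 
and calling the trim a second time with the same arguments removes 
nothing.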

Am I on the right track?

sage

> * object removal (olh)
> 
> 1. Mark olh that it's about to be modified
> 2. Update bucket index about the new deletion marker
> 3. Read bucket index object op journal
> 
> The journal entry should say something like 'mark olh as removed,
> subject to olh is at version X'
> 
> 4. Apply ops
> 5. Trim journal, unmark olh
> 
> Another option is to actually remove the olh, but in this case we'll
> lose the olh versioning. We can in that case use the object
> non-existent state as a check, but that will not be enough as there
> are some corner cases where we could end up with the olh pointing at
> the wrong object.
> 
> * object version removal
> 
> 1. Mark olh as it will potentially be modified
> 2. Update bucket index about object instance removal
> 3. Read bucket index op journal
> 4. apply ops journal ...
> Now the journal might just say something like 'remove object
> instance', which means that the olh was pointing at a different object
> version. The more interesting case is when the olh is pointing at this
> specific object version. In this case the journal will say something
> like 'first point the olh at version V2, subject to olh is at version
> X. Now, remove object instance V1'.
> 
> 5. Trim journal, unmark olh
> 
> 
> Note about olh marking: The olh mark will create an attr on the olh
> that will have an id and a timestamp. There could be multiple marks on
> the olh, and the marks should have some expiration, so that operations
> that did not really start would be removed after a while.
> 
> 
> Let me know if that makes sense, or if you have any questions.
> 
> Thanks,
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 