On Wed, 13 Aug 2014, Yehuda Sadeh wrote:
> One of the next features that we're working on is the long due object
> versioning. This basically allows keeping old versions of objects
> inside buckets, even if the user has removed or overwritten them. Any
> object instance is immutable, and an object can then be fetched by the
> version (instance) id of that object.
> When removing the object without specifying a version, a new deletion
> marker is created. It is, however, possible to remove a specific
> object version, and in this case the version is not accessible
> anymore. What complicates things is that if the current object's
> version (the one that is accessed when accessing the object without
> specifying a version) is removed, then the object will then point at
> its previous version. Permissions are set on the object version level.
>
> Another requirement is the ability to list all objects and versions of
> the objects. This means that when listing objects we either need to
> list only the current objects, or both the current objects and their
> respective versions.
> One thing to note is that object versioning needs to be switched on
> for the bucket for the feature to be activated, and once it's switched
> on it can only be suspended. This means that newly created objects
> will not be versioned, but old versions will still be accessible.
>
> Let's sum up the functionality:
> - ability to list objects and versions

Is this actually two things?

1- The regular bucket list will include all versions of all objects.
2- A new operation will list all versions of a given object.

Or would you just specify the prefix to be the object name and do the
bucket list to get all versions of object foo?

> - ability to read specific object version
> - ability to remove a specific object version (*)
> - object creation / overwrite creates a new object version, object
> points at new instance
> - object removal does not remove object instance, creates a deletion marker
> - (*) removal of the current object version rolls back object to
> point at previous object version
> - permissions affect the object version and can be set on the versions

Throwing in a couple of goals here too:

- a GET can still be serviced by going directly to librados objects,
  without consulting an index (which would break read-side bucket
  scalability)
- a bucket listing is still reasonably efficient (normally performed by
  consulting the index object only).

> Now, considering this functionality, it seems that we need to deal
> with 3 different entities:
> - bucket index
> - object instances (versions)
> - object logical head (olh)
>
> The first two can be mapped nicely into the already existing
> structures. The existing bucket index will be extended to keep the
> list of versions, and our current rgw objects will be used to handle
> the object instances, as they serve the same function.

I think there is one difference, though: before the head would be
addressable by the object name, whereas here it is object name +
tag/version... right?  So that the heads don't collide with other object
versions.

> One of the options that we can consider for the object logical head is
> also to use a regular object that will just have a copy of the
> appropriate instance manifest. It doesn't seem that this will function
> as needed, as it doesn't satisfy the last requirement (permissions are
> set at the version level). What we do need to have is some sort of a
> soft link that will be used to point at the appropriate object
> instance.
>
> We had internal discussions on how to make everything work together.
> There are a few things that we need to be careful about.
> We need to make sure that the bucket index listing reflects the
> status of the actual objects. When the olh points at a specific
> version, we shouldn't show a different view when listing the objects.
> This gets even more complicated when removing an object version that
> requires olh change, as we have 3 different entities that we need to
> sync. Note that rados does not have multi-object transactions (for
> now), and we traditionally avoided locking for rgw object operations.

(those 3 entities being the index, the object version, and the olh
pointer)

> The current scheme is that we update the bucket index using a 2 phase
> commit, and it follows up on the object's state. So when adding /
> removing an object, we first tell the bucket index to 'prepare' for
> the operation, then do the operation, and eventually we let the bucket
> index know about the completion. For ordering we rely on the pg
> versioning system that gives us insight into the timeline, so that
> when two concurrent operations happen on the same object the bucket
> index can figure out who won and who is dead.
> This system as it is doesn't really work with versioning as we have
> both the olh, and the object instances. This is one of the solutions
> that we came up with:
>
> - The bucket index will be the source of the truth
> - The bucket index will serve as an operational log for olh operations
>
> The bucket index will index every object instance in reverse order
> (from new to old). The bucket index will keep entries for deletion
> markers.
> The bucket index will also keep operations journal for olh
> modifications. Each operation in this journal will have an id that
> will be increased monotonically, and that will be tied into current
> olh version. The olh will be modified using idempotent operations that
> will be subject to having its current version smaller than the
> operation id.
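
The guarded update described in the paragraph above (each journal op carries a monotonically increasing id, and the olh accepts an op only while its own version is smaller than that id) can be sketched roughly as follows. This is an in-memory illustration only: JournalOp, Olh and BucketIndexJournal are hypothetical names invented for the example, not rgw or librados interfaces, and treating the op id as the new olh version is an assumption about how the two are "tied" together.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JournalOp:
    op_id: int              # monotonically increasing journal id (assumed to become the olh version)
    target: Optional[str]   # instance id the olh should point at; None stands for a deletion marker


@dataclass
class Olh:
    version: int = 0
    head: Optional[str] = None

    def apply(self, op: JournalOp) -> bool:
        """Apply op only if the olh version is smaller than the op id."""
        if self.version >= op.op_id:
            return False    # already applied or superseded: replay is a no-op
        self.version = op.op_id
        self.head = op.target
        return True


@dataclass
class BucketIndexJournal:
    next_id: int = 1
    ops: List[JournalOp] = field(default_factory=list)

    def log(self, target: Optional[str]) -> JournalOp:
        """Record a 'point olh at target' op with the next id."""
        op = JournalOp(self.next_id, target)
        self.next_id += 1
        self.ops.append(op)
        return op

    def trim(self, upto: int) -> None:
        """Drop every entry whose id is <= upto (the olh is already there)."""
        self.ops = [op for op in self.ops if op.op_id > upto]


if __name__ == "__main__":
    journal = BucketIndexJournal()
    olh = Olh()
    journal.log("v1")
    journal.log("v2")
    # Two gateways may replay the journal in any order, any number of times.
    for op in list(reversed(journal.ops)) + journal.ops:
        olh.apply(op)
    journal.trim(olh.version)
    print(olh, journal.ops)
```

Because stale ops are rejected rather than re-applied, a gateway that crashed mid-operation can simply be followed by another gateway replaying the same journal entries; the sketch does not model the atomic compare-and-set that a real object store would need for the version check.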
> The journal will be used for keeping order, and the entries in the
> journal will serve as a blueprint that the gateways will need to
> follow when applying changes. In order to ensure that operations that
> needed to be completed were done, we'll mark the olh before going to
> the bucket index, so that if the gateway died before completing the
> operation, next time we try to access the object we'll know that we
> need to go to the bucket index and complete the operation.
>
> Things will then work like this:

I take it there is also a:

* object read

1. look at olh
2. if marked as pending-modify,
   a. check index for current head version, and use that value
   b. if pending-modify is super old and no matching index entry exists,
      remove marker
   c. if index entry does exist, send async op to roll-forward the olh
3. read referenced object version

...and the 'roll-forward' on the olh would be something like

 cmpxattr pending-modify-$tag == 1
 cmpxattr olh_version == previous v
 setxattr olh_version = new v
 setxattr head_version = whatever
 rmxattr pending-modify-$tag

This has the side-effect that a hot object will briefly pummel the index.
That is probably fine...

> * object creation
>
> 1. Create object instance

is there a step 0 so that a failed rgw gets garbage collected?

> 2. Mark olh that it's about to be modified

 setxattr pending-modify-$tag=1

> 3. Update bucket index about new object instance

 omap_setkeys journal_$object_$olhversion_$tag = pending

?

> 4. Read bucket index object op journal
>
> Note that the journal should have at this point an entry that says
> 'point olh to specific object version, subject to olh is at version
> X'.
>
> 5. Apply journal ops

same as roll-forward event above?  unmark olh in the same op:

 cmpxattr pending-modify-$tag == 1
 cmpxattr olh_version == $olh_version_old
 setxattr olh_version = $olh_version_new
 setxattr head_version = whatever
 rmxattr pending-modify-$tag

> 6. Trim journal, unmark olh

Just trim the journal.
 call rgw.trim_journal($object, $olh_version_new)

...which can remove all prior journal entries too, since the olh is now at
that version (or something higher).

Am I on the right track?

sage

> * object removal (olh)
>
> 1. Mark olh that it's about to be modified
> 2. Update bucket index about the new deletion marker
> 3. Read bucket index object op journal
>
> The journal entry should say something like 'mark olh as removed,
> subject to olh is at version X'
>
> 4. Apply ops
> 5. Trim journal, unmark olh
>
> Another option is to actually remove the olh, but in this case we'll
> lose the olh versioning. We can in that case use the object
> non-existent state as a check, but that will not be enough as there
> are some corner cases where we could end up with the olh pointing at
> the wrong object.
>
> * object version removal
>
> 1. Mark olh as it will potentially be modified
> 2. Update bucket index about object instance removal
> 3. Read bucket index op journal
> 4. apply ops journal ...
> Now the journal might just say something like 'remove object
> instance', which means that the olh was pointing at a different object
> version. The more interesting case is when the olh is pointing at this
> specific object version. In this case the journal will say something
> like 'first point the olh at version V2, subject to olh is at version
> X. Now, remove object instance V1'.
>
> 5. Trim journal, unmark olh
>
>
> Note about olh marking: The olh mark will create an attr on the olh
> that will have an id and a timestamp. There could be multiple marks on
> the olh, and the marks should have some expiration, so that operations
> that did not really start would be removed after a while.
>
>
> Let me know if that makes sense, or if you have any questions.
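
The note on olh marking quoted above (several marks per olh, each with an id and a timestamp, expiring after a while) could look something like the sketch below. OlhMarks and MARK_TTL_SECONDS are hypothetical names, the 300 second default is an arbitrary placeholder since the thread gives no value, and the sketch expires marks on age alone; it does not model the check against the bucket index that the read path earlier in this reply would do before removing a marker.

```python
import time
from typing import Dict, List, Optional

MARK_TTL_SECONDS = 300.0  # placeholder expiry; not a value taken from the thread


class OlhMarks:
    """Pending-modify marks kept on one olh, keyed by an operation tag (id)."""

    def __init__(self, ttl: float = MARK_TTL_SECONDS) -> None:
        self.ttl = ttl
        self.marks: Dict[str, float] = {}  # tag -> timestamp when the mark was set

    def mark(self, tag: str, now: Optional[float] = None) -> None:
        """Record that the operation identified by tag is about to modify the olh."""
        self.marks[tag] = time.time() if now is None else now

    def unmark(self, tag: str) -> None:
        """Remove a mark; removing an absent mark is harmless."""
        self.marks.pop(tag, None)

    def expire(self, now: Optional[float] = None) -> List[str]:
        """Drop marks older than the ttl and return their tags."""
        now = time.time() if now is None else now
        stale = [tag for tag, ts in self.marks.items() if now - ts > self.ttl]
        for tag in stale:
            del self.marks[tag]
        return stale

    def pending(self) -> bool:
        """True while at least one operation may still be in flight."""
        return bool(self.marks)
```

Keying the marks by tag is what allows several gateways to mark the same olh concurrently without overwriting each other, which matches the pending-modify-$tag attribute naming used earlier in this reply.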
>
> Thanks,
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html