Re: object versioning

On Mon, Aug 18, 2014 at 5:22 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 14 Aug 2014, Yehuda Sadeh wrote:
>> >> The current scheme is that we update the bucket index using a 2 phase
>> >> commit, and it follows up on the objects state. So when adding /
>> >> removing an object, we first tell the bucket index to 'prepare' for
>> >> the operation, then do the operation, and eventually we let the bucket
>> >> index know about the completion. For ordering we rely on the pg
>> >> versioning system that gives us insight into the timeline, so that
>> >> when two concurrent operations happen on the same object the bucket
>> >> index can figure out who won and who is dead.
>> >> This system as it is doesn't really work with versioning as we have
>> >> both the olh, and the object instances. This is one of the solutions
>> >> that we came up with:
>> >>
>> >>  - The bucket index will be the source of the truth
>> >>  - The bucket index will serve as an operational log for olh operations
>> >>
>> >> The bucket index will index every object instance in reverse order
>> >> (from new to old). The bucket index will keep entries for deletion
>> >> markers.
>> >> The bucket index will also keep operations journal for olh
>> >> modifications. Each operation in this journal will have an id that
>> >> will be increased monotonically, and that will be tied into current
>> >> olh version. The olh will be modified using idempotent operations that
>> >> will be subject to having its current version smaller than the
>> >> operation id.
>> >> The journal will be used for keeping order, and the entries in the
>> >> journal will serve as a blueprint that the gateways will need to
>> >> follow when applying changes. To ensure that operations that
>> >> need to complete are actually completed, we'll mark the olh before going to
>> >> the bucket index, so that if the gateway died before completing the
>> >> operation, next time we try to access the object we'll know that we
>> >> need to go to the bucket index and complete the operation.
>> >>
>> >> Things will then work like this:
>> >
>> > I take it there is also a:
>> >
>> > * object read
>> >
>> > 1. look at olh
>> > 2. if marked as pending-modify,
>> >    a. check index for current head version, and use that value
>> >    b. if pending-modify is super old and no matching index entry exists,
>> >       remove marker
>> >    c. if index entry does exist, send async op to roll-forward the olh
>> > 3. read referenced object version
>> >
>> > ...and the 'roll-forward' on the olh would be something like
>> >
>> >  cmpxattr pending-modify-$tag == 1
>>
>> I'm not sure we need this comparison. What really matters is the
>> actual olh version.
>
> Yeah
>
>> >  cmpxattr olh_version == previous v
>>
>> Maybe it should actually be cmpxattr olh_version < new v
>>
>> >  setxattr olh_version = new v
>> >  setxattr head_version = whatever
>> >  rmxattr pending-modify-$tag
>
> But then we also need to rmxattr pending-modify-$tag for all prior
> modifications that are in the index/journal at the time.

Right. The index should provide that list.
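
As a rough sketch of that roll-forward, assuming the olh xattrs can be
modeled as a plain dict and that the whole function stands in for one
atomic compound op: the names roll_forward, olh and prior_tags are
hypothetical, and the guard uses the "olh_version < new v" form suggested
above.

```python
def roll_forward(olh, new_version, head_version, prior_tags):
    """Apply one journal entry to the olh; return True if it took effect.

    olh is a dict standing in for the olh object's xattrs (an assumption
    for illustration; the real thing would be a single atomic op).
    """
    # Guard: only move forward. A replayed or stale op is a no-op.
    if olh.get("olh_version", 0) >= new_version:
        return False
    olh["olh_version"] = new_version
    olh["head_version"] = head_version
    # Clear the marks for this op and for all prior modifications
    # that the index reported.
    for tag in prior_tags:
        olh.pop("pending-modify-" + tag, None)
    return True


olh = {"olh_version": 3, "head_version": "v3",
       "pending-modify-a": 1, "pending-modify-b": 1}
first = roll_forward(olh, 4, "v4", ["a", "b"])
second = roll_forward(olh, 4, "v4", ["a", "b"])  # replay, changes nothing
```

Because the guard is a version comparison rather than a tag check, a
gateway that replays the same journal entry leaves the olh untouched.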

>
>> >
>> > This has the side-effect that a hot object will briefly pummel the index.
>> > That is probably fine...
>> >
>> >> * object creation
>> >>
>> >> 1. Create object instance
>> >
>> >  is there a step 0 so that a failed rgw gets garbage collected?
>>
>> What scenario are you worried about? Incomplete operations should be
>> taken care of by step (5)
>
> If we fail before 2 then the (partial) object version should get garbage
> collected.

There needs to be some mechanism in the index to identify these cases.

>
>> >> 2. Mark olh that it's about to be modified
>> >
>> >  setxattr pending-modify-$tag=1
>> >
>> >> 3. Update bucket index about new object instance
>> >
>> >  omap_setkeys journal_$object_$olhversion_$tag = pending ?
>>
>> Yeah, something along these lines.
>>
>> >
>> >> 4. Read bucket index object op journal
>> >>
>> >> Note that the journal should have at this point an entry that says
>> >> 'point olh to specific object version, subject to olh is at version
>> >> X'.
>> >>
>> >> 5. Apply journal ops
>> >
>> > same as roll-forward event above?  unmark olh in the same op:
>> >
>> >  cmpxattr pending-modify-$tag == 1
>> >  cmpxattr olh_version == $olh_version_old
>> >  setxattr olh_version = $olh_version_new
>> >  setxattr head_version = whatever
>> >  rmxattr pending-modify-$tag
>> >
>> >> 6. Trim journal, unmark olh
>> >
>> > Just trim the journal.
>> >
>> >  call rgw.trim_journal($object, $olh_version_new)
>> >
>> > ...which can remove all prior journal entries too, since the olh is now at
>> > that version (or something higher).
>
> Moving on to the other ops:
>
>> >> * object removal (olh)
>> >>
>> >> 1. Mark olh that it's about to be modified
>
>  setxattr pending-modify-$tagthing
>
>> >> 2. Update bucket index about the new deletion marker
>
>  omap_setkeys ...
>
>> >> 3. Read bucket index object op journal
>> >>
>> >> The journal entry should say something like 'mark olh as removed,
>> >> subject to olh is at version X'
>
>  call rgw.describe_olh_op $bucket $object  ?

Yeah

>
>> >> 4. Apply ops
>
>  cmpxattr olh_version == $olh_version_old
>  setxattr olh_version = $olh_version_new
>  setxattr head_version = whiteout
>  rmxattr pending-modify-$tag (for all pending tags)
>
>> >> 5. Trim journal, unmark olh
>> >>
>> >> Another option is to actually remove the olh, but in this case we'll
>> >> lose the olh versioning. We can in that case use the object
>> >> non-existent state as a check, but that will not be enough as there
>> >> are some corner cases where we could end up with the olh pointing at
>> >> the wrong object.
>
> Yeah, it seems simplest to keep the olh as long as there are object
> versions.
>
>
>> >> * object version removal
>> >>
>> >> 1. Mark olh as it will potentially be modified
>
>  setxattr pending-modify-$tag
>
>> >> 2. Update bucket index about object instance removal
>
>  omap_setkeys ...
>
>> >> 3. Read bucket index op journal
>
>  call rgw.describe_olh_op $bucket $object $tag
>
>> >> 4. apply ops journal ...
>> >> Now the journal might just say something like 'remove object
>> >> instance', which means that the olh was pointing at a different object
>> >> version. The more interesting case is when the olh is pointing at this
>> >> specific object version. In this case the journal will say something
>> >> like 'first point the olh at version V2, subject to olh is at version
>> >> X. Now, remove object instance V1'.
>
>  cmpxattr olh_version == $olh_version_old
>  setxattr olh_version = $olh_version_new
>  rmxattr pending-modify-$tag (for all pending tags)
>
> It seems like one could get away with not touching the olh for removing
> old object versions, but I'm not sure it's worth it?
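
To make the two cases concrete, here is a small sketch of what the index
might hand back for an instance removal. Everything in it is an
assumption for illustration: the function name, the op tuples, and the
choice of the newest remaining instance as the new head. The remove_olh
branch models the case where no versions are left.

```python
def describe_version_removal(instances, head, removed, olh_version):
    """Return the journal ops for removing one object instance.

    instances is ordered newest to oldest, as in the bucket index.
    Each op is a (name, target, required_olh_version) tuple.
    """
    ops = []
    if removed == head:
        remaining = [v for v in instances if v != removed]
        if remaining:
            # Repoint the olh first, subject to its current version.
            ops.append(("link_olh", remaining[0], olh_version))
        else:
            # Nothing left to point at.
            ops.append(("remove_olh", None, olh_version))
    ops.append(("remove_instance", removed, None))
    return ops


head_case = describe_version_removal(["v3", "v2", "v1"], "v3", "v3", 7)
other_case = describe_version_removal(["v3", "v2", "v1"], "v3", "v1", 7)
```

In the second call the olh is never touched, which is the shortcut
mentioned above.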
>
>> >> 5. Trim journal, unmark olh
>> >>
>> >>
>> >> Note about olh marking: The olh mark will create an attr on the olh
>> >> that will have an id and a timestamp. There could be multiple marks on
>> >> the olh, and the marks should have some expiration, so that operations
>> >> that did not really start would be removed after a while.
>
> Ah, yeah.  So it's really something like
>
>  setxattr pending-modify-$tag = <timestamp>
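
A sketch of the expiration side, assuming the marks are xattrs whose
value is a timestamp in seconds and that the caller already fetched the
set of tags the index still knows about. expire_marks, index_tags and
max_age are made-up names; a mark is dropped only if it is both old and
missing from the index, matching the read path described earlier.

```python
import time

PREFIX = "pending-modify-"


def expire_marks(olh, index_tags, max_age, now=None):
    """Drop marks older than max_age seconds that have no index entry.

    olh is a dict standing in for the olh xattrs (illustrative only).
    Returns the list of tags whose marks were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for key in list(olh):
        if not key.startswith(PREFIX):
            continue
        tag = key[len(PREFIX):]
        if now - olh[key] > max_age and tag not in index_tags:
            del olh[key]
            removed.append(tag)
    return removed


olh = {"olh_version": 5,
       "pending-modify-old": 100.0,    # stale, never reached the index
       "pending-modify-new": 990.0,    # recent, may still be in flight
       "pending-modify-live": 100.0}   # old, but the index has an entry
expired = expire_marks(olh, {"live"}, 600, now=1000.0)
```

The "live" mark survives despite its age, since the index entry means
the op should be rolled forward rather than forgotten.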
>
> There is another case here when all versions get removed.  In that
> case, the final op would just remove the olh entirely.  Later, when we
> recreate the object, the object create would be
>
> 1. write object version
> 2. write to journal
> 3. describe olh op
> 4. create/update olh
> 5. trim journal
>
> ?
>

Sounds good to me.
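
For what it's worth, here is how I picture those five steps, as a toy
in-memory model. The BucketIndex class and its method names are
hypothetical stand-ins, not real interfaces. The one thing it tries to
show is that the op ids come from the index, so they keep increasing even
after the olh has been removed and recreated.

```python
class BucketIndex:
    """Toy stand-in for the bucket index (hypothetical, in-memory)."""

    def __init__(self):
        self.journal = {}   # object name -> list of (op_id, tag, instance)
        self.next_id = {}   # object name -> last issued op id

    def add_instance(self, obj, tag, instance):
        op_id = self.next_id.get(obj, 0) + 1
        self.next_id[obj] = op_id
        self.journal.setdefault(obj, []).append((op_id, tag, instance))

    def describe_olh_op(self, obj):
        return list(self.journal.get(obj, []))

    def trim_journal(self, obj, upto):
        self.journal[obj] = [e for e in self.journal.get(obj, [])
                             if e[0] > upto]


def create_object(index, olhs, obj, tag, instance):
    # 1. write object version (the instance data itself is elided here)
    # 2. write to journal
    index.add_instance(obj, tag, instance)
    # 3. describe olh op
    entries = index.describe_olh_op(obj)
    # 4. create/update olh, applying only entries newer than the olh
    olh = olhs.setdefault(obj, {"olh_version": 0})
    for op_id, _tag, head in entries:
        if olh["olh_version"] < op_id:
            olh["olh_version"] = op_id
            olh["head_version"] = head
    # 5. trim journal up to the version the olh now has
    index.trim_journal(obj, olh["olh_version"])
    return olh


index, olhs = BucketIndex(), {}
first = dict(create_object(index, olhs, "foo", "t1", "v1"))
del olhs["foo"]   # all versions removed, so the olh went away
second = dict(create_object(index, olhs, "foo", "t2", "v2"))
```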

Yehuda