On Mon, Aug 18, 2014 at 5:22 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 14 Aug 2014, Yehuda Sadeh wrote:
>> >> The current scheme is that we update the bucket index using a two-phase
>> >> commit, and it follows up on the object's state. So when adding /
>> >> removing an object, we first tell the bucket index to 'prepare' for
>> >> the operation, then do the operation, and eventually we let the bucket
>> >> index know about the completion. For ordering we rely on the pg
>> >> versioning system that gives us insight into the timeline, so that
>> >> when two concurrent operations happen on the same object the bucket
>> >> index can figure out who won and who is dead.
>> >> This system as it is doesn't really work with versioning as we have
>> >> both the olh, and the object instances. This is one of the solutions
>> >> that we came up with:
>> >>
>> >> - The bucket index will be the source of truth
>> >> - The bucket index will serve as an operational log for olh operations
>> >>
>> >> The bucket index will index every object instance in reverse order
>> >> (from new to old). The bucket index will keep entries for deletion
>> >> markers.
>> >> The bucket index will also keep an operations journal for olh
>> >> modifications. Each operation in this journal will have an id that
>> >> will be increased monotonically, and that will be tied to the current
>> >> olh version. The olh will be modified using idempotent operations that
>> >> will be subject to having its current version smaller than the
>> >> operation id.
>> >> The journal will be used for keeping order, and the entries in the
>> >> journal will serve as a blueprint that the gateways will need to
>> >> follow when applying changes.
>> >> In order to ensure that operations that
>> >> needed to be completed were done, we'll mark the olh before going to
>> >> the bucket index, so that if the gateway died before completing the
>> >> operation, next time we try to access the object we'll know that we
>> >> need to go to the bucket index and complete the operation.
>> >>
>> >> Things will then work like this:
>> >
>> > I take it there is also a:
>> >
>> > * object read
>> >
>> > 1. look at olh
>> > 2. if marked as pending-modify,
>> >    a. check index for current head version, and use that value
>> >    b. if pending-modify is super old and no matching index entry exists,
>> >       remove marker
>> >    c. if index entry does exist, send async op to roll-forward the olh
>> > 3. read referenced object version
>> >
>> > ...and the 'roll-forward' on the olh would be something like
>> >
>> >   cmpxattr pending-modify-$tag == 1
>>
>> I'm not sure we need to do this comparison. What really matters is the
>> actual olh version.
>
> Yeah
>
>> >   cmpxattr olh_version == previous v
>>
>> Maybe it should actually be cmpxattr olh_version < new v
>>
>> >   setxattr olh_version = new v
>> >   setxattr head_version = whatever
>> >   rmxattr pending-modify-$tag
>
> But then we also need to rmxattr pending-modify-$tag for all prior
> modifications that are in the index/journal at the time.

Right. The index should provide that list.

>
>> >
>> > This has the side-effect that a hot object will briefly pummel the index.
>> > That is probably fine...
>> >
>> >> * object creation
>> >>
>> >> 1. Create object instance
>> >
>> > is there a step 0 so that a failed rgw gets garbage collected?
>>
>> What scenario are you worried about? Incomplete operations should be
>> taken care of by step (5)
>
> If we fail before 2 then the (partial) object version should get garbage
> collected.

There needs to be some mechanism in the index to identify these cases.

>
>> >> 2. Mark olh that it's about to be modified
>> >
>> >   setxattr pending-modify-$tag=1
>> >
>> >> 3. Update bucket index about new object instance
>> >
>> >   omap_setkeys journal_$object_$olhversion_$tag = pending ?
>>
>> Yeah, something along these lines.
>>
>> >
>> >> 4. Read bucket index object op journal
>> >>
>> >> Note that the journal should have at this point an entry that says
>> >> 'point olh to specific object version, subject to olh is at version
>> >> X'.
>> >>
>> >> 5. Apply journal ops
>> >
>> > same as roll-forward event above? unmark olh in the same op:
>> >
>> >   cmpxattr pending-modify-$tag == 1
>> >   cmpxattr olh_version == $olh_version_old
>> >   setxattr olh_version = $olh_version_new
>> >   setxattr head_version = whatever
>> >   rmxattr pending-modify-$tag
>> >
>> >> 6. Trim journal, unmark olh
>> >
>> > Just trim the journal.
>> >
>> >   call rgw.trim_journal($object, $olh_version_new)
>> >
>> > ...which can remove all prior journal entries too, since the olh is now at
>> > that version (or something higher).
>
> Moving on to the other ops:
>
>> >> * object removal (olh)
>> >>
>> >> 1. Mark olh that it's about to be modified
>
>   setxattr pending-modify-$tagthing
>
>> >> 2. Update bucket index about the new deletion marker
>
>   omap_setkeys ...
>
>> >> 3. Read bucket index object op journal
>> >>
>> >> The journal entry should say something like 'mark olh as removed,
>> >> subject to olh is at version X'
>
>   call rgw.describe_olh_op $bucket $object ?

Yeah

>
>> >> 4. Apply ops
>
>   cmpxattr olh_version == $olh_version_old
>   setxattr olh_version = $olh_version_new
>   setxattr head_version = whiteout
>   rmxattr pending-modify-$tag (for all pending tags)
>
>> >> 5. Trim journal, unmark olh
>> >>
>> >> Another option is to actually remove the olh, but in this case we'll
>> >> lose the olh versioning. We can in that case use the object
>> >> non-existent state as a check, but that will not be enough as there
>> >> are some corner cases where we could end up with the olh pointing at
>> >> the wrong object.
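
The "apply journal ops" and "trim journal" steps discussed above can be
sketched as a toy model in Python. This is only an illustration under stated
assumptions, not rgw code: plain dicts stand in for the olh xattrs and the
bucket index journal, apply_olh_op and trim_journal are hypothetical names,
and the guard uses the "olh_version < new v" comparison suggested in the
thread. A failed guard changes nothing, mimicking a compound op that aborts
when its cmpxattr fails.

```python
# Toy model only: dicts stand in for olh xattrs and the index journal.
# apply_olh_op and trim_journal are hypothetical names, not rgw calls.

def apply_olh_op(olh, new_version, head_version, pending_tags):
    """Apply one journal entry to the olh, guarded by its version."""
    # cmpxattr olh_version < new v: abort if the olh already reached this op
    if olh.get("olh_version", 0) >= new_version:
        return False
    olh["olh_version"] = new_version      # setxattr olh_version = new v
    olh["head_version"] = head_version    # setxattr head_version = ...
    # rmxattr pending-modify-$tag for every tag the index listed
    for tag in pending_tags:
        olh.pop("pending-modify-" + tag, None)
    return True


def trim_journal(journal, olh_version):
    """Drop journal entries at or below the version the olh has reached."""
    return [entry for entry in journal if entry["version"] > olh_version]


olh = {"olh_version": 3, "head_version": "v1", "pending-modify-abc": 1}
journal = [{"version": 3}, {"version": 4}]

first = apply_olh_op(olh, 4, "v2", ["abc"])
second = apply_olh_op(olh, 4, "v2", ["abc"])  # replaying the same op
journal = trim_journal(journal, olh["olh_version"])
```

Replaying the same journal entry is rejected by the version guard, which is
what makes the operation safe to retry after a gateway failure.
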
>
> Yeah, it seems simplest to keep the olh as long as there are object
> versions.
>
>
>> >> * object version removal
>> >>
>> >> 1. Mark olh as it will potentially be modified
>
>   setxattr pending-modify-$tag
>
>> >> 2. Update bucket index about object instance removal
>
>   omap_setkeys ...
>
>> >> 3. Read bucket index op journal
>
>   call rgw.describe_olh_op $bucket $object $tag
>
>> >> 4. apply ops journal ...
>> >> Now the journal might just say something like 'remove object
>> >> instance', which means that the olh was pointing at a different object
>> >> version. The more interesting case is when the olh is pointing at this
>> >> specific object version. In this case the journal will say something
>> >> like 'first point the olh at version V2, subject to olh is at version
>> >> X. Now, remove object instance V1'.
>
>   cmpxattr olh_version == $olh_version_old
>   setxattr olh_version = $olh_version_new
>   rmxattr pending-modify-$tag (for all pending tags)
>
> It seems like one could get away with not touching the olh for removing
> old object versions, but I'm not sure it's worth it?
>
>> >> 5. Trim journal, unmark olh
>> >>
>> >>
>> >> Note about olh marking: The olh mark will create an attr on the olh
>> >> that will have an id and a timestamp. There could be multiple marks on
>> >> the olh, and the marks should have some expiration, so that operations
>> >> that did not really start would be removed after a while.
>
> Ah, yeah. So it's really something like
>
>   setxattr pending-modify-$tag = <timestamp>
>
> There is another case here when all versions get removed. In that
> case, the final op would just remove the olh entirely. Later, when we
> recreate the object, the object create would be
>
>  1. write object version
>  2. write to journal
>  3. describe olh op
>  4. create/update olh
>  5. trim journal
>
> ?
>

Sounds good to me.
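
The timestamped pending-modify marks and their expiration could look roughly
like the Python sketch below. It is a toy model under stated assumptions: a
dict stands in for the olh xattrs, mark_pending and expire_stale_marks are
hypothetical names, and max_age is an invented parameter, since the thread
does not say how long a mark should live. A mark is dropped only when it is
old and the index has no matching entry, as in the object read steps.

```python
import time

PREFIX = "pending-modify-"


def mark_pending(olh, tag, now=None):
    """setxattr pending-modify-$tag = <timestamp> on the toy olh dict."""
    olh[PREFIX + tag] = time.time() if now is None else now


def expire_stale_marks(olh, index_tags, max_age, now=None):
    """Remove marks older than max_age that have no matching index entry."""
    now = time.time() if now is None else now
    removed = []
    for key in list(olh):
        if not key.startswith(PREFIX):
            continue  # not a pending-modify mark (e.g. olh_version)
        tag = key[len(PREFIX):]
        if tag not in index_tags and now - olh[key] > max_age:
            del olh[key]  # rmxattr pending-modify-$tag
            removed.append(tag)
    return removed


olh = {"olh_version": 4}
mark_pending(olh, "old", now=100.0)    # never reached the index
mark_pending(olh, "live", now=100.0)   # has a journal entry in the index
mark_pending(olh, "new", now=990.0)    # too recent to expire

removed = expire_stale_marks(olh, index_tags={"live"}, max_age=600.0,
                             now=1000.0)
```

Marks that still have an index entry are left alone here, because those
operations need to be rolled forward rather than forgotten.
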
Yehuda