Re: rgw: decoupling the bucket replication log from the bucket index

On 12/9/19 4:44 PM, Sage Weil wrote:
On Mon, 9 Dec 2019, Casey Bodley wrote:
The bucket index logs used for multisite replication are currently stored in
omap on the bucket index shards, along with the rest of the bucket index
entries. Storing them in the index was a natural choice, because cls_rgw can
write these log entries atomically when completing a bucket index transaction.

To replicate a bucket, other zones process each of its bucket index shard logs
independently, and store sync status markers with their position in each
shard. This tight coupling between the replication strategy and the bucket's
sharding scheme is the main challenge to supporting bucket resharding in
multisite, because shuffling these log entries to a new set of shards would
invalidate the sync status markers stored in other zones.

My proposal, then, is to move the replication logs out of bucket index shards
into a single log per bucket, and extend the consistency model to make up for
the lack of atomic writes that we get from cls_rgw.

The existing consistency model for object writes involves a) calling cls_rgw
to prepare a bucket index transaction, b) writing the object's head to the
data pool, then c) calling cls_rgw to complete the transaction. Since the
write in b) is what makes the object visible to GET requests, we can reply to
the client without waiting for c) to finish. If either b) or c) fails, the
next bucket listing will find an entry that was prepared but not completed,
and we'll check whether the head object exists and use the 'dir suggest' call
to update the bucket index accordingly.
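A toy sketch of that three-step flow and the dir-suggest recovery (the helper names here are illustrative, not the actual cls_rgw interface):

```python
# Illustrative model of the existing consistency scheme: (a) prepare on the
# index shard, (b) write the head to the data pool, (c) complete on the index
# shard. Names like prepare_txn/complete_txn are hypothetical.

class BucketIndexShard:
    def __init__(self):
        self.entries = {}          # object name -> 'prepared' or 'complete'

    def prepare_txn(self, name):
        # (a) cls_rgw records a pending transaction on the index shard
        self.entries[name] = 'prepared'

    def complete_txn(self, name):
        # (c) cls_rgw marks the entry complete (and, today, writes the
        # bilog entry atomically in the same omap update)
        self.entries[name] = 'complete'

    def dir_suggest(self, name, head_exists):
        # On the next listing, a still-'prepared' entry triggers recovery:
        # check the head object, then complete or remove the index entry.
        if self.entries.get(name) == 'prepared':
            if head_exists:
                self.entries[name] = 'complete'
            else:
                del self.entries[name]

data_pool = {}                     # stand-in for head objects in the data pool

def put_object(index, name, data, fail_before_complete=False):
    index.prepare_txn(name)        # (a)
    data_pool[name] = data         # (b) head write makes the object visible
    # the client reply can happen here, before (c)
    if fail_before_complete:
        return                     # simulate a crash between (b) and (c)
    index.complete_txn(name)       # (c)
```

Running put_object with a simulated failure leaves a 'prepared' entry that the next listing's dir_suggest resolves, which is the recovery path the paragraph above describes.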

If we move the replication log to a separate object, we'll need to write to
that as well before completing the transaction. And when dir suggest finds
head objects for uncompleted transactions, it can (re)write their replication
log entries before updating the bucket index. This recovery means that we can
still reply to the client before writing to the replication log, so the client
won't see any extra latency.
This makes sense to me!  Just to make sure I understand:

(a) prepare the bucket index txn
(b) update the head
(c) write the replication log entry
(d) clean up the index txn
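A minimal sketch of those four steps and the replay behavior (all names hypothetical; this models only the ordering and recovery, not the real cls_rgw calls):

```python
# Illustrative model of the proposed flow with a single per-bucket
# replication log object. A crash between (c) and (d) leaves a 'prepared'
# index entry; replaying via dir suggest re-appends the log entry, so
# duplicates are possible.

index = {}                # object name -> 'prepared' or 'complete'
data_pool = {}            # stand-in for head objects
replication_log = []      # one ordered log per bucket

def put_object(name, data, fail_after=None):
    index[name] = 'prepared'          # (a) prepare the bucket index txn
    data_pool[name] = data            # (b) head write; object is now visible
    if fail_after == 'head':
        return                        # crash before the log write
    replication_log.append(name)      # (c) write the replication log entry
    if fail_after == 'log':
        return                        # crash before cleanup
    index[name] = 'complete'          # (d) clean up the index txn

def dir_suggest(name):
    # Recovery on the next bucket listing: for an uncompleted txn whose head
    # exists, (re)write the replication log entry before completing the
    # index. A replay after a crash between (c) and (d) appends a duplicate.
    if index.get(name) == 'prepared':
        if name in data_pool:
            replication_log.append(name)
            index[name] = 'complete'
        else:
            del index[name]
```

Simulating a crash between (c) and (d) and then replaying shows the duplicated log entry.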

This means that if we fail after b and the dir_suggest replays, then we
may get duplicated (c) items.  Does it also mean that we might not
notice the dropped replication log entry right away?  Or maybe the
multisite map that tells us which buckets may be dirty means we can check
those bucket indexes for any possible in-progress transaction?  Otherwise
we might end up not registering the replication log item until (much)
later.

That's true: dir suggest only guarantees that the next bucket listing is
consistent; if nobody lists the bucket, this recovery never runs. We're in
the same situation now, where cls_rgw's rgw_dir_suggest_changes() call is
what writes to the bilog.

This also means that there could be duplicate items in the replication
log for the same update.

An alternative might be to do steps (a) and (c) in parallel, but then the
replication log entry might reflect a head update that hasn't updated yet
(or perhaps never happens), which would make the replication machinery
more complex.

This change also gives us the opportunity to move away from omap and the
challenges associated with trimming. Yehuda wrote cls_fifo in
https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and that
could be a good fit for these bucket replication logs as well.
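For flavor, the fifo idea might look roughly like this toy model (my sketch of the concept, not the cls_fifo interface from that PR): entries are appended to fixed-size parts, and trimming drops whole parts instead of deleting individual omap keys.

```python
# Conceptual model of a fifo-style log, as an alternative to omap: entries
# live in bounded "parts", so trimming can discard whole parts cheaply
# rather than issuing per-key omap deletes. All details are assumptions
# for illustration.

class FifoLog:
    def __init__(self, entries_per_part=2):
        self.entries_per_part = entries_per_part
        self.parts = {0: []}          # part number -> list of entries
        self.head = 0                 # part currently being appended to
        self.tail = 0                 # oldest retained part

    def push(self, entry):
        # start a new part when the current one is full
        if len(self.parts[self.head]) == self.entries_per_part:
            self.head += 1
            self.parts[self.head] = []
        self.parts[self.head].append(entry)
        # return a marker: (part number, offset within the part)
        return (self.head, len(self.parts[self.head]) - 1)

    def trim(self, marker):
        # drop every part strictly older than the marker's part;
        # entries within the marker's own part are retained (a
        # simplification of partial-part trimming)
        part, _ = marker
        while self.tail < part:
            del self.parts[self.tail]
            self.tail += 1

    def list_from(self, marker=None):
        # return (marker, entry) pairs after the given marker
        start = marker if marker is not None else (self.tail, -1)
        out = []
        for p in range(self.tail, self.head + 1):
            for i, entry in enumerate(self.parts[p]):
                if (p, i) > start:
                    out.append(((p, i), entry))
        return out
```

A consumer's sync status marker here is just a (part, offset) pair, and trimming up to a marker frees whole parts at once, which sidesteps the omap trimming cost mentioned above.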
+1

sage
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
