Re: rgw: decoupling the bucket replication log from the bucket index

Sage Weil <sage@xxxxxxxxxxxx> · Mon, 9 Dec 2019 21:44:31 +0000 (UTC)

On Mon, 9 Dec 2019, Casey Bodley wrote:
> The bucket index logs used for multisite replication are currently stored in
> omap on the bucket index shards, along with the rest of the bucket index
> entries. Storing them in the index was a natural choice, because cls_rgw can
> write these log entries atomically when completing a bucket index transaction.
> 
> To replicate a bucket, other zones process each of its bucket index shard logs
> independently, and store sync status markers with their position in each
> shard. This tight coupling between the replication strategy and the bucket's
> sharding scheme is the main challenge to supporting bucket resharding in
> multisite, because shuffling these log entries to a new set of shards would
> invalidate the sync status markers stored in other zones.
> 
> My proposal, then, is to move the replication logs out of bucket index shards
> into a single log per bucket, and extend the consistency model to make up for
> the lack of atomic writes that we get from cls_rgw.
> 
> The existing consistency model for object writes involves a) calling cls_rgw
> to prepare a bucket index transaction, b) writing the object's head to the
> data pool, then c) calling cls_rgw to complete the transaction. Since the
> write in b) is what makes the object visible to GET requests, we can reply to
> the client without waiting for c) to finish. If either b) or c) fails, the
> next bucket listing will find an entry that was prepared but not completed,
> and we'll check whether the head object exists and use the 'dir suggest' call
> to update the bucket index accordingly.
> 
> If we move the replication log to a separate object, we'll need to write to
> that as well before completing the transaction. And when dir suggest finds
> head objects for uncompleted transactions, it can (re)write their replication
> log entries before updating the bucket index. This recovery means that we can
> still reply to the client before writing to the replication log, so the client
> won't see any extra latency.

This makes sense to me!  Just to make sure I understand:

(a) prepare the bucket index txn
(b) update the head
(c) write the replication log entry
(d) clean up the index txn

This means that if we fail after (b) and dir_suggest replays the
transaction, we may get duplicate (c) entries.  Does it also mean that we
might not notice the dropped replication log entry right away?  Or maybe
the multisite map that tells us which buckets may be dirty means we can
check those bucket indexes for any in-progress transactions?  Otherwise
we might end up not registering the replication log item until (much)
later.

This also means that there could be duplicate items in the replication 
log for the same update.
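
To make the failure windows concrete, here is a toy in-memory model of the
(a)-(d) sequence and the dir_suggest replay. It is only a sketch under
stated assumptions: Bucket, put_object, dir_suggest and the fail_after
parameter are hypothetical names for illustration, not rgw or cls_rgw APIs,
and real entries would carry more than the object key.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    # Hypothetical stand-ins for the bucket index, data pool and log object.
    index: dict = field(default_factory=dict)     # key -> "prepared" | "complete"
    heads: set = field(default_factory=set)       # keys whose head object exists
    repl_log: list = field(default_factory=list)  # single per-bucket log


def put_object(b, key, fail_after=None):
    """Run steps (a)-(d), optionally stopping after the named step."""
    b.index[key] = "prepared"      # (a) prepare the bucket index txn
    if fail_after == "a":
        return
    b.heads.add(key)               # (b) write the head; object is now visible
    if fail_after == "b":
        return
    b.repl_log.append(key)         # (c) write the replication log entry
    if fail_after == "c":
        return
    b.index[key] = "complete"      # (d) complete the index txn


def dir_suggest(b):
    """Recovery on listing: resolve entries that were prepared but not completed."""
    for key, state in list(b.index.items()):
        if state != "prepared":
            continue
        if key in b.heads:
            # The head exists, so (re)write the log entry. We cannot tell
            # whether (c) already ran, so this may append a duplicate.
            b.repl_log.append(key)
            b.index[key] = "complete"
        else:
            # The head never landed, so drop the pending index entry.
            del b.index[key]
```

In this model, a failure between (c) and (d) leaves two log entries for the
same key after dir_suggest runs, while a failure between (b) and (c) leaves
the log empty until something lists the bucket and triggers dir_suggest.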

An alternative might be to do steps (a) and (c) in parallel, but then the
replication log entry might reflect a head update that hasn't happened yet
(or perhaps never will), which would make the replication machinery
more complex.
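
A rough sketch of why that ordering is awkward, again as a toy model with
hypothetical names (put_object_parallel, dangling_entries) rather than
anything in rgw: once the log entry is written alongside the prepare, a
failed head write leaves an entry that a log consumer has to detect and skip.

```python
def put_object_parallel(index, heads, repl_log, key, head_write_succeeds):
    """Alternative ordering: the log entry is written together with the prepare."""
    index[key] = "prepared"   # (a) prepare the bucket index txn
    repl_log.append(key)      # (c) issued alongside (a), before the head exists
    if not head_write_succeeds:
        return                # the head update never happens
    heads.add(key)            # (b) write the head
    index[key] = "complete"   # (d) complete the index txn


def dangling_entries(heads, repl_log):
    """Log entries that refer to a head that does not (yet) exist."""
    return [key for key in repl_log if key not in heads]
```

Any consumer of such a log would need a check like dangling_entries, plus
some way to tell "not written yet" apart from "never will be", before it
could act on an entry.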

> This change also gives us the opportunity to move away from omap and the
> challenges associated with trimming. Yehuda wrote cls_fifo in
> https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and that
> could be a good fit for these bucket replication logs as well.

+1

sage