Re: rgw: decoupling the bucket replication log from the bucket index

Yehuda Sadeh-Weinraub <ysadehwe@xxxxxxxxxx> · Tue, 10 Dec 2019 09:04:09 +0200

On Mon, Dec 9, 2019 at 11:44 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Mon, 9 Dec 2019, Casey Bodley wrote:
> > The bucket index logs used for multisite replication are currently stored in
> > omap on the bucket index shards, along with the rest of the bucket index
> > entries. Storing them in the index was a natural choice, because cls_rgw can
> > write these log entries atomically when completing a bucket index transaction.
> >
> > To replicate a bucket, other zones process each of its bucket index shard logs
> > independently, and store sync status markers with their position in each
> > shard. This tight coupling between the replication strategy and the bucket's
> > sharding scheme is the main challenge to supporting bucket resharding in
> > multisite, because shuffling these log entries to a new set of shards would
> > invalidate the sync status markers stored in other zones.
> >
> > My proposal, then, is to move the replication logs out of bucket index shards
> > into a single log per bucket, and extend the consistency model to make up for
> > the lack of atomic writes that we get from cls_rgw.
> >
> > The existing consistency model for object writes involves a) calling cls_rgw
> > to prepare a bucket index transaction, b) writing the object's head to the
> > data pool, then c) calling cls_rgw to complete the transaction. Since the
> > write in b) is what makes the object visible to GET requests, we can reply to
> > the client without waiting for c) to finish. If either b) or c) fails, the
> > next bucket listing will find an entry that was prepared but not completed,
> > and we'll check whether the head object exists and use the 'dir suggest' call
> > to update the bucket index accordingly.
> >
> > If we move the replication log to a separate object, we'll need to write to
> > that as well before completing the transaction. And when dir suggest finds
> > head objects for uncompleted transactions, it can (re)write their replication
> > log entries before updating the bucket index. This recovery means that we can
> > still reply to the client before writing to the replication log, so the client
> > won't see any extra latency.
>
> This makes sense to me!  Just to make sure I understand:
>
> (a) prepare the bucket index txn
> (b) update the head
> (c) write the replication log entry
> (d) clean up the index txn
>
> This means that if we fail after b and the dir_suggest replays, then we
> may get duplicated (c) items.  Does it also mean that we might not
> notice the dropped replication log entry right away?  Or maybe the
> multisite map that tells us which buckets may be dirty means we can check
> those bucket indexes for any possible in-progress transaction?  Otherwise
> we might end up not registering the replication log item until (much)
> later.

I wouldn't worry about duplicate items. In the worst case we'd try to
sync the same entry twice, and the second attempt would find that the
object already exists (the same holds for deletes).
There is currently no efficient way to check for in-flight
transactions in the bucket index.
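
To make the ordering concrete, here is a minimal in-memory sketch of
steps (a) through (d) plus dir suggest replay. It is only a toy model:
Bucket, put_object, dir_suggest, sync and the fail_after parameter are
hypothetical names for illustration, not cls_rgw or RGW APIs, and the
per-write "tag" used to identify an object version is an assumption.
It shows that a failure after (c) leaves a duplicate log entry that the
sync treats as a no-op, and that a failure after (b) leaves no log
entry at all until a listing triggers recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    pending: dict = field(default_factory=dict)   # prepared index txns: key -> tag
    index: dict = field(default_factory=dict)     # completed index entries: key -> tag
    heads: dict = field(default_factory=dict)     # head objects in the data pool: key -> tag
    repl_log: list = field(default_factory=list)  # single per-bucket log of (key, tag)


def complete(b, key, tag):
    """Step (d): complete the index transaction."""
    b.index[key] = tag
    b.pending.pop(key, None)


def put_object(b, key, tag, fail_after=None):
    """Run steps (a)-(d); fail_after simulates a crash after the named step."""
    b.pending[key] = tag              # (a) prepare the bucket index txn
    if fail_after == "a":
        return
    b.heads[key] = tag                # (b) write the head; object is now visible
    if fail_after == "b":
        return
    b.repl_log.append((key, tag))     # (c) write the replication log entry
    if fail_after == "c":
        return
    complete(b, key, tag)             # (d) complete the index txn


def dir_suggest(b):
    """Recovery on listing: resolve entries that were prepared but not completed."""
    for key, tag in list(b.pending.items()):
        if b.heads.get(key) == tag:
            # Head exists: (re)write the log entry, possibly duplicating it.
            b.repl_log.append((key, tag))
            complete(b, key, tag)
        else:
            # Head was never written: drop the prepared entry.
            del b.pending[key]


def sync(b, remote):
    """Replay the log on a remote zone; return how many entries changed it."""
    applied = 0
    for key, tag in b.repl_log:
        if remote.get(key) == tag:
            continue                  # already exists: duplicate entry is a no-op
        remote[key] = tag
        applied += 1
    return applied


# Usage: a crash after (c), followed by a listing, duplicates the log entry.
bucket = Bucket()
put_object(bucket, "obj", "t1", fail_after="c")
dir_suggest(bucket)
remote_zone = {}
changed = sync(bucket, remote_zone)
```

In this model the cost of a duplicate is one extra comparison on the
remote side. The delay in noticing a write that failed after (b) is not
addressed: nothing enters the log until something lists the bucket.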

>
> This also means that there could be duplicate items in the replication
> log for the same update.
>
> An alternative might be to do steps (a) and (c) in parallel, but then the
> replication log entry might reflect a head update that hasn't updated yet
> (or perhaps never happens), which would make the replication machinery
> more complex.

I'm not sure how that could be solved; it would be inherently racy.
Deletes in particular would be hard to handle without introducing
tombstones.
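
One possible reading of the delete problem, sketched below, is that a
remote zone processing an early log entry for a put may find no head at
the source and cannot tell "not written yet" from "written, then
deleted". This is an illustrative assumption, not something spelled out
in this thread; remote_action_for_put and the head states "present" and
"tombstone" are hypothetical names.

```python
from typing import Optional


def remote_action_for_put(head: Optional[str]) -> str:
    """Decide what a remote zone can do with a put log entry that may have
    been written before the head update. `head` is the state observed at
    the source: "present", "tombstone", or None when nothing is there."""
    if head == "present":
        return "fetch"        # the write landed; copy the object
    if head == "tombstone":
        return "skip"         # the write landed and was later deleted
    # Nothing at the source: the head may not be written yet, may never be
    # written, or was written and deleted. Without a tombstone these cases
    # look identical to the remote zone.
    return "undecidable"


# Usage: only the tombstone state lets the remote drop the entry safely.
actions = [remote_action_for_put(s) for s in ("present", "tombstone", None)]
```

Keeping the log write after the head update, as in steps (a) through
(d), avoids the undecidable case for puts because a log entry then
always refers to a write that already happened.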

Yehuda

>
> > This change also gives us the opportunity to move away from omap and the
> > challenges associated with trimming. Yehuda wrote cls_fifo in
> > https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and that
> > could be a good fit for these bucket replication logs as well.
>
> +1
>
> sage
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
>