Re: rgw: draft of multisite resharding design for review

On Mon, Mar 23, 2020 at 11:37 PM Yehuda Sadeh-Weinraub
<yehuda@xxxxxxxxxx> wrote:
>
> On Mon, Mar 23, 2020 at 8:50 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> >
> >
> > On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> >>
> >> Sorry for the late reply, this went under my radar as I was looking
> >> into similar design issues.
> >>
> >> I'm looking into reducing the complexity of dealing with separate
> >> stages for full sync and incremental sync. It seems to me that we can
> >> maybe simplify things by moving that knowledge to the source. Instead
> >> of having the target behave differently on incremental vs full sync,
> >> we can create an api that would have the source send either the
> >> bucket listing or the bilog entries using a common result structure.
> >> The marker can be used by the source to distinguish the state and to
> >> keep other information.
> >> This would reduce complexity on the sync side, and we can do similar
> >> stuff for the metadata sync too. Moreover, it would later allow
> >> defining different data sources that don't have a bilog and whose
> >> scheme might be different.
> >> This is just the general idea. There are a lot of details to work out
> >> and I glossed over some of the complexities. However, what difference
> >> would such a change make to your design?
> >
> >
> > In this resharding design, full sync and incremental sync operate at different granularities, with full sync being per-bucket and incremental sync per-bilog-shard. So if you want to abstract them into a single pipe, it would probably need to be one pipe per bucket.
> >
>
> The idea I had in mind was to give the source the ability to send
> information about changes in the number of active shards at each
> stage. For example, let's say we start with 1 active shard (for full
> sync), then switch to 17 (incremental), then reshard to 23:
>
> Target -> Source: init
> Source -> Target: {
>   now = {
>     num_shards = 1,
>     markers = [ ... ],
>   },
>   next = {
>     num_shards = 17,
>     markers = [ ... ],
>   },
> }
>
> S -> T: fetch
> T -> S: data
> ...
> S -> T: fetch
> T -> S: complete, next!

Sorry, I got these reversed (and confusing). It should be:

T -> S: fetch
T <- S: data
...
T -> S: fetch
T <- S: complete, next!
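
To make the shape of that common result structure concrete, here is a
rough C++ sketch of what the init response could carry; all of the
names are illustrative, none of this exists in the code today:

#include <cstdint>
#include <string>
#include <vector>

// Per-stage description: how many shards are active in the stage and
// the marker to resume each of them from.
struct sync_stage_info {
  uint32_t num_shards = 0;
  std::vector<std::string> markers;   // one resume marker per shard
};

// The source's answer to "init": the stage to sync now and, if already
// known, the stage that follows it (e.g. full sync with 1 shard now,
// incremental sync with 17 shards next).
struct sync_init_result {
  sync_stage_info now;
  sync_stage_info next;
};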


>
> Then we'll move on to the next stage and start processing the new
> number of shards with the appropriate markers. When a reshard happens,
> the source will send back a notice that next.num_shards=23 with the
> corresponding markers. The next stage will only start after all the
> shards of the previous stage have been completed.
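
And a sketch of how the target could drive that stage rollover; again,
every name here is made up for illustration, markers omitted for
brevity:

#include <cstdint>
#include <vector>

// Hypothetical per-bucket state on the target: which stage we are
// syncing now and which of its shards have returned "complete, next!".
struct target_stage_state {
  uint32_t num_shards = 0;        // shards in the current stage
  std::vector<bool> shard_done;   // sized to num_shards
  uint32_t next_num_shards = 0;   // announced by the source
};

// Called when one shard of the current stage reports completion. Only
// after every shard of the stage is done do we roll over to the next
// stage and start fetching with its shard count and markers.
bool maybe_advance_stage(target_stage_state& state, uint32_t shard) {
  state.shard_done[shard] = true;
  for (bool done : state.shard_done) {
    if (!done) {
      return false;   // other shards of this stage still in progress
    }
  }
  state.num_shards = state.next_num_shards;
  state.shard_done.assign(state.num_shards, false);
  // next_num_shards (and its markers) get refreshed by a later notice
  // from the source, e.g. after a reshard to 23 shards.
  return true;
}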
>
>
> > I like the idea though. It definitely could simplify sync in the long run, but in the short term, we would still need to sync from octopus source zones. So a lot of the complexity would come from backward compatibility.
> >
>
> For backward compatibility we can create a module on the target side
> that would generate the needed information itself when the source
> doesn't support the new api.
>
> Yehuda
>
>
>
> >>
> >> Yehuda
> >>
> >> On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >> >
> >> > One option would be to change the mapping in choose_oid() so that it
> >> > doesn't spread bucket shards. But doing this in a way that's
> >> > backward-compatible with existing datalogs would be complicated and
> >> > probably require a lot of extra io (for example, writing all datalog
> >> > entries on the 'wrong' shard to the error repo of the new shard so they
> >> > get retried there). This spreading is an important scaling property,
> >> > allowing gateways to share and parallelize a sync workload that's
> >> > clustered around a small number of buckets.
> >> >
> >> > So I think the better option is to minimize this bucket-wide
> >> > coordination, and provide the locking/atomicity necessary to do it
> >> > across datalog shards. The common case (incremental sync of a bucket
> >> > shard) should require no synchronization, so we should continue to track
> >> > that incremental sync status in per-bucket-shard objects instead of
> >> > combining them into a single per-bucket object.
> >> >
> >> > We do still need some per-bucket status to track the full sync progress
> >> > and the current incremental sync generation.
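
To make this concrete, here is a rough sketch of what that per-bucket
status object could hold; the field names are invented for illustration
and don't correspond to existing code:

#include <cstdint>
#include <set>
#include <string>

// Hypothetical layout of the per-bucket sync status object. Per-shard
// incremental markers stay in their own per-bucket-shard objects and
// are not duplicated here.
struct bucket_sync_status {
  enum class state_t { init, full_sync, incremental_sync };
  state_t state = state_t::init;
  std::string full_sync_position;   // listing marker during full sync
  uint64_t incremental_gen = 0;     // current bilog generation
  std::set<uint32_t> done_shards;   // shards finished in this generation
};
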
> >> >
> >> > If the bucket status is in full sync, we can use a cls_lock on that
> >> > bucket status object to allow one shard to run the full sync, while
> >> > shards that fail to get the lock will go to the error repo for retry.
> >> > Once full sync completes, shards can go directly to incremental sync
> >> > without a lock.
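
In rough pseudocode (the lock and error-repo helpers below are
stand-ins, not the actual cls_lock client API), the full sync path
could look like:

#include <string>

// Stand-ins for the real pieces: a cls_lock exclusive lock on the
// bucket status object, the per-bucket full sync itself, and writing a
// retry entry to a datalog shard's error repo. Names are invented.
bool try_exclusive_lock(const std::string& status_obj);
void unlock(const std::string& status_obj);
void run_bucket_full_sync(const std::string& bucket);
void write_to_error_repo(int datalog_shard, const std::string& bucket);

// How one datalog shard would handle a bucket that is still in full
// sync: whoever wins the lock runs the per-bucket full sync; the
// losers park the entry in their shard's error repo and retry later.
void sync_bucket_in_full_sync(int datalog_shard, const std::string& bucket) {
  const std::string status_obj = "bucket.sync-status." + bucket;
  if (!try_exclusive_lock(status_obj)) {
    write_to_error_repo(datalog_shard, bucket);
    return;
  }
  run_bucket_full_sync(bucket);
  unlock(status_obj);
}
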
> >> >
> >> > To coordinate the transitions across generations (reshard events), we
> >> > need some new logic when incremental sync on one shard reaches the end
> >> > of its log. Instead of cls_lock here, we can use cls_version to do a
> >> > read-modify-write of the bucket status object. This will either add our
> >> > shard to the set of completed shards for the current generation, or (if
> >> > all other shards completed already) increment that generation number so
> >> > we can start processing incremental sync on shards from that generation.
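
A sketch of that read-modify-write, with the version-guarded read and
conditional write as placeholders for the actual cls_version calls:

#include <cstdint>
#include <set>

// Simplified view of the per-bucket status for this purpose.
struct bucket_sync_status {
  uint64_t incremental_gen = 0;     // current bilog generation
  uint32_t num_shards = 0;          // shards in this generation
  std::set<uint32_t> done_shards;   // shards finished in it
};

// Placeholders for a version-guarded read and a write that is rejected
// if the object changed since that read (cls_version would back these).
bucket_sync_status read_status(uint64_t* read_version);
bool write_status_if_unchanged(const bucket_sync_status& status,
                               uint64_t read_version);

// A shard that reached the end of its log either records itself as
// done, or, if it is the last one, bumps the generation so all shards
// can move on to the next generation's logs.
void on_shard_reached_end_of_log(uint32_t shard) {
  for (;;) {
    uint64_t version = 0;
    bucket_sync_status status = read_status(&version);
    status.done_shards.insert(shard);
    if (status.done_shards.size() == status.num_shards) {
      status.incremental_gen++;     // last shard: advance the generation
      status.done_shards.clear();
    }
    if (write_status_if_unchanged(status, version)) {
      return;                       // no one raced us; done
    }
    // someone else updated the object first: re-read and retry
  }
}
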
> >> >
> >> > This also modifies the handling of datalog entries for generations that
> >> > are newer than the current generation in the bucket status. In the
> >> > previous design draft, we would process these inline by running
> >> > incremental sync on all shards of the current generation - but this
> >> > would require coordination across datalog shards. So instead, each
> >> > unfinished shard in the bucket status can be written to the
> >> > corresponding datalog shard's error repo so it gets retried there. And
> >> > once these retries are successful, they will drive the transition to the
> >> > next generation and allow the original datalog entry to retry and make
> >> > progress.
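
Sketching that handling of a datalog entry from a newer generation (the
error-repo helper is again a stand-in, not an existing call):

#include <cstdint>
#include <set>
#include <string>

// Simplified view of the per-bucket status for this purpose.
struct bucket_sync_status {
  uint64_t incremental_gen = 0;     // current bilog generation
  std::set<uint32_t> done_shards;   // shards finished in it
};

// Stand-in for writing a bucket-shard retry entry to the datalog shard
// that the bucket shard maps to, so sync is retried there.
void write_to_error_repo(const std::string& bucket, uint32_t bucket_shard);

// A datalog entry whose generation is newer than the bucket's current
// one can't be synced yet: push every unfinished shard of the current
// generation to its error repo, let those retries finish the
// generation, and retry this entry afterwards.
bool handle_newer_generation_entry(const bucket_sync_status& status,
                                   uint64_t entry_gen, uint32_t num_shards,
                                   const std::string& bucket) {
  if (entry_gen <= status.incremental_gen) {
    return true;    // not actually newer: sync this entry normally
  }
  for (uint32_t shard = 0; shard < num_shards; ++shard) {
    if (status.done_shards.count(shard) == 0) {
      write_to_error_repo(bucket, shard);
    }
  }
  return false;     // keep the original datalog entry for retry
}
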
> >> >
> >> > Does this sound reasonable? I'll start preparing an update to the design
> >> > doc.
> >> >
> >> > On 3/11/20 11:53 AM, Casey Bodley wrote:
> >> > > Well shoot. It turns out that some critical pieces of this design were
> >> > > based on a flawed assumption about the datalog; namely that all
> >> > > changes for a particular bucket would map to the same datalog shard. I
> >> > > now see that writes to different shards of a bucket are mapped to
> >> > > different shards in RGWDataChangesLog::choose_oid().
> >> > >
> >> > > We use cls_locks on these datalog shards so that several gateways can
> >> > > sync the shards independently. But this isn't compatible with several
> >> > > elements of this resharding design that require bucket-wide coordination:
> >> > >
> >> > > * changing bucket full sync from per-bucket-shard to per-bucket
> >> > >
> >> > > * moving bucket sync status from per-bucket-shard to per-bucket
> >> > >
> >> > > * waiting for incremental sync of each bucket shard to complete before
> >> > > advancing to the next log generation
> >> > >
> >> > >
> >> > > On 2/25/20 2:53 PM, Casey Bodley wrote:
> >> > >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with
> >> > >> a design doc for dynamic resharding in multisite. Please review and
> >> > >> give feedback, either here or in PR comments!
> >> > >>
> >> >
> >>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


