Re: rgw: draft of multisite resharding design for review

inline

On Thu, Mar 12, 2020 at 3:16 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> One option would be to change the mapping in choose_oid() so that it
> doesn't spread bucket shards. But doing this in a way that's
> backward-compatible with existing datalogs would be complicated and
> probably require a lot of extra I/O (for example, writing all datalog
> entries on the 'wrong' shard to the error repo of the new shard so they
> get retried there). This spreading is an important scaling property,
> allowing gateways to share and parallelize a sync workload that's
> clustered around a small number of buckets.

I agree.  We know this to be important.
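
To make sure we're picturing the same thing, here's a toy model of that
spreading behavior. The hash and all the names below are made up for
illustration, not the actual RGWDataChangesLog::choose_oid() code:

// Toy model of the spreading described above: each bucket *shard* hashes
// to its own datalog shard, rather than all shards of a bucket landing on
// one. Illustrative only.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

static constexpr uint32_t num_datalog_shards = 128;

// Hypothetical stand-in for choose_oid(): the key includes the bucket
// shard id, so shards of one bucket spread across datalog shards.
uint32_t choose_datalog_shard(const std::string& bucket_id, int bucket_shard)
{
  const std::string key = bucket_id + ":" + std::to_string(bucket_shard);
  return std::hash<std::string>{}(key) % num_datalog_shards;
}

int main()
{
  // All of these belong to the same bucket, but land on different datalog
  // shards -- which is why bucket-wide coordination is the hard part.
  for (int shard = 0; shard < 4; ++shard) {
    std::cout << "bucket shard " << shard << " -> datalog shard "
              << choose_datalog_shard("mybucket:abc123", shard) << '\n';
  }
  return 0;
}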

>
> So I think the better option is to minimize this bucket-wide
> coordination, and provide the locking/atomicity necessary to do it
> across datalog shards. The common case (incremental sync of a bucket
> shard) should require no synchronization, so we should continue to track
> that incremental sync status in per-bucket-shard objects instead of
> combining them into a single per-bucket object.
>
> We do still need some per-bucket status to track the full sync progress
> and the current incremental sync generation.
>
> If the bucket status is in full sync, we can use a cls_lock on that
> bucket status object to allow one shard to run the full sync, while
> shards that fail to get the lock will go to the error repo for retry.
> Once full sync completes, shards can go directly to incremental sync
> without a lock.

Yes, this seems reasonable.
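
For my own notes, a rough sketch of that control flow, with placeholder
types standing in for the bucket status object, cls_lock, and the error
repo. None of this is the real interface, just the shape of the logic:

#include <iostream>
#include <string>

enum class BucketSyncState { FullSync, Incremental };

struct BucketStatus {
  BucketSyncState state = BucketSyncState::FullSync;
};

// Placeholder for a cls_lock-style try-lock on the per-bucket status object.
bool try_lock_bucket_status(const std::string& /*bucket*/)
{
  return true;  // pretend we won the lock
}

// Placeholder for writing a retry entry to that datalog shard's error repo.
void write_to_error_repo(const std::string& bucket, int datalog_shard)
{
  std::cout << "defer full sync of " << bucket
            << " via error repo on datalog shard " << datalog_shard << '\n';
}

void run_full_sync(const std::string& /*bucket*/) { /* copy all objects once */ }
void run_incremental_sync(const std::string& /*bucket*/, int /*shard*/) { /* replay bilog */ }

void sync_bucket_shard(BucketStatus& status, const std::string& bucket,
                       int bucket_shard, int datalog_shard)
{
  if (status.state == BucketSyncState::FullSync) {
    // Only one caller runs full sync; everyone else defers via the error repo.
    if (!try_lock_bucket_status(bucket)) {
      write_to_error_repo(bucket, datalog_shard);
      return;
    }
    run_full_sync(bucket);
    status.state = BucketSyncState::Incremental;  // persisted + unlocked in reality
  }
  // Incremental sync of one bucket shard needs no bucket-wide lock; its
  // status stays in a per-bucket-shard object.
  run_incremental_sync(bucket, bucket_shard);
}

int main()
{
  BucketStatus status;
  sync_bucket_shard(status, "mybucket", 0, 17);
  sync_bucket_shard(status, "mybucket", 1, 42);
}

What I like about this is that the lock only exists for the duration of
full sync; steady-state incremental sync stays lock-free.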

>
> To coordinate the transitions across generations (reshard events), we
> need some new logic when incremental sync on one shard reaches the end
> of its log. Instead of cls_lock here, we can use cls_version to do a
> read-modify-write of the bucket status object. This will either add our
> shard to the set of completed shards for the current generation, or (if
> all other shards completed already) increment that generation number so
> we can start processing incremental sync on the shards of the next generation.
>
> This also modifies the handling of datalog entries for generations that
> are newer than the current generation in the bucket status. In the
> previous design draft, we would process these inline by running
> incremental sync on all shards of the current generation - but this
> would require coordination across datalog shards. So instead, each
> unfinished shard in the bucket status can be written to the
> corresponding datalog shard's error repo so it gets retried there. And
> once these retries are successful, they will drive the transition to the
> next generation and allow the original datalog entry to retry and make
> progress.

ok
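
To check my understanding of the generation handoff, here's a rough
sketch with a plain version counter standing in for cls_version. The
types and the retry loop are illustrative only:

#include <cstdint>
#include <iostream>
#include <set>

struct BucketStatus {
  uint64_t generation = 0;        // current incremental sync generation
  uint32_t num_shards = 4;        // shard count of that generation
  std::set<uint32_t> completed;   // shards that finished this generation
  uint64_t version = 0;           // stand-in for cls_version
};

// Conditional write: fails if the object version changed since we read it,
// which is the behavior cls_version would enforce for us on the OSD.
bool try_complete_shard(BucketStatus& stored, BucketStatus snapshot, uint32_t shard)
{
  if (snapshot.version != stored.version) {
    return false;  // lost a race; caller re-reads and retries
  }
  snapshot.completed.insert(shard);
  if (snapshot.completed.size() == snapshot.num_shards) {
    // The last shard to finish advances the generation so incremental sync
    // can start on the new generation's shards.
    ++snapshot.generation;
    snapshot.completed.clear();
  }
  ++snapshot.version;
  stored = snapshot;
  return true;
}

int main()
{
  BucketStatus status;
  for (uint32_t shard = 0; shard < status.num_shards; ++shard) {
    BucketStatus snap = status;    // "read" the bucket status
    while (!try_complete_shard(status, snap, shard)) {
      snap = status;               // re-read on version mismatch
    }
  }
  std::cout << "advanced to generation " << status.generation << '\n';
}

If the conditional write fails, the worst case is a re-read and retry,
which seems cheap next to holding a lock across all of the shards.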

>
> Does this sound reasonable? I'll start preparing an update to the design
> doc.

Sounds promising to me.
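
One more sketch, of how I read the handling of a datalog entry whose
generation is ahead of the bucket status: each unfinished shard gets
pushed to its own datalog shard's error repo, and the original entry is
left to retry later. Everything below is a made-up model of that flow,
not real RGW code:

#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <string>

struct BucketStatus {
  uint64_t generation = 0;
  uint32_t num_shards = 4;
  std::set<uint32_t> completed;   // shards finished with the current generation
};

// Made-up stand-ins for choose_oid() and the error repo write.
uint32_t datalog_shard_for(const std::string& bucket, uint32_t shard)
{
  return std::hash<std::string>{}(bucket + ":" + std::to_string(shard)) % 128;
}

void write_to_error_repo(uint32_t datalog_shard, const std::string& bucket, uint32_t shard)
{
  std::cout << "error repo on datalog shard " << datalog_shard
            << ": retry " << bucket << " shard " << shard << '\n';
}

// Called when a datalog entry refers to a generation newer than the bucket
// status knows about. Returns false to mean "retry the original entry later".
bool handle_newer_generation_entry(const BucketStatus& status, const std::string& bucket)
{
  for (uint32_t shard = 0; shard < status.num_shards; ++shard) {
    if (!status.completed.count(shard)) {
      // Retrying these shards on their own datalog shards is what eventually
      // drives the generation transition forward.
      write_to_error_repo(datalog_shard_for(bucket, shard), bucket, shard);
    }
  }
  return false;
}

int main()
{
  BucketStatus status;
  status.completed = {0, 2};  // shards 1 and 3 still unfinished
  handle_newer_generation_entry(status, "mybucket");
}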

Matt

>
> On 3/11/20 11:53 AM, Casey Bodley wrote:
> > Well shoot. It turns out that some critical pieces of this design were
> > based on a flawed assumption about the datalog; namely that all
> > changes for a particular bucket would map to the same datalog shard. I
> > now see that writes to different shards of a bucket are mapped to
> > different shards in RGWDataChangesLog::choose_oid().
> >
> > We use cls_locks on these datalog shards so that several gateways can
> > sync the shards independently. But this isn't compatible with several
> > elements of this resharding design that require bucket-wide coordination:
> >
> > * changing bucket full sync from per-bucket-shard to per-bucket
> >
> > * moving bucket sync status from per-bucket-shard to per-bucket
> >
> > * waiting for incremental sync of each bucket shard to complete before
> > advancing to the next log generation
> >
> >
> > On 2/25/20 2:53 PM, Casey Bodley wrote:
> >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with
> >> a design doc for dynamic resharding in multisite. Please review and
> >> give feedback, either here or in PR comments!
> >>


-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


