One option would be to change the mapping in choose_oid() so that it
doesn't spread bucket shards. But doing this in a way that's
backward-compatible with existing datalogs would be complicated and
probably require a lot of extra I/O (for example, writing all datalog
entries on the 'wrong' shard to the error repo of the new shard so they
get retried there). This spreading is also an important scaling
property, since it allows gateways to share and parallelize a sync
workload that's clustered around a small number of buckets.
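For reference, the mapping is roughly like the sketch below. The hash, the shard count, and the names are illustrative stand-ins, not the actual RGW code:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>

constexpr uint32_t num_datalog_shards = 128;  // stand-in for rgw_data_log_num_shards

uint32_t choose_datalog_shard(const std::string& bucket_name, int bucket_shard_id)
{
  const auto h = static_cast<uint32_t>(std::hash<std::string>{}(bucket_name));
  // adding the bucket shard id is what spreads one bucket's shards across
  // different datalog shards
  return (h + static_cast<uint32_t>(std::max(bucket_shard_id, 0))) % num_datalog_shards;
}

Because the bucket shard id feeds the result, different shards of the same bucket generally land on different datalog shards, which is exactly the spreading described above.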
So I think the better option is to minimize this bucket-wide
coordination, and provide the locking/atomicity necessary for the
coordination that remains
across datalog shards. The common case (incremental sync of a bucket
shard) should require no synchronization, so we should continue to track
that incremental sync status in per-bucket-shard objects instead of
combining them into a single per-bucket object.
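As a rough sketch, the per-bucket-shard object would hold just that shard's incremental marker (field names here are hypothetical, not the actual RGW structures):

#include <cstdint>
#include <string>

// illustrative per-bucket-shard incremental sync status, one rados object
// per bucket shard, so the common incremental case touches nothing shared
struct bucket_shard_inc_sync_status {
  std::string inc_marker;   // position in this bucket shard's index log
  uint64_t generation = 0;  // which log generation the marker belongs to
};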
We do still need some per-bucket status to track the full sync progress
and the current incremental sync generation.
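Something like the following, again with hypothetical names:

#include <cstdint>
#include <set>
#include <string>

enum class BucketSyncState { Init, FullSync, Incremental, Stopped };

// illustrative per-bucket sync status: full vs incremental state, coarse
// full sync progress, the current incremental generation, and which shards
// have finished that generation
struct bucket_sync_status {
  BucketSyncState state = BucketSyncState::Init;
  std::string full_sync_marker;    // coarse full sync progress
  uint64_t incremental_gen = 0;    // generation currently being synced
  std::set<uint32_t> shards_done;  // shards finished with incremental_gen
};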
If the bucket status is in full sync, we can take a cls_lock on that
bucket status object so that only one shard runs the full sync, while
shards that fail to get the lock go to the error repo for retry.
Once full sync completes, shards can go directly to incremental sync
without a lock.
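A minimal sketch of that gating, building on the bucket_sync_status sketch above (take_full_sync_lock() stands in for taking a cls_lock exclusive lock on the per-bucket status object; the other names are hypothetical too):

#include <cstdint>
#include <string>

bool take_full_sync_lock(const std::string& status_oid)
{
  // stub: the real code would take cls_lock's exclusive lock on the
  // per-bucket sync status object and return false on -EBUSY
  return true;
}

// returns true if this datalog entry made progress, false if it should be
// written to the error repo and retried later
bool sync_bucket_shard(const std::string& bucket, uint32_t shard,
                       bucket_sync_status& status)
{
  if (status.state == BucketSyncState::FullSync) {
    if (!take_full_sync_lock("bucket.sync-status." + bucket)) {
      return false;  // another shard is running full sync; retry via error repo
    }
    // ... run full sync for the whole bucket, then flip to incremental ...
    status.state = BucketSyncState::Incremental;
  }
  // ... run incremental sync for (bucket, shard); no bucket-wide lock needed ...
  return true;
}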
To coordinate the transitions across generations (reshard events), we
need some new logic when incremental sync on one shard reaches the end
of its log. Instead of cls_lock here, we can use cls_version to do a
read-modify-write of the bucket status object. This will either add our
shard to the set of completed shards for the current generation, or (if
all other shards have already completed) increment the generation number
so we can start processing incremental sync on the shards of the next
generation.
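Roughly, that end-of-log step could look like the sketch below (again building on bucket_sync_status; the two helpers are stand-ins for the cls_version-guarded read and write, not real RGW functions):

#include <cstdint>
#include <string>

bool read_status_with_version(const std::string& oid, bucket_sync_status* status,
                              uint64_t* version)
{
  // stub: the real code would read the status object together with its
  // cls_version object version
  *status = bucket_sync_status{};
  *version = 0;
  return true;
}

bool write_status_if_version(const std::string& oid, const bucket_sync_status& status,
                             uint64_t expected_version)
{
  // stub: the real code would guard the write with cls_version so it fails
  // if the object version changed since our read
  return true;
}

void on_shard_reached_end_of_log(const std::string& status_oid, uint32_t shard,
                                 uint32_t num_shards)
{
  for (;;) {
    bucket_sync_status status;
    uint64_t version = 0;
    if (!read_status_with_version(status_oid, &status, &version)) {
      return;  // read failed; retry later via the error repo
    }
    status.shards_done.insert(shard);
    if (status.shards_done.size() == num_shards) {
      // all shards finished this generation; advance to the next one
      ++status.incremental_gen;
      status.shards_done.clear();
    }
    if (write_status_if_version(status_oid, status, version)) {
      return;  // our read-modify-write won
    }
    // version mismatch: another shard raced us, so re-read and try again
  }
}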
This also modifies the handling of datalog entries for generations that
are newer than the current generation in the bucket status. In the
previous design draft, we would process these inline by running
incremental sync on all shards of the current generation - but this
would require coordination across datalog shards. So instead, each
unfinished shard in the bucket status can be written to the
corresponding datalog shard's error repo so it gets retried there. And
once these retries are successful, they will drive the transition to the
next generation and allow the original datalog entry to retry and make
progress.
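To make that concrete, here's a sketch of the entry handling, reusing the hypothetical choose_datalog_shard() and bucket_sync_status from the earlier sketches (write_error_repo_entry() is another illustrative stand-in):

#include <cstdint>
#include <string>

void write_error_repo_entry(uint32_t datalog_shard, const std::string& bucket,
                            uint32_t bucket_shard, uint64_t gen)
{
  // stub: the real code would add a retry entry to that datalog shard's
  // error repo
}

// returns true if the datalog entry made progress, false if it should retry
bool handle_datalog_entry(const std::string& bucket, uint32_t bucket_shard,
                          uint64_t entry_gen, const bucket_sync_status& status,
                          uint32_t num_shards)
{
  if (entry_gen <= status.incremental_gen) {
    // normal path: sync this one bucket shard at the current generation
    return true;
  }
  // the entry is from a newer generation: push every shard that hasn't
  // finished the current generation to its own datalog shard's error repo,
  // so those retries drive the generation forward
  for (uint32_t shard = 0; shard < num_shards; ++shard) {
    if (status.shards_done.count(shard) == 0) {
      write_error_repo_entry(choose_datalog_shard(bucket, shard),
                             bucket, shard, status.incremental_gen);
    }
  }
  return false;  // retry this entry once the generation has advanced
}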
Does this sound reasonable? I'll start preparing an update to the design
doc.
On 3/11/20 11:53 AM, Casey Bodley wrote:
Well shoot. It turns out that some critical pieces of this design were
based on a flawed assumption about the datalog: namely, that all
changes for a particular bucket would map to the same datalog shard. I
now see that writes to different shards of a bucket are mapped to
different shards in RGWDataChangesLog::choose_oid().
We use cls_locks on these datalog shards so that several gateways can
sync the shards independently. But this isn't compatible with several
elements of this resharding design that require bucket-wide coordination:
* changing bucket full sync from per-bucket-shard to per-bucket
* moving bucket sync status from per-bucket-shard to per-bucket
* waiting for incremental sync of each bucket shard to complete before
advancing to the next log generation
On 2/25/20 2:53 PM, Casey Bodley wrote:
I opened a pull request https://github.com/ceph/ceph/pull/33539 with
a design doc for dynamic resharding in multisite. Please review and
give feedback, either here or in PR comments!