Re: rgw: draft of multisite resharding design for review

One option would be to change the mapping in choose_oid() so that it doesn't spread a bucket's shards across datalog shards. But doing this in a way that's backward-compatible with existing datalogs would be complicated and would probably require a lot of extra I/O (for example, writing all datalog entries on the 'wrong' shard to the error repo of the new shard so they get retried there). This spreading is also an important scaling property, allowing gateways to share and parallelize a sync workload that's clustered around a small number of buckets.
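
For illustration, here's a minimal sketch (not the actual RGWDataChangesLog::choose_oid() code) of the kind of mapping involved; the point is just that the bucket shard id is part of the key, so shards of one bucket land on different datalog shards:

    #include <functional>
    #include <string>

    // Illustrative only: derive a datalog shard from the bucket name plus the
    // bucket shard id. Because the bucket shard id is part of the hash key,
    // shards of the same bucket spread across datalog shards, which is what
    // lets multiple gateways sync a single hot bucket in parallel.
    int choose_datalog_shard(const std::string& bucket_name,
                             int bucket_shard_id,
                             int num_datalog_shards)
    {
      const std::string key = bucket_name + ":" + std::to_string(bucket_shard_id);
      return static_cast<int>(std::hash<std::string>{}(key) % num_datalog_shards);
    }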

So I think the better option is to minimize this bucket-wide coordination, and provide the locking/atomicity necessary to do it across datalog shards. The common case (incremental sync of a bucket shard) should require no synchronization, so we should continue to track that incremental sync status in per-bucket-shard objects instead of combining them into a single per-bucket object.

We do still need some per-bucket status to track the full sync progress and the current incremental sync generation.
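As a rough sketch (field names are illustrative, not the final rgw types), that per-bucket status only needs to carry the coarse state, the current generation, and which shards have finished that generation; the per-shard incremental markers stay in their own per-bucket-shard status objects:

    #include <cstdint>
    #include <set>

    // Hypothetical layout of the per-bucket status object; names illustrative.
    struct bucket_sync_status {
      enum class State : uint8_t { Init, FullSync, Incremental };
      State state = State::Init;
      uint64_t incremental_gen = 0;    // log generation currently being synced
      std::set<uint32_t> shards_done;  // shards finished with incremental_gen
    };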

If the bucket status is in full sync, we can use a cls_lock on that bucket status object to allow one shard to run the full sync, while shards that fail to get the lock will go to the error repo for retry. Once full sync completes, shards can go directly to incremental sync without a lock.
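The control flow for a datalog entry that arrives while the bucket is in full sync would look roughly like the sketch below, where try_exclusive_lock(), release_lock(), write_error_repo_entry() and run_bucket_full_sync() are hypothetical stand-ins for the cls_lock calls and the existing sync machinery:

    #include <cerrno>
    #include <cstdint>
    #include <string>

    // Hypothetical helpers, declared only so the sketch hangs together.
    bool try_exclusive_lock(const std::string& status_oid);
    void release_lock(const std::string& status_oid);
    void write_error_repo_entry(int datalog_shard, uint32_t bucket_shard);
    int run_bucket_full_sync(const std::string& status_oid);

    int sync_bucket_shard_full(const std::string& status_oid,
                               int datalog_shard, uint32_t bucket_shard)
    {
      if (!try_exclusive_lock(status_oid)) {
        // another datalog shard holds the lock and is driving full sync for
        // this bucket; park this entry in the error repo so it gets retried
        write_error_repo_entry(datalog_shard, bucket_shard);
        return -EBUSY;
      }
      const int r = run_bucket_full_sync(status_oid);
      release_lock(status_oid);
      // after full sync completes, shards proceed to incremental sync lock-free
      return r;
    }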

To coordinate the transitions across generations (reshard events), we need some new logic when incremental sync on one shard reaches the end of its log. Instead of cls_lock here, we can use cls_version to do a read-modify-write of the bucket status object. This will either add our shard to the set of completed shards for the current generation, or (if all other shards completed already) increment that generation number so we can start processing incremental sync on shards from that generation.
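Roughly, the end-of-log transition becomes a compare-and-swap loop on the bucket status. In this sketch, read_status() and write_status_if_version() are hypothetical wrappers around the versioned read and the conditional write that cls_version would provide:

    #include <cstdint>
    #include <set>
    #include <string>

    struct bucket_sync_status {        // subset of the fields sketched above
      uint64_t incremental_gen = 0;
      std::set<uint32_t> shards_done;
    };

    // Hypothetical wrappers around the versioned read and conditional write.
    bucket_sync_status read_status(const std::string& status_oid, uint64_t* ver);
    bool write_status_if_version(const std::string& status_oid,
                                 const bucket_sync_status& status, uint64_t ver);

    // Called when incremental sync of one bucket shard reaches the end of its
    // log for the current generation.
    void on_shard_reached_end_of_log(const std::string& status_oid,
                                     uint32_t bucket_shard, uint32_t num_shards)
    {
      for (;;) {
        uint64_t ver = 0;
        bucket_sync_status status = read_status(status_oid, &ver);
        status.shards_done.insert(bucket_shard);
        if (status.shards_done.size() == num_shards) {
          // every shard finished this generation: advance to the next one
          ++status.incremental_gen;
          status.shards_done.clear();
        }
        if (write_status_if_version(status_oid, status, ver)) {
          return;  // our read-modify-write landed
        }
        // version mismatch: another shard raced us; re-read and retry
      }
    }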

This also modifies the handling of datalog entries for generations that are newer than the current generation in the bucket status. In the previous design draft, we would process these inline by running incremental sync on all shards of the current generation - but this would require coordination across datalog shards. So instead, each unfinished shard in the bucket status can be written to the corresponding datalog shard's error repo so it gets retried there. And once these retries are successful, they will drive the transition to the next generation and allow the original datalog entry to retry and make progress.
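Using the same hypothetical helpers and bucket_sync_status fields as in the earlier sketches, the handling of a datalog entry whose generation is ahead of the bucket status would look something like this:

    #include <cerrno>
    #include <cstdint>
    #include <string>

    // Same hypothetical helpers as declared in the earlier sketches.
    int choose_datalog_shard(const std::string& bucket_name,
                             int bucket_shard_id, int num_datalog_shards);
    void write_error_repo_entry(int datalog_shard, uint32_t bucket_shard);

    // Handle a datalog entry whose generation is ahead of the bucket status.
    int handle_newer_generation(const bucket_sync_status& status,
                                const std::string& bucket_name,
                                uint64_t entry_gen, uint32_t num_shards,
                                int num_datalog_shards)
    {
      if (entry_gen <= status.incremental_gen) {
        return 0;  // not actually newer; normal incremental sync handles it
      }
      for (uint32_t shard = 0; shard < num_shards; ++shard) {
        if (status.shards_done.count(shard) == 0) {
          // write the unfinished shard to its own datalog shard's error repo;
          // successful retries there drive the generation forward
          const int dl = choose_datalog_shard(bucket_name, shard, num_datalog_shards);
          write_error_repo_entry(dl, shard);
        }
      }
      return -EAGAIN;  // retry the original datalog entry once sync catches up
    }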

Does this sound reasonable? I'll start preparing an update to the design doc.

On 3/11/20 11:53 AM, Casey Bodley wrote:
Well shoot. It turns out that some critical pieces of this design were based on a flawed assumption about the datalog; namely, that all changes for a particular bucket would map to the same datalog shard. I now see that writes to different shards of a bucket are mapped to different datalog shards in RGWDataChangesLog::choose_oid().

We use cls_locks on these datalog shards so that several gateways can sync the shards independently. But this isn't compatible with several elements of this resharding design that require bucket-wide coordination:

* changing bucket full sync from per-bucket-shard to per-bucket

* moving bucket sync status from per-bucket-shard to per-bucket

* waiting for incremental sync of each bucket shard to complete before advancing to the next log generation


On 2/25/20 2:53 PM, Casey Bodley wrote:
I opened a pull request https://github.com/ceph/ceph/pull/33539 with a design doc for dynamic resharding in multisite. Please review and give feedback, either here or in PR comments!

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


