One option would be to change the mapping in choose_oid() so that it
doesn't spread bucket shards. But doing this in a way that's
backward-compatible with existing datalogs would be complicated and
probably require a lot of extra I/O (for example, writing all datalog
entries on the 'wrong' shard to the error repo of the new shard so they
get retried there). This spreading is also an important scaling
property, since it allows gateways to share and parallelize a sync
workload that's clustered around a small number of buckets.
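For reference, the mapping is roughly like the sketch below. The hash, the shard count, and the names are illustrative stand-ins, not the actual RGW code:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>

constexpr uint32_t num_datalog_shards = 128;  // stand-in for rgw_data_log_num_shards

uint32_t choose_datalog_shard(const std::string& bucket_name, int bucket_shard_id)
{
  const auto h = static_cast<uint32_t>(std::hash<std::string>{}(bucket_name));
  // adding the bucket shard id is what spreads one bucket's shards across
  // different datalog shards
  return (h + static_cast<uint32_t>(std::max(bucket_shard_id, 0))) % num_datalog_shards;
}

Because the bucket shard id feeds the result, different shards of the same bucket generally land on different datalog shards, which is exactly the spreading described above.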
So I think the better option is to minimize this bucket-wide
coordination, and provide the locking/atomicity necessary for the
coordination that remains
across datalog shards. The common case (incremental sync of a bucket
shard) should require no synchronization, so we should continue to track
that incremental sync status in per-bucket-shard objects instead of
combining them into a single per-bucket object.
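As a rough sketch, the per-bucket-shard object would hold just that shard's incremental marker (field names here are hypothetical, not the actual RGW structures):

#include <cstdint>
#include <string>

// illustrative per-bucket-shard incremental sync status, one rados object
// per bucket shard, so the common incremental case touches nothing shared
struct bucket_shard_inc_sync_status {
  std::string inc_marker;   // position in this bucket shard's index log
  uint64_t generation = 0;  // which log generation the marker belongs to
};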
We do still need some per-bucket status to track the full sync progress
and the current incremental sync generation.
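Something like the following, again with hypothetical names:

#include <cstdint>
#include <set>
#include <string>

enum class BucketSyncState { Init, FullSync, Incremental, Stopped };

// illustrative per-bucket sync status: full vs incremental state, coarse
// full sync progress, the current incremental generation, and which shards
// have finished that generation
struct bucket_sync_status {
  BucketSyncState state = BucketSyncState::Init;
  std::string full_sync_marker;    // coarse full sync progress
  uint64_t incremental_gen = 0;    // generation currently being synced
  std::set<uint32_t> shards_done;  // shards finished with incremental_gen
};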
If the bucket status is in full sync, we can take a cls_lock on that
bucket status object so that only one shard runs the full sync, while
shards that fail to get the lock go to the error repo for retry.
Once full sync completes, shards can go directly to incremental sync
without a lock.
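A minimal sketch of that gating, building on the bucket_sync_status sketch above (take_full_sync_lock() stands in for taking a cls_lock exclusive lock on the per-bucket status object; the other names are hypothetical too):

#include <cstdint>
#include <string>

bool take_full_sync_lock(const std::string& status_oid)
{
  // stub: the real code would take cls_lock's exclusive lock on the
  // per-bucket sync status object and return false on -EBUSY
  return true;
}

// returns true if this datalog entry made progress, false if it should be
// written to the error repo and retried later
bool sync_bucket_shard(const std::string& bucket, uint32_t shard,
                       bucket_sync_status& status)
{
  if (status.state == BucketSyncState::FullSync) {
    if (!take_full_sync_lock("bucket.sync-status." + bucket)) {
      return false;  // another shard is running full sync; retry via error repo
    }
    // ... run full sync for the whole bucket, then flip to incremental ...
    status.state = BucketSyncState::Incremental;
  }
  // ... run incremental sync for (bucket, shard); no bucket-wide lock needed ...
  return true;
}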
To coordinate the transitions across generations (reshard events), we
need some new logic when incremental sync on one shard reaches the end
of its log. Instead of cls_lock here, we can use cls_version to do a
read-modify-write of the bucket status object. This will either add our
shard to the set of completed shards for the current generation, or (if
all other shards have already completed) increment the generation number
so we can start processing incremental sync on the shards of the next
generation.
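Roughly, that end-of-log step could look like the sketch below (again building on bucket_sync_status; the two helpers are stand-ins for the cls_version-guarded read and write, not real RGW functions):

#include <cstdint>
#include <string>

bool read_status_with_version(const std::string& oid, bucket_sync_status* status,
                              uint64_t* version)
{
  // stub: the real code would read the status object together with its
  // cls_version object version
  *status = bucket_sync_status{};
  *version = 0;
  return true;
}

bool write_status_if_version(const std::string& oid, const bucket_sync_status& status,
                             uint64_t expected_version)
{
  // stub: the real code would guard the write with cls_version so it fails
  // if the object version changed since our read
  return true;
}

void on_shard_reached_end_of_log(const std::string& status_oid, uint32_t shard,
                                 uint32_t num_shards)
{
  for (;;) {
    bucket_sync_status status;
    uint64_t version = 0;
    if (!read_status_with_version(status_oid, &status, &version)) {
      return;  // read failed; retry later via the error repo
    }
    status.shards_done.insert(shard);
    if (status.shards_done.size() == num_shards) {
      // all shards finished this generation; advance to the next one
      ++status.incremental_gen;
      status.shards_done.clear();
    }
    if (write_status_if_version(status_oid, status, version)) {
      return;  // our read-modify-write won
    }
    // version mismatch: another shard raced us, so re-read and try again
  }
}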
This also modifies the handling of datalog entries for generations that
are newer than the current generation in the bucket status. In the
previous design draft, we would process these inline by running
incremental sync on all shards of the current generation - but this
would require coordination across datalog shards. So instead, each
unfinished shard in the bucket status can be written to the
corresponding datalog shard's error repo so it gets retried there. And
once these retries are successful, they will drive the transition to the
next generation and allow the original datalog entry to retry and make
progress.
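To make that concrete, here's a sketch of the entry handling, reusing the hypothetical choose_datalog_shard() and bucket_sync_status from the earlier sketches (write_error_repo_entry() is another illustrative stand-in):

#include <cstdint>
#include <string>

void write_error_repo_entry(uint32_t datalog_shard, const std::string& bucket,
                            uint32_t bucket_shard, uint64_t gen)
{
  // stub: the real code would add a retry entry to that datalog shard's
  // error repo
}

// returns true if the datalog entry made progress, false if it should retry
bool handle_datalog_entry(const std::string& bucket, uint32_t bucket_shard,
                          uint64_t entry_gen, const bucket_sync_status& status,
                          uint32_t num_shards)
{
  if (entry_gen <= status.incremental_gen) {
    // normal path: sync this one bucket shard at the current generation
    return true;
  }
  // the entry is from a newer generation: push every shard that hasn't
  // finished the current generation to its own datalog shard's error repo,
  // so those retries drive the generation forward
  for (uint32_t shard = 0; shard < num_shards; ++shard) {
    if (status.shards_done.count(shard) == 0) {
      write_error_repo_entry(choose_datalog_shard(bucket, shard),
                             bucket, shard, status.incremental_gen);
    }
  }
  return false;  // retry this entry once the generation has advanced
}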
Does this sound reasonable? I'll start preparing an update to the design
doc.
On 3/11/20 11:53 AM, Casey Bodley wrote:
Well shoot. It turns out that some critical pieces of this design were
based on a flawed assumption about the datalog: namely, that all
changes for a particular bucket would map to the same datalog shard. I
now see that writes to different shards of a bucket are mapped to
different shards in RGWDataChangesLog::choose_oid().
We use cls_locks on these datalog shards so that several gateways can
sync the shards independently. But this isn't compatible with several
elements of this resharding design that require bucket-wide coordination:
* changing bucket full sync from per-bucket-shard to per-bucket
* moving bucket sync status from per-bucket-shard to per-bucket
* waiting for incremental sync of each bucket shard to complete before
advancing to the next log generation
On 2/25/20 2:53 PM, Casey Bodley wrote:
I opened a pull request https://github.com/ceph/ceph/pull/33539 with
a design doc for dynamic resharding in multisite. Please review and
give feedback, either here or in PR comments!