On Mon, Mar 23, 2020 at 11:37 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>
> On Mon, Mar 23, 2020 at 8:50 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 23, 2020 at 1:59 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> >>
> >> Sorry for the late reply, this went under my radar as I was looking
> >> into similar design issues.
> >>
> >> I'm looking into reducing the complexity of dealing with separate
> >> stages for full sync and incremental sync. It seems to me that we can
> >> maybe simplify things by moving that knowledge to the source. Instead
> >> of the target behaving differently on incremental vs full sync, we can
> >> create an api that would have the source send either the bucket
> >> listing or the bilog entries using a common result structure. The
> >> marker can be used by the source to distinguish the state and to keep
> >> other information.
> >> This will allow reducing complexity on the sync side, and we can do
> >> similar stuff for the metadata sync too. Moreover, it will later allow
> >> for defining different data sources that don't have a bilog and whose
> >> scheme might be different.
> >> This is just the general idea. There are a lot of details to clear up,
> >> and I glossed over some of the complexities. However, what difference
> >> will such a change make to your design?
> >
> > In this resharding design, full and incremental sync are dimensioned differently, with full sync being per-bucket and incremental sync per-bilog-shard. So if you want to abstract them into a single pipe, it would probably need to be one per bucket.
> >
>
> The idea I had in mind was to give the source the ability to send
> information about changes in the number of active shards at each
> stage. For example, let's say we start with 1 active shard (for full
> sync), then switch to 17 (incremental), then reshard to 23:
>
> Target -> Source: init
> S -> Target: {
>   now = {
>     num_shards = 1,
>     markers = [ ... ],
>   }
>   next = {
>     num_shards = 17,
>     markers = [ ... ],
>   }
> }
>
> S -> T: fetch
> T -> S: data
> ...
> S -> T: fetch
> T -> S: complete, next!

Sorry, got these reversed (and confusing). Should be:

T -> S: fetch
T <- S: data
...
T -> S: fetch
T <- S: complete, next!

>
> Then we'll move to the next stage and start processing the new number
> of shards with the appropriate markers. When a reshard happens, the
> source will send back a notice that next.num_shards is now 23, with
> the corresponding markers. The next stage will only start after all
> the shards of the previous stage have been completed.
>
> > I like the idea though. It definitely could simplify sync in the long run, but in the short term we would still need to sync from Octopus source zones, so a lot of the complexity would come from backward compatibility.
> >
>
> For backward compatibility, we can create a module on the target side
> that would produce the needed information on behalf of a source that
> doesn't support the new api.
>
> Yehuda
>
> > Yehuda
> >
> >>
> >> Yehuda
> >>
> >> On Thu, Mar 12, 2020 at 9:16 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >> >
> >> > One option would be to change the mapping in choose_oid() so that it
> >> > doesn't spread bucket shards. But doing this in a way that's
> >> > backward-compatible with existing datalogs would be complicated and
> >> > probably require a lot of extra I/O (for example, writing all datalog
> >> > entries on the 'wrong' shard to the error repo of the new shard so they
> >> > get retried there). This spreading is an important scaling property,
> >> > allowing gateways to share and parallelize a sync workload that's
> >> > clustered around a small number of buckets.
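
To make the mapping trade-off above concrete, here's a minimal, hypothetical
sketch of the two strategies. It is not the actual RGWDataChangesLog::choose_oid()
code, just an illustration of how including (or excluding) the bucket shard id
in the hash decides whether a bucket's shards spread across datalog shards:

// Hypothetical illustration only -- not the actual RGWDataChangesLog code.
#include <cstdint>
#include <functional>
#include <string>

// Spread mapping (roughly what the current behavior implies): the bucket
// shard id participates in the hash, so shards of one bucket land on
// different datalog shards.
uint32_t spread_mapping(const std::string& bucket_id, int bucket_shard,
                        uint32_t num_datalog_shards) {
  std::hash<std::string> h;
  return h(bucket_id + ":" + std::to_string(bucket_shard)) % num_datalog_shards;
}

// The alternative rejected above: hash the bucket id only, so every shard of
// a bucket maps to the same datalog shard (simpler coordination, no spreading).
uint32_t pinned_mapping(const std::string& bucket_id,
                        uint32_t num_datalog_shards) {
  std::hash<std::string> h;
  return h(bucket_id) % num_datalog_shards;
}

With the spread mapping, sync work for a hot bucket fans out across datalog
shards (and thus across the gateways holding their cls_locks), which is the
scaling property the paragraph above wants to keep.
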
> >> >
> >> > So I think the better option is to minimize this bucket-wide
> >> > coordination and provide the locking/atomicity necessary to do it
> >> > across datalog shards. The common case (incremental sync of a bucket
> >> > shard) should require no synchronization, so we should continue to
> >> > track incremental sync status in per-bucket-shard objects instead of
> >> > combining them into a single per-bucket object.
> >> >
> >> > We do still need some per-bucket status to track the full sync progress
> >> > and the current incremental sync generation.
> >> >
> >> > If the bucket status is in full sync, we can take a cls_lock on that
> >> > bucket status object to allow one shard to run the full sync, while
> >> > shards that fail to get the lock go to the error repo for retry.
> >> > Once full sync completes, shards can go directly to incremental sync
> >> > without a lock.
> >> >
> >> > To coordinate the transitions across generations (reshard events), we
> >> > need some new logic for when incremental sync on one shard reaches the
> >> > end of its log. Instead of cls_lock here, we can use cls_version to do
> >> > a read-modify-write of the bucket status object. This will either add
> >> > our shard to the set of completed shards for the current generation,
> >> > or (if all other shards have already completed) increment the
> >> > generation number so we can start processing incremental sync on the
> >> > shards of that new generation.
> >> >
> >> > This also modifies the handling of datalog entries for generations
> >> > newer than the current generation in the bucket status. In the
> >> > previous design draft, we would process these inline by running
> >> > incremental sync on all shards of the current generation - but this
> >> > would require coordination across datalog shards. So instead, each
> >> > unfinished shard in the bucket status can be written to the
> >> > corresponding datalog shard's error repo so it gets retried there.
> >> > Once these retries succeed, they will drive the transition to the
> >> > next generation and allow the original datalog entry to retry and
> >> > make progress.
> >> >
> >> > Does this sound reasonable? I'll start preparing an update to the
> >> > design doc.
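
To make the cls_version step a bit more concrete, here is a rough sketch of
the read-modify-write loop when a shard reaches the end of its log for the
current generation. The BucketSyncStatus layout and the read_bucket_status()/
write_bucket_status() helpers are hypothetical stand-ins for the
cls_version-guarded read and write (they are not the real RGW API); only the
retry-on-conflict shape is the point:

#include <cerrno>
#include <cstdint>
#include <set>
#include <string>

struct BucketSyncStatus {
  uint64_t incremental_gen = 0;   // current incremental sync generation
  uint32_t num_shards = 0;        // shard count of that generation
  std::set<uint32_t> completed;   // shards that reached the end of their log
};

// hypothetical wrappers around the cls_version-guarded read and write;
// write_bucket_status() fails with -ECANCELED if the version tag changed
int read_bucket_status(const std::string& bucket, BucketSyncStatus* status,
                       std::string* version_tag);
int write_bucket_status(const std::string& bucket,
                        const BucketSyncStatus& status,
                        const std::string& version_tag);

// Called when incremental sync on 'shard' reaches the end of its log.
int on_shard_log_complete(const std::string& bucket, uint32_t shard) {
  for (;;) {
    BucketSyncStatus status;
    std::string tag;
    int r = read_bucket_status(bucket, &status, &tag);
    if (r < 0) return r;

    status.completed.insert(shard);
    if (status.completed.size() == status.num_shards) {
      // we were the last shard: advance to the next generation
      status.incremental_gen++;
      status.completed.clear();
      // num_shards for the new generation would be refreshed from the
      // reshard metadata; omitted here
    }

    r = write_bucket_status(bucket, status, tag);
    if (r == -ECANCELED) continue;  // lost the race; re-read and retry
    return r;
  }
}

A failed guarded write just means another shard raced us and updated the
status first, so re-reading and retrying is enough; no cls_lock is needed on
this path.
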
> >> >
> >> > On 3/11/20 11:53 AM, Casey Bodley wrote:
> >> > > Well shoot. It turns out that some critical pieces of this design were
> >> > > based on a flawed assumption about the datalog: namely, that all
> >> > > changes for a particular bucket would map to the same datalog shard. I
> >> > > now see that writes to different shards of a bucket are mapped to
> >> > > different datalog shards in RGWDataChangesLog::choose_oid().
> >> > >
> >> > > We use cls_locks on these datalog shards so that several gateways can
> >> > > sync the shards independently. But this isn't compatible with several
> >> > > elements of this resharding design that require bucket-wide coordination:
> >> > >
> >> > > * changing bucket full sync from per-bucket-shard to per-bucket
> >> > >
> >> > > * moving bucket sync status from per-bucket-shard to per-bucket
> >> > >
> >> > > * waiting for incremental sync of each bucket shard to complete before
> >> > > advancing to the next log generation
> >> > >
> >> > > On 2/25/20 2:53 PM, Casey Bodley wrote:
> >> > >> I opened a pull request https://github.com/ceph/ceph/pull/33539 with
> >> > >> a design doc for dynamic resharding in multisite. Please review and
> >> > >> give feedback, either here or in PR comments!
> >> > >>
> >>
> >> _______________________________________________
> >> Dev mailing list -- dev@xxxxxxx
> >> To unsubscribe send an email to dev-leave@xxxxxxx