Re: rgw: resharding buckets without blocking write ops

On Thu, Aug 29, 2019 at 10:00 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> sharing a design for feedback. please let me know if you spot any other
> races, issues or optimizations!
>
> current resharding steps (rough code sketch below):
> 1) copy the 'source' bucket instance into a new 'target' bucket instance
> with a new instance id
> 2) flag all source bucket index shards with RESHARD_IN_PROGRESS
> 3) flag the source bucket instance with RESHARD_IN_PROGRESS
> 4) list all omap entries in the source bucket index shards
> (cls_rgw_bi_list) and write each entry to its target bucket index shard
> (cls_rgw_bi_put)
> 5a) on success: link target bucket instance, delete source bucket index
> shards, delete source bucket instance
> 5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index
> shards, delete target bucket index shards, delete target bucket instance
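>
> in rough C++ pseudocode (all types and helpers here are toy
> stand-ins for illustration, not the real rgw interfaces):
>
>   #include <vector>
>
>   // toy model of a bucket instance and its index shards
>   struct Shard  { bool resharding = false; };
>   struct Bucket { bool resharding = false; std::vector<Shard> shards; };
>
>   // step 4, stubbed out: bi_list each source shard, bi_put into target
>   bool copy_entries(const Bucket&, Bucket&) { return true; }
>
>   int reshard(Bucket& source, int target_shard_count) {
>     Bucket target;                        // 1) new instance id
>     target.shards.resize(target_shard_count);
>     for (auto& s : source.shards)         // 2) flag source index shards
>       s.resharding = true;
>     source.resharding = true;             // 3) flag source instance
>     if (copy_entries(source, target))     // 4) copy the omap entries
>       return 0;  // 5a) link target, delete source shards + instance
>     for (auto& s : source.shards)         // 5b) unwind on failure
>       s.resharding = false;
>     return -1;
>   }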
>
> the current blocking strategy is enforced on the source bucket index
> shards. any write operations received by cls_rgw while the
> RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING.
> radosgw handles these errors by waiting/polling until the reshard
> finishes, then it resends the operation to the new target bucket index
> shard.
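>
> the radosgw side is roughly (sketch; the stubs stand in for the
> real calls):
>
>   // toy error code and stubs; the real definitions live in rgw/cls_rgw
>   constexpr int ERR_BUSY_RESHARDING = 2300;   // illustrative value
>   struct Op { int bucket = 0; };
>   int  send_to_shard(const Op&) { return 0; } // stub: index shard write
>   void wait_for_reshard(int)    {}            // stub: wait/poll on flag
>   void refresh_instance(int)    {}            // stub: re-read bucket info
>
>   int send_index_op(const Op& op) {
>     for (;;) {
>       int r = send_to_shard(op);
>       if (r != -ERR_BUSY_RESHARDING)
>         return r;                   // success, or an unrelated error
>       wait_for_reshard(op.bucket);  // reshard in progress: wait/poll
>       refresh_instance(op.bucket);  // then resend to the new target
>     }
>   }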
>
> to avoid blocking write ops during a reshard, we could instead apply
> their bucket index operations to both the source and target bucket index
> shards in parallel. this includes both the 'prepare' op to start the
> transaction and the asynchronous 'complete' op to commit (a sketch of
> the dual apply follows the list of races below). allowing both
> buckets to mutate during reshard introduces several new races:
>
> I) between steps (2) and (3), radosgw doesn't yet see the
> RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know to
> send the extra index operations to the target bucket index shard
>
> II) operations applied on the target bucket index shards could be
> overwritten by the omap entries copied from the source bucket index
> shards in step (4)
>
> III) radosgw sends a 'prepare' op to the source bucket index shard
> before step (2), then sends the async 'complete' op to the source bucket
> index shard after (2). before step (5), this complete op would fail with
> ERR_BUSY_RESHARDING. after step (5), it would fail with ENOENT. since
> the complete is async, and we've already replied to the client, it's too
> late for any recovery
>
> IV) radosgw sends an operation to both the source and target bucket
> index shards that races with (5) and fails with ENOENT on either the
> source shard (5a) or the target shard (5b)
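>
> the dual apply referenced above is roughly (sketch with toy types;
> shown sequentially, though the two ops can be issued in parallel):
>
>   struct Shard { int id = 0; };
>   int prepare(Shard&) { return 0; }  // stub: the 'prepare' index op
>
>   // start the transaction on both indexes so that neither copy
>   // misses the in-flight operation; 'complete' fans out the same way
>   int prepare_dual(Shard& source, Shard& target) {
>     if (int r = prepare(source); r < 0)
>       return r;
>     return prepare(target);
>   }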
>
>
> introducing a new generation number or 'reshard_epoch' to each bucket
> that increments on a reshard attempt can help to resolve these races. so
> in step (2), the call to cls_rgw_set_bucket_resharding() would also
> increment the bucket index shard's reshard_epoch. similarly, step (3)
> would increment the bucket instance's reshard_epoch.
>
> to resolve the race in (I), cls_rgw would reject bucket index operations
> with a reshard_epoch older than the one stored in the bucket index
> shard. this ERR_BUSY_RESHARDING error would direct radosgw to re-read
> its bucket instance, detect the reshard in progress, and resend the
> operation to both the source and target bucket index shards with the
> updated reshard_epoch
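>
> the objclass-side guard would look something like (sketch):
>
>   #include <cstdint>
>
>   constexpr int ERR_BUSY_RESHARDING = 2300;  // illustrative value
>
>   struct ShardHeader { uint64_t reshard_epoch = 0; };
>
>   // reject ops stamped with a stale epoch; radosgw reacts by
>   // re-reading the bucket instance and resending with the new epoch
>   int check_epoch(const ShardHeader& hdr, uint64_t op_epoch) {
>     if (op_epoch < hdr.reshard_epoch)
>       return -ERR_BUSY_RESHARDING;
>     return 0;
>   }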
>
> to resolve the race in (II), cls_rgw_bi_put() would have to test whether
> the given key exists before overwriting
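>
> i.e. a create-if-absent put, modeled here on a std::map standing in
> for the shard's omap (sketch):
>
>   #include <map>
>   #include <string>
>
>   using Omap = std::map<std::string, std::string>;  // omap model
>
>   // step (4) copies entries with this instead of a blind put, so an
>   // entry written directly to the target during reshard wins over
>   // the copy from the source
>   void bi_put_if_absent(Omap& target, const std::string& key,
>                         const std::string& val) {
>     target.emplace(key, val);  // no-op when the key already exists
>   }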
>
> the race in (III) is benign, because the 'prepared' entry was reliably
> stored in the source shard before reshard, so we're guaranteed to see a
> copy on the target shard. even though the 'complete' operation isn't
> applied, the dir_suggest mechanism will detect the incomplete
> transaction and repair the index the next time the target bucket is listed
>
> the race in (IV) can be treated as a success if the operation succeeds
> on the target bucket index shard. if it fails on the target shard,
> radosgw needs to re-read the bucket entrypoint and instance to retarget
> the operation
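>
> sketch of that recovery path (toy stubs again):
>
>   struct Op { int bucket = 0; };
>   int  send_to_source(const Op&) { return 0; }  // stub
>   int  send_to_target(const Op&) { return 0; }  // stub
>   void refresh_entrypoint_and_instance(int) {}  // stub: re-read both
>   int  resend(const Op&) { return 0; }          // stub: retargeted op
>
>   int send_during_reshard(const Op& op) {
>     send_to_source(op);   // ENOENT here is fine if the target op
>                           // succeeds (5a already deleted the source)
>     int r = send_to_target(op);
>     if (r == 0)
>       return 0;           // applied on the target: success
>     refresh_entrypoint_and_instance(op.bucket);  // (5b) or reshard done
>     return resend(op);    // retarget and resend the operation
>   }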
>
>
> one thing this strategy cannot handle is versioned buckets. some index
> operations for versioning (namely cls_rgw_bucket_link_olh and
> cls_rgw_bucket_unlink_instance) involve writes to two or more related
> omap entries. because step (4) copies over single omap entries, it can't
> preserve the consistency of these relationships once we allow mutation.
> so we'd need to stick with the blocking strategy for versioned buckets

In general I think it could work. How about this: instead of sending
operations to both source and target, extend the bucket index
representation so that it allows for a changes overlay. The idea is
that changes would still be sent to the source while resharding (for
part of the process at least), but would be applied as an overlay, so
that the changes made during that period can be tracked.
The overlay info would be kept in a separate key namespace within the
source bucket index, and it could be listed separately. Potentially
the original keys could also include the latest overlay information
so that regular listing could still work the same.
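
A rough model of the key layout, with a std::map standing in for one
shard's omap (the prefix is just illustrative):

  #include <map>
  #include <string>

  // overlay entries live under a reserved prefix, so they sort into
  // their own range of the shard's omap and can be listed separately
  using Omap = std::map<std::string, std::string>;
  const std::string OVERLAY = "\x80" "ovl.";   // illustrative prefix

  void write_during_reshard(Omap& shard, const std::string& key,
                            const std::string& val) {
    shard[OVERLAY + key] = val;   // tracked change; base key untouched
  }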

It would work roughly like this:

Resharding:
1. Init
 - Set resharding flag on source
 - Set source info on target

Now all writes will still go to the source; however, the objclass
will write the entries into the overlay. Listing will still go to the
source. The objclass listing operation will be able to return the
latest entries (base + overlay), only the base, or only the overlay.
Note that the overlay might include negative entries.
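
A sketch of the merged listing, with a reserved value standing in for
a negative entry (both markers illustrative):

  #include <map>
  #include <string>

  using Omap = std::map<std::string, std::string>;
  const std::string OVERLAY  = "\x80" "ovl.";  // illustrative prefix
  const std::string NEGATIVE = "\xff";         // illustrative tombstone

  // merged view = base entries overridden by overlay entries; a
  // negative overlay entry hides the base key entirely
  Omap list_merged(const Omap& shard) {
    Omap out;
    for (const auto& [k, v] : shard)           // base entries first
      if (k.rfind(OVERLAY, 0) != 0)            // not an overlay key
        out[k] = v;
    for (const auto& [k, v] : shard) {         // then apply overlay
      if (k.rfind(OVERLAY, 0) != 0)
        continue;
      std::string key = k.substr(OVERLAY.size());
      if (v == NEGATIVE)
        out.erase(key);                        // negative entry
      else
        out[key] = v;
    }
    return out;
  }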

2. Reshard (I)

Reshard the same as before. Source bi listing will be done on the
base only, so that it doesn't include the changes accumulating in the
overlay.

3. Reshard (II)

 - Set a flag on all the source bi shards that rejects any further
writes to them. rgw will now need to send writes to the target;
however, it will first need to apply any overlay entry related to
that write (if applicable) before applying it on the target.
 - Listing will be extended to allow iterating over both the source
overlay and the target.
 - Iterate over the source bi overlay and try to apply the entries on
the target (this needs to verify that the target's entry is older;
see the sketch after this list). Target object removals at this point
will not completely remove keys, but rather keep a negative entry, so
that the target still has the required info to apply the source
overlay.
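
A sketch of that replay, assuming each entry carries a version
counter (hypothetical here) that says which of two entries is newer:

  #include <cstdint>
  #include <map>
  #include <string>

  // toy entry: 'ver' stands in for whatever ordering information the
  // real index entries carry
  struct Entry { uint64_t ver = 0; bool negative = false; };
  using Index = std::map<std::string, Entry>;

  // replay one source-overlay entry onto the target shard; the target
  // keeps its own entry when it is already newer
  void apply_overlay(Index& target, const std::string& key,
                     const Entry& ovl) {
    auto it = target.find(key);
    if (it != target.end() && it->second.ver >= ovl.ver)
      return;                 // target already has newer data
    target[key] = ovl;        // removals stay as negative entries
  }                           // until finalize squashes them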

4. Finalize
 - Reset resharding flag on target
 - Link target
 - Iterate over target and squash negative entries (sketch below)
 - Delete source bi
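
Squashing at finalize then just drops the tombstones for real
(sketch, same toy types as above):

  #include <cstdint>
  #include <map>
  #include <string>

  struct Entry { uint64_t ver = 0; bool negative = false; };
  using Index = std::map<std::string, Entry>;

  // the negative entries have served their purpose (ordering against
  // the source overlay), so remove them from the target index
  void squash_negative(Index& target) {
    for (auto it = target.begin(); it != target.end(); ) {
      if (it->second.negative)
        it = target.erase(it);
      else
        ++it;
    }
  }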

Notes:
 - The original resharding process will be modified a bit, but will
still be used for the first phase.
 - objclass changes: overlay functionality (writes, listings, bi*),
negative entries.
 - rgw core changes: listing needs to support merging with the
overlay; bucket index writes in the second phase need to apply the
relevant overlay key(s).
 - Versioning should work.
 - Reshard cancellation is not complicated: it needs to squash the
overlay, but it can be done without locking.


Yehuda
