Hi Casey,
Sorry for the slow reply; too many things were going on at once and
the days slipped by faster than I realized! So overall I get the idea
behind this, but I keep finding myself worried that it's trading one
problem (blocking writes during reshard) for others (complexity, more
metadata reads/writes, incompatibility with versioned buckets, etc.).
I'm not
sure I have much to add at this point. I wonder if there's some kind of
hybrid of (mostly) static hash-based sharding and range-based sharding
we could do that would let us avoid blocking writes to all shards and
instead only block writes to smaller selections of the DB for shorter
periods of time. That would be a much bigger change but maybe would
sort of fall in line with some of the changes Matt has talked about for
preserving ordering for faster bucket listing? Might be totally
infeasible, but that's sort of the direction my brain headed after
reading your proposal.
Anyway, that's it for now, but I'll try to keep thinking about it in the
background.
Mark
On 8/29/19 11:59 AM, Casey Bodley wrote:
sharing a design for feedback. please let me know if you spot any
other races, issues or optimizations!
current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket
instance with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards
(cls_rgw_bi_list) and write each entry to its target bucket index
shard (cls_rgw_bi_put)
5a) on success: link target bucket instance, delete source bucket
index shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index
shards, delete target bucket index shards, delete target bucket instance
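for reference, here's a rough python sketch of that flow; the
dict-based data model and the way target shards are picked are
illustrative stand-ins, not the real cls_rgw/librados calls:

    # simplified, self-contained sketch of the steps above; the data
    # model (plain dicts) and shard selection are made up for illustration
    def new_shard():
        return {'resharding': False, 'omap': {}}

    def reshard(source_instance, source_shards, num_target_shards):
        target_instance = dict(source_instance, instance_id='new-id')  # step 1
        target_shards = [new_shard() for _ in range(num_target_shards)]
        for shard in source_shards:                                    # step 2
            shard['resharding'] = True
        source_instance['resharding'] = True                           # step 3
        try:
            for shard in source_shards:                                # step 4
                for key, entry in shard['omap'].items():               # bi_list
                    target = target_shards[hash(key) % num_target_shards]
                    target['omap'][key] = entry                        # bi_put
            # step 5a: link target_instance, delete source shards + instance
            return target_instance, target_shards
        except Exception:
            # step 5b: clear the flags, delete target shards + instance
            for shard in source_shards:
                shard['resharding'] = False
            raise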
the current blocking strategy is enforced on the source bucket index
shards. any write operations received by cls_rgw while the
RESHARD_IN_PROGRESS flag is set are rejected with ERR_BUSY_RESHARDING.
radosgw handles these errors by waiting/polling until the reshard
finishes, then it resends the operation to the new target bucket index
shard.
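roughly, both sides of that behavior look like the following sketch;
the polling loop and names are illustrative, not the actual
cls_rgw/radosgw code:

    # sketch of the current blocking behavior
    import time

    ERR_BUSY_RESHARDING = 'ERR_BUSY_RESHARDING'

    def index_write(shard, key, entry):
        # cls_rgw side: reject writes while the shard is flagged
        if shard['resharding']:
            return ERR_BUSY_RESHARDING
        shard['omap'][key] = entry
        return 0

    def radosgw_index_write(get_current_shard, key, entry, poll_interval=0.1):
        # radosgw side: wait/poll until the reshard finishes, then resend
        # to whatever shard the bucket now resolves to (the new target)
        while True:
            ret = index_write(get_current_shard(key), key, entry)
            if ret != ERR_BUSY_RESHARDING:
                return ret
            time.sleep(poll_interval)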
to avoid blocking write ops during a reshard, we could instead apply
their bucket index operations to both the source and target bucket
index shards in parallel. this includes both the 'prepare' op to start
the transaction, and the asynchronous 'complete' to commit (see the
sketch after the list of races below). allowing both buckets to mutate
during reshard introduces several new races:
I) between steps (2) and (3), radosgw doesn't yet see the
RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know
to send the extra index operations to the target bucket index shard
II) operations applied on the target bucket index shards could be
overwritten by the omap entries copied from the source bucket index
shards in step (4)
III) radosgw sends a 'prepare' op to the source bucket index shard
before step (2), then sends the async 'complete' op to the source
bucket index shard after (2). before step (5), this complete op would
fail with ERR_BUSY_RESHARDING. after step (5), it would fail with
ENOENT. since the complete is async, and we've already replied to the
client, it's too late for any recovery
IV) radosgw sends an operation to both the source and target bucket
index shards that races with (5) and fails with ENOENT on either the
source shard (5a) or the target shard (5b)
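back to the dual-apply idea above, the write path during a reshard
might look roughly like this; the two-phase prepare/complete is
collapsed into plain dict updates and the names are illustrative:

    # sketch of the non-blocking write path: while a reshard is in
    # progress, both the 'prepare' and the async 'complete' are applied
    # to the source shard and to the corresponding target shard
    def prepare(shard, key, tag):
        shard['omap'].setdefault(key, {})['pending'] = tag

    def complete(shard, key, tag, entry):
        shard['omap'][key] = {'entry': entry}   # commit, drop the pending marker

    def index_transaction(instance, source_shard, target_shard, key, tag, entry):
        shards = [source_shard]
        if instance.get('resharding'):
            shards.append(target_shard)          # dual-apply during reshard
        for shard in shards:
            prepare(shard, key, tag)             # synchronous, before the data write
        # ... write head/tail objects, reply to the client ...
        for shard in shards:
            complete(shard, key, tag, entry)     # asynchronous commit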
introducing a new generation number or 'reshard_epoch' to each bucket
that increments on a reshard attempt can help to resolve these races.
so in step (2), the call to cls_rgw_set_bucket_resharding() would also
increment the bucket index shard's reshard_epoch. similarly, step (3)
would increment the bucket instance's reshard_epoch.
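concretely, something like the following, where the reshard_epoch field
name and the dict model are illustrative:

    # sketch: the set-resharding call in steps (2) and (3) would bump a
    # reshard_epoch on the shard or instance in addition to setting the flag
    def set_bucket_resharding(obj):
        obj['resharding'] = True
        obj['reshard_epoch'] = obj.get('reshard_epoch', 0) + 1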
to resolve the race in (I), cls_rgw would reject bucket index
operations with a reshard_epoch older than the one stored in the
bucket index shard. this ERR_BUSY_RESHARDING error would direct
radosgw to re-read its bucket instance, detect the reshard in
progress, and resend the operation to both the source and target
bucket index shards with the updated reshard_epoch
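a sketch of both sides of that check, with illustrative names for the
guard and the retry loop:

    # cls_rgw side rejects stale epochs, radosgw side re-reads and resends
    ERR_BUSY_RESHARDING = 'ERR_BUSY_RESHARDING'

    def guarded_index_op(shard, op_epoch, apply_op):
        # reject ops tagged with a reshard_epoch older than the shard's
        if op_epoch < shard.get('reshard_epoch', 0):
            return ERR_BUSY_RESHARDING
        apply_op(shard)
        return 0

    def send_index_op(read_bucket_instance, shards_for, apply_op):
        # on ERR_BUSY_RESHARDING, re-read the bucket instance, notice the
        # reshard in progress, and resend to both source and target shards
        # with the updated epoch
        instance = read_bucket_instance()
        while True:
            epoch = instance.get('reshard_epoch', 0)
            results = [guarded_index_op(shard, epoch, apply_op)
                       for shard in shards_for(instance)]
            if ERR_BUSY_RESHARDING not in results:
                return results
            instance = read_bucket_instance()    # pick up new epoch and target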
to resolve the race in (II), cls_rgw_bi_put() would have to test
whether the given key exists before overwriting
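i.e. the copy in step (4) would only fill in keys that a racing op
hasn't already written, roughly:

    # sketch: skip keys that already exist on the target shard, so entries
    # written by racing client ops (which are newer) are never clobbered
    def bi_put_if_absent(target_shard, key, entry):
        if key in target_shard['omap']:
            return                      # racing op already wrote a newer entry
        target_shard['omap'][key] = entry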
the race in (III) is benign, because the 'prepared' entry was reliably
stored in the source shard before reshard, so we're guaranteed to see
a copy on the target shard. even though the 'complete' operation isn't
applied, the dir_suggest mechanism will detect the incomplete
transaction and repair the index the next time the target bucket is
listed
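for reference, a heavily simplified sketch of that repair path; the
real dir_suggest handling is more involved (timeouts, checking the head
object), this just shows the shape:

    # entries left with a stale pending 'prepare' are reconciled against
    # the actual object and either completed or removed during listing
    def list_and_repair(shard, object_exists):
        listed = {}
        for key, entry in list(shard['omap'].items()):
            if entry.get('pending'):
                if object_exists(key):
                    entry.pop('pending', None)   # suggest: update/complete
                else:
                    del shard['omap'][key]       # suggest: remove stale entry
                    continue
            listed[key] = entry
        return listed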
the race in (IV) can be treated as a success if the operation succeeds
on the target bucket index shard. if it fails on the target shard,
radosgw needs to re-read the bucket entrypoint and instance to
retarget the operation
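handling (IV) might look roughly like this, with illustrative
return-code handling:

    # sketch: an op that raced with step (5); success on the target shard
    # counts as success, ENOENT on the target means the reshard was aborted
    # (5b) and the op needs to be retargeted
    ENOENT = 'ENOENT'

    def handle_dual_apply_result(source_ret, target_ret, reread_and_resend):
        if target_ret == 0:
            return 0                   # target has the entry; source ENOENT is fine (5a)
        if target_ret == ENOENT:
            return reread_and_resend() # re-read entrypoint/instance, retarget
        return target_ret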
one thing this strategy cannot handle is versioned buckets. some index
operations for versioning (namely cls_rgw_bucket_link_olh and
cls_rgw_bucket_unlink_instance) involve writes to two or more related
omap entries. because step (4) copies over single omap entries, it
can't preserve the consistency of these relationships once we allow
mutation. so we'd need to stick with the blocking strategy for
versioned buckets
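to illustrate, with made-up key names for the olh and instance entries:

    # sketch of why versioned buckets break this: an olh link writes two
    # (or more) related keys together, but step (4) copies keys one at a
    # time, so under concurrent mutation the target can end up with a mix
    # of old and new halves of that relationship
    def link_olh(shard, name, version_id):
        shard['omap']['olh:' + name] = {'current': version_id}
        shard['omap']['instance:' + name + ':' + version_id] = {'linked': True}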
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx