Re: rgw: resharding buckets without blocking write ops

Mark Nelson <mnelson@xxxxxxxxxx> · Tue, 10 Sep 2019 10:47:15 -0500

Ok, going to spit-ball an idea I had while falling asleep last night.  
I'm sure there are tons of corner cases and issues I haven't thought of, 
but maybe it will lead to more discussion so I'll go out on a limb and 
post it in the hopes that maybe it can get folks creative juices 
flowing.  The overall idea is that RGW partitions shards into ranges 
like in typical range-based sharding approaches, but in each range we 
have a number of shards.  Objects are distributed over those shards in a 
(hand-wavy) round-robin-like fashion.  This is done to give us some 
degree of parallelism for writes/deletes within a given range.  The 
number of shards per pool remains (mostly?) static.  The range-pool 
dynamically splits based on the total number of entries which results in 
the associated shards moving half their keys into shards in the new 
range pool. The goal behind all of this would be to limit the impact of 
splits (now range-pool splits) to only a section of the overall bucket 
and cap the number of objects participating in a split.  A second 
simultaneous goal is to allow for write parallelism within a given range 
while not significantly hurting bucket listing times (or if we must, at 
least cap the pain).

Example (using decimal naming for easy of explanation):

Object Name: "15"
Existing range-shard pools: "0-2, 3-6, 7-20"
shards per pool: 8

Reading:
1) Each RGW server has a consistent view of the range-shard pools.
2) Determine the range pool associated with the object.
3) Use hand-wavy-round-robin technique and shards-per-pool count to pick 
the shard.
4) Follow normal RGW procedure to read the object from the shard.
Writing:

1) Each RGW server has a consistent view of the range-shard pools.
2) Determine the range pool associated with the object.
3) Use hand-wavy-round-robin technique and shards-per-pool count to pick 
the shard.
4) Follow normal RGW procedure to write to the shard.

Splitting:

Say that placing object "15" should result in a range split.

1) Lock the bucket and pool "7-20".
2) Modify pool 7-20 into 14-20
3) Create new pool 7-13
4) spawn redistribution of the pool map to RGW servers
4) lock pool 7-13
5) unlock the bucket
6) create new shards for pool 7-13
7a) if handy-wavy-RR provides psuedo-random placement: <placeholder, 
maybe fetch less keys than the whole set>
7b) if hand-wavy-RR provides no uniformity: fetch all keys, sort them, 
and move the last half from 7-13 to 14-20.
8) unlock pools 7-13 and 14-20.

Listing:

1) Find the range pool(s) for the requested range
2a) if hand-wavy-RR provides pseudo-random placement: <placeholder, 
maybe fetch less keys than the whole set>
2b) if hand-wavy-RR provides no uniformity: fetch all keys from each 
shard, sort them, return first N and throw away the rest (like we do now).

Questions:

1) RR based on the name doesn't guarantee uniformity.
2) Halton sequence gets us somewhere between random and uniform placement?
2) Hashing would get us psuedo-random placement.
3) Alternate technique to preserve ordering with deterministic placement 
(Eric? Matt?)
4) How does versioning work in this scheme?
5) What else have I screwed up?

Mark

On 9/9/19 12:33 PM, Mark Nelson wrote:
Hi Casey,

Sorry for the slow reply, too many things going on at once and the 
days slipped by faster than I realized!  So overall I get the idea 
behind this, but I keep finding myself worried that it's trading one 
problem (blocking writes during reshard) for others (complexity, more 
metadata reads/writes, incompatibility with versioned buckets, etc).  
I'm not sure I have much to add at this point.  I wonder if there's 
some kind of hybrid of (mostly) static hash-based sharding and 
range-based sharding we could do that would let us avoid blocking 
writes to all shards and instead only block writes to smaller 
selections of the DB for shorter periods of time.  That would be a 
much bigger change but maybe would sort of fall in line with some of 
the changes Matt has talked about for preserving ordering for faster 
bucket listing?  Might be totally infeasible, but that's sort of the 
direction my brain headed after reading your proposal.

Anyway, that's it for now, but I'll try to keep thinking about it in 
the background.

Mark

On 8/29/19 11:59 AM, Casey Bodley wrote:
sharing a design for feedback. please let me know if you spot any 
other races, issues or optimizations!

current resharding steps:
1) copy the 'source' bucket instance into a new 'target' bucket 
instance with a new instance id
2) flag all source bucket index shards with RESHARD_IN_PROGRESS
3) flag the source bucket instance with RESHARD_IN_PROGRESS
4) list all omap entries in the source bucket index shards 
(cls_rgw_bi_list) and write each entry to its target bucket index 
shard (cls_rgw_bi_put)
5a) on success: link target bucket instance, delete source bucket 
index shards, delete source bucket instance
5b) on failure: reset RESHARD_IN_PROGRESS flag on source bucket index 
shards, delete target bucket index shards, delete target bucket instance

the current blocking strategy is enforced on the source bucket index 
shards. any write operations received by cls_rgw while the 
RESHARD_IN_PROGRESS flag is set are rejected with 
ERR_BUSY_RESHARDING. radosgw handles these errors by waiting/polling 
until the reshard finishes, then it resends the operation to the new 
target bucket index shard.

to avoid blocking write ops during a reshard, we could instead apply 
their bucket index operations to both the source and target bucket 
index shards in parallel. this includes both the 'prepare' op to 
start the transaction, and the asynchronous 'complete' to commit. 
allowing both buckets to mutate during reshard introduces several new 
races:

I) between steps (2) and (3), radosgw doesn't yet see the 
RESHARD_IN_PROGRESS flag in the bucket instance info, so doesn't know 
to send the extra index operations to the target bucket index shard

II) operations applied on the target bucket index shards could be 
overwritten by the omap entries copied from the source bucket index 
shards in step (4)

III) radosgw sends a 'prepare' op to the source bucket index shard 
before step (2), then sends the async 'complete' op to the source 
bucket index shard after (2). before step (5), this complete op would 
fail with ERR_BUSY_RESHARDING. after step (5), it would fail with 
ENOENT. since the complete is async, and we've already replied to the 
client, it's too late for any recovery

IV) radosgw sends an operation to both the source and target bucket 
index shards that races with (5) and fails with ENOENT on either the 
source shard (5a) or the target shard (5b)

introducing a new generation number or 'reshard_epoch' to each bucket 
that increments on a reshard attempt can help to resolve these races. 
so in step (2), the call to cls_rgw_set_bucket_resharding() would 
also increment the bucket index shard's reshard_epoch. similarly, 
step (3) would increment the bucket instance's reshard_epoch.

to resolve the race in (I), cls_rgw would reject bucket index 
operations with a reshard_epoch older than the one stored in the 
bucket index shard. this ERR_BUSY_RESHARDING error would direct 
radosgw to re-read its bucket instance, detect the reshard in 
progress, and resend the operation to both the source and target 
bucket index shards with the updated reshard_epoch

to resolve the race in (II), cls_rgw_bi_put() would have to test 
whether the given key exists before overwriting

the race in (III) is benign, because the 'prepared' entry was 
reliably stored in the source shard before reshard, so we're 
guaranteed to see a copy on the target shard. even though the 
'complete' operation isn't applied, the dir_suggest mechanism will 
detect the incomplete transaction and repair the index the next time 
the target bucket is listed

the race in (IV) can be treated as a success if the operation 
succeeds on the target bucket index shard. if it fails on the target 
shard, radosgw needs to re-read the bucket entrypoint and instance to 
retarget the operation

one thing this strategy cannot handle is versioned buckets. some 
index operations for versioning (namely cls_rgw_bucket_link_olh and 
cls_rgw_bucket_unlink_instance) involve writes to two or more related 
omap entries. because step (4) copies over single omap entries, it 
can't preserve the consistency of these relationships once we allow 
mutation. so we'd need to stick with the blocking strategy for 
versioned buckets
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx