Re: rgw: decoupling the bucket replication log from the bucket index

Thanks for the feedback!

On 12/10/19 1:52 AM, Yehuda Sadeh-Weinraub wrote:
> On Mon, Dec 9, 2019 at 11:35 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>> The bucket index logs used for multisite replication are currently
>> stored in omap on the bucket index shards, along with the rest of the
>> bucket index entries. Storing them in the index was a natural choice,
>> because cls_rgw can write these log entries atomically when completing a
>> bucket index transaction.
>>
>> To replicate a bucket, other zones process each of its bucket index
>> shard logs independently, and store sync status markers with their
>> position in each shard. This tight coupling between the replication
>> strategy and the bucket's sharding scheme is the main challenge to
>> supporting bucket resharding in multisite, because shuffling these log
>> entries to a new set of shards would invalidate the sync status markers
>> stored in other zones.
>
> Note that the new bucket granularity work tackles part of the problem
> and lays the foundation for solving it by managing unbalanced
> replication, where the source bucket and destination bucket have
> different numbers of shards.
> So when a (target) bucket is resharded, we could still craft new
> markers for it to track its original source, even if they don't have
> the same number of shards (this is not implemented, but should be
> relatively easy to do).
> In the other direction, when the source is resharded, the new bucket
> instance is currently handled as a new bucket at the target, so it
> triggers a full sync, but only actual new entries are fetched.
> We can probably find a better solution where we finish syncing the old
> entries from the old instance, and have the new one set to incremental
> sync from the start.

Yeah, I'd like to find an approach that avoids full sync on reshard, because those buckets will tend to be the big ones. But I also don't want to end up in the same situation as bucket deletion, where we can't delete the old bucket index shards until all other zones finish processing its logs.


> That being said, I'm not against decoupling the logs from the index.

>> My proposal, then, is to move the replication logs out of the bucket
>> index shards into a single log per bucket, and extend the consistency
>> model to make up for the lack of atomic writes that we get from cls_rgw.
>
> A sharded log? If it's not sharded, then you're going to introduce an
> object that all IO for that bucket will serialize on under high enough
> pressure (as I assume writes to it will be async).

I agree that the log write latency is important for scalability. It won't impact PutObj performance directly because it's async, but it will increase the average time delta between the index prepares and completes, so bucket listings will tend to do more recovery work with dir suggest.

Sharding wouldn't be my first choice though. It can reduce the write contention by a factor of num-shards, but we'd have to reintroduce dynamic sharding to scale this up for large buckets.

Instead, we can batch up log writes in radosgw until the previous batch finishes. That limits write contention to the number of gateways, rather than the number of PutObj ops. I think batching will be important to get good performance out of cls_fifo anyway, where there's some overhead to discover the approximate append position, resend appends that land on full rados objects, etc.

If we're batching up the log writes, it could also make sense to add a batch interface to cls_rgw for the index completions.
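To make the batching idea concrete, here is a minimal sketch of a per-gateway batcher. The class name, members, and the `WriteFn` callback are all hypothetical stand-ins for illustration, not the actual rgw types: entries that arrive while a batch is in flight are queued and flushed together once the previous write completes, so contention on the log object is bounded by the number of gateways rather than the number of PutObj ops.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-gateway log batcher (illustrative sketch, not real
// rgw code). At most one batch write is in flight at a time; appends
// that arrive meanwhile accumulate and flush together on completion.
class LogBatcher {
 public:
  // WriteFn issues one async write for a whole batch and invokes
  // on_complete when that write finishes (stand-in for a librados call)
  using WriteFn = std::function<void(const std::vector<std::string>&,
                                     std::function<void()>)>;
  explicit LogBatcher(WriteFn write) : write_(std::move(write)) {}

  void append(std::string entry) {
    std::unique_lock<std::mutex> lock{mutex_};
    pending_.push_back(std::move(entry));
    if (!in_flight_) {
      flush(std::move(lock));  // nothing in flight: start a batch now
    }
  }

 private:
  void flush(std::unique_lock<std::mutex> lock) {
    in_flight_ = true;
    auto batch = std::move(pending_);
    pending_.clear();
    lock.unlock();
    // issue one write for the whole batch; on completion, flush
    // whatever queued up in the meantime
    write_(batch, [this] {
      std::unique_lock<std::mutex> lock{mutex_};
      in_flight_ = false;
      if (!pending_.empty()) {
        flush(std::move(lock));
      }
    });
  }

  WriteFn write_;
  std::mutex mutex_;
  bool in_flight_ = false;
  std::vector<std::string> pending_;
};
```

The same one-in-flight structure should apply whether the backing write is an omap update or a cls_fifo append.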

>> The existing consistency model for object writes involves a) calling
>> cls_rgw to prepare a bucket index transaction, b) writing the object's
>> head to the data pool, then c) calling cls_rgw to complete the
>> transaction. Since the write in b) is what makes the object visible to
>> GET requests, we can reply to the client without waiting for c) to
>> finish. If either b) or c) fails, the next bucket listing will find an
>> entry that was prepared but not completed, and we'll check whether the
>> head object exists and use the 'dir suggest' call to update the bucket
>> index accordingly.
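The prepare/complete states and the dir-suggest recovery described above can be modeled roughly as follows. These are hypothetical types for illustration only, not the real cls_rgw structures; the data pool is reduced to a set of keys whose head objects exist:

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Illustrative model of the bucket index consistency check
// (hypothetical types, not the real cls_rgw structures)
enum class EntryState { Prepared, Completed };

struct BucketIndex {
  std::unordered_map<std::string, EntryState> entries;

  void prepare(const std::string& key) {
    entries[key] = EntryState::Prepared;   // step a)
  }
  void complete(const std::string& key) {
    entries[key] = EntryState::Completed;  // step c)
  }

  // dir-suggest style recovery: reconcile prepared-but-uncompleted
  // entries against the data pool, completing or dropping them based
  // on whether the head object exists
  void suggest(const std::unordered_set<std::string>& data_pool) {
    for (auto it = entries.begin(); it != entries.end();) {
      if (it->second == EntryState::Prepared) {
        if (data_pool.count(it->first)) {
          it->second = EntryState::Completed;  // head exists: complete
          ++it;
        } else {
          it = entries.erase(it);  // head missing: drop stale entry
        }
      } else {
        ++it;
      }
    }
  }
};
```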

>> If we move the replication log to a separate object, we'll need to write
>> to that as well before completing the transaction. And when dir suggest
>> finds head objects for uncompleted transactions, it can (re)write their
>> replication log entries before updating the bucket index. This recovery
>> means that we can still reply to the client before writing to the
>> replication log, so the client won't see any extra latency.
> We'd need to write to that log after the head has been written,
> otherwise clients reading it will race with the head creation. In that
> case I'm not sure how we avoid introducing extra latency, since we can
> complete the transaction only after the log entry has been written (or
> we'd potentially lose it).
> There is the option of only writing to the log (async) and then having
> a lazy process in a separate thread that goes over that log and
> completes the transactions, or handing it over to a separate workqueue.

These can still be asynchronous with respect to the http response, as long as the log writes succeed before we complete the index transactions. For example, by having the log write's AioCompletion callback schedule the index completion op. Batching makes this more complicated, but still has to provide the same guarantee.
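That chaining can be sketched in a few lines. `AsyncOp` and `finish_transaction` are hypothetical names (the real code would chain librados AioCompletion callbacks), but the ordering guarantee is the point: the index completion is only scheduled from the log write's completion callback, so a completed index entry always implies a durable log entry, and neither step sits on the client's response path.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an async rados op: invokes on_finish when
// the write completes (modeled after an AioCompletion callback)
using AsyncOp = std::function<void(std::function<void()>)>;

// Chain the index completion behind the replication log write, so the
// index never shows a completed entry whose log record could be lost.
// Both steps run after the client has already received its response.
void finish_transaction(AsyncOp write_log_entry, AsyncOp complete_index,
                        std::function<void()> on_done) {
  write_log_entry([complete_index, on_done] {
    complete_index(on_done);
  });
}
```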


>> This change also gives us the opportunity to move away from omap and
>> the challenges associated with trimming. Yehuda wrote cls_fifo in
>> https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and
>> that could be a good fit for these bucket replication logs as well.
>
> Yeah, cls_fifo would be a good fit for it.
>
> Yehuda

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
