The bucket index logs used for multisite replication are currently
stored in omap on the bucket index shards, along with the rest of the
bucket index entries. Storing them in the index was a natural choice,
because cls_rgw can write these log entries atomically when completing a
bucket index transaction.
To replicate a bucket, other zones process each of its bucket index
shard logs independently, and store sync status markers with their
position in each shard. This tight coupling between the replication
strategy and the bucket's sharding scheme is the main challenge to
supporting bucket resharding in multisite, because shuffling these log
entries to a new set of shards would invalidate the sync status markers
stored in other zones.
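To make that coupling concrete, here is a toy Python sketch of per-shard sync markers losing their meaning across a reshard. All names are hypothetical stand-ins; RGW's actual shard hashing, log format, and marker format differ.

```python
import zlib

def shard_of(key: str, num_shards: int) -> int:
    # stand-in for RGW's bucket index shard hash
    return zlib.crc32(key.encode()) % num_shards

# each index shard carries an append-only replication log
logs = {s: [] for s in range(2)}
for key in ("a.obj", "b.obj", "c.obj", "d.obj"):
    logs[shard_of(key, 2)].append(key)

# a peer zone records its position per shard: one marker per shard id
markers = {s: len(log) for s, log in logs.items()}

# resharding to 3 shards shuffles the entries onto different shards
new_logs = {s: [] for s in range(3)}
for log in logs.values():
    for key in log:
        new_logs[shard_of(key, 3)].append(key)

# the old markers are keyed by shard ids and positions that no longer
# describe the new layout, so the peer can't tell what it has processed
assert set(markers) != set(new_logs)
```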
My proposal, then, is to move the replication logs out of the bucket
index shards into a single log per bucket, and to extend the
consistency model to compensate for losing the atomic writes that
cls_rgw provides.
The existing consistency model for object writes involves a) calling
cls_rgw to prepare a bucket index transaction, b) writing the object's
head to the data pool, then c) calling cls_rgw to complete the
transaction. Since the write in b) is what makes the object visible to
GET requests, we can reply to the client without waiting for c) to
finish. If either b) or c) fails, the next bucket listing will find an
entry that was prepared but not completed, and we'll check whether the
head object exists and use the 'dir suggest' call to update the bucket
index accordingly.
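The three steps and the dir-suggest recovery path can be sketched like this. The names are hypothetical stand-ins for illustration, not the real cls_rgw or RADOS interfaces.

```python
# Toy model of the existing consistency model for object writes.
index = {}      # bucket index shard: key -> transaction state
data_pool = {}  # head objects

def prepare(key):
    # a) cls_rgw prepares a bucket index transaction
    index[key] = 'prepared'

def write_head(key, data):
    # b) the head write makes the object visible to GET requests,
    # so the client can be answered as soon as this completes
    data_pool[key] = data

def complete(key):
    # c) cls_rgw completes the transaction
    index[key] = 'completed'

def dir_suggest(key):
    # a bucket listing that finds a prepared-but-not-completed entry
    # checks for the head object and suggests a fix to the index
    if index.get(key) == 'prepared':
        if key in data_pool:
            index[key] = 'completed'   # head exists: keep the entry
        else:
            del index[key]             # no head: drop the stale entry

# crash between b) and c): the head exists, the completion never ran,
# and the next listing repairs the index entry
prepare('obj1')
write_head('obj1', b'data')
dir_suggest('obj1')
assert index['obj1'] == 'completed'
```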
If we move the replication log to a separate object, we'll need to write
to that as well before completing the transaction. And when dir suggest
finds head objects for uncompleted transactions, it can (re)write their
replication log entries before updating the bucket index. This recovery
path means that we can still reply to the client before the
replication log write finishes, so the client won't see any extra
latency.
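Extending the same toy model, the new log write and the recovery path would look roughly like this (again, hypothetical names only, not the real interfaces):

```python
# Toy model of the proposed scheme: a single replication log object
# per bucket, written before the index transaction completes.
index = {}
data_pool = {}
replication_log = []  # one log per bucket, outside the index shards

def prepare(key):
    index[key] = 'prepared'

def write_head(key, data):
    data_pool[key] = data   # client can be acked here, as before

def log_entry(key):
    # new step: record the change in the bucket's replication log
    if key not in replication_log:
        replication_log.append(key)

def complete(key):
    index[key] = 'completed'

def dir_suggest(key):
    # recovery: when the head exists for an uncompleted transaction,
    # (re)write its replication log entry before fixing the index
    if index.get(key) == 'prepared':
        if key in data_pool:
            log_entry(key)
            index[key] = 'completed'
        else:
            del index[key]

# crash right after the head write: neither the log entry nor the
# completion made it, but dir suggest restores both
prepare('obj1')
write_head('obj1', b'data')
dir_suggest('obj1')
assert index['obj1'] == 'completed'
assert 'obj1' in replication_log
```

Because log_entry is idempotent here, replaying the recovery after a partially-written log entry is safe; the real implementation would need an equivalent guard against duplicate log entries.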
This change also gives us the opportunity to move away from omap and the
challenges associated with trimming. Yehuda wrote cls_fifo in
https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and
that could be a good fit for these bucket replication logs as well.
_______________________________________________
Dev mailing list -- dev@xxxxxxx