rgw: decoupling the bucket replication log from the bucket index

The bucket index logs used for multisite replication are currently stored in omap on the bucket index shards, along with the rest of the bucket index entries. Storing them in the index was a natural choice, because cls_rgw can write these log entries atomically when completing a bucket index transaction.

To replicate a bucket, other zones process each of its bucket index shard logs independently, and store sync status markers with their position in each shard. This tight coupling between the replication strategy and the bucket's sharding scheme is the main challenge to supporting bucket resharding in multisite, because shuffling these log entries to a new set of shards would invalidate the sync status markers stored in other zones.
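To make the coupling concrete, here is a toy sketch of per-shard logs and per-shard sync markers. It is not rgw code: the names (shard_of, build_shard_logs, unprocessed), the crc32-based placement and the integer markers are all assumptions made for illustration.

```python
import zlib

def shard_of(key, num_shards):
    # Hypothetical placement function; rgw's real hashing differs.
    return zlib.crc32(key.encode()) % num_shards

def build_shard_logs(keys, num_shards):
    # One log per index shard, each holding the keys that hash to it.
    logs = [[] for _ in range(num_shards)]
    for key in keys:
        logs[shard_of(key, num_shards)].append(key)
    return logs

def unprocessed(logs, markers):
    # Entries a peer zone would still replay, given its per-shard positions.
    return [key
            for shard, log in enumerate(logs)
            for key in log[markers.get(shard, 0):]]

keys = ["obj%d" % i for i in range(8)]

# A peer zone has fully synced the bucket while it had 2 shards.
old_logs = build_shard_logs(keys, 2)
markers = {shard: len(log) for shard, log in enumerate(old_logs)}

# Resharding to 3 shards shuffles the same entries into different logs.
# The stored markers still name shards 0 and 1 and know nothing of shard 2,
# so their positions no longer describe what the peer has processed.
new_logs = build_shard_logs(keys, 3)
stale_view = unprocessed(new_logs, markers)
```

With a single log per bucket, the marker would be one position that does not depend on the shard count at all.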

My proposal, then, is to move the replication logs out of the bucket index shards into a single log per bucket, and to extend the consistency model to make up for losing the atomic writes that cls_rgw gives us today.

The existing consistency model for object writes involves a) calling cls_rgw to prepare a bucket index transaction, b) writing the object's head to the data pool, then c) calling cls_rgw to complete the transaction. Since the write in b) is what makes the object visible to GET requests, we can reply to the client without waiting for c) to finish. If either b) or c) fails, the next bucket listing will find an entry that was prepared but not completed, and we'll check whether the head object exists and use the 'dir suggest' call to update the bucket index accordingly.
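As a rough illustration of the steps above, here is a minimal in-memory model. The Bucket class, its method names and the fail_after parameter are hypothetical, and the sketch treats every pending entry as abandoned, which ignores in-flight writes and concurrency entirely.

```python
class Bucket:
    def __init__(self):
        self.index = {}         # key -> "pending" or "complete"
        self.data_pool = set()  # keys whose head object exists

    def prepare(self, key):
        # a) prepare the bucket index transaction
        self.index[key] = "pending"

    def write_head(self, key):
        # b) write the head object; the object is now visible to GET
        self.data_pool.add(key)

    def complete(self, key):
        # c) complete the bucket index transaction
        self.index[key] = "complete"

    def dir_suggest(self, key):
        # Reconcile a prepared-but-not-completed entry against the data pool.
        if key in self.data_pool:
            self.index[key] = "complete"
        else:
            del self.index[key]

    def list_objects(self):
        # Listing repairs any pending entries it comes across.
        for key, state in list(self.index.items()):
            if state == "pending":
                self.dir_suggest(key)
        return sorted(self.index)


def put(bucket, key, fail_after=None):
    """Run a), b), c); fail_after simulates a crash after the named step."""
    bucket.prepare(key)
    if fail_after == "prepare":
        return False            # b) never happened, so the write failed
    bucket.write_head(key)
    if fail_after == "head":
        return True             # client was already answered after b)
    bucket.complete(key)
    return True
```

A listing after a failed b) drops the stale entry, while a listing after a failed c) completes it, so the index converges on what is actually in the data pool.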

If we move the replication log to a separate object, we'll need to write to that as well before completing the transaction. And when dir suggest finds head objects for uncompleted transactions, it can (re)write their replication log entries before updating the bucket index. This recovery means that we can still reply to the client before writing to the replication log, so the client won't see any extra latency.
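The same kind of toy model can show where the separate log write and its recovery would fit. Again, every name here is hypothetical and the plain list standing in for the per-bucket log is an assumption; the sketch also shows that a (re)write during recovery can produce a duplicate entry, which is a property of this sketch rather than a statement about the final design.

```python
class Bucket:
    def __init__(self):
        self.index = {}         # key -> "pending" or "complete"
        self.data_pool = set()  # keys whose head object exists
        self.repl_log = []      # single per-bucket log in its own object

    def put(self, key, fail_after=None):
        """fail_after simulates a crash after the named step."""
        self.index[key] = "pending"   # prepare the index transaction
        self.data_pool.add(key)       # write head; client can be answered now
        if fail_after == "head":
            return
        self.repl_log.append(key)     # new step: write the replication log
        if fail_after == "log":
            return
        self.index[key] = "complete"  # complete the index transaction

    def dir_suggest(self, key):
        if key in self.data_pool:
            # Head exists but the transaction never completed: (re)write the
            # log entry first, then fix up the index.
            self.repl_log.append(key)
            self.index[key] = "complete"
        else:
            del self.index[key]

    def list_objects(self):
        for key, state in list(self.index.items()):
            if state == "pending":
                self.dir_suggest(key)
        return sorted(self.index)
```

In this model every object with a head ends up in the log at least once, whichever step failed, and the log write never sits between the head write and the client reply.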

This change also gives us the opportunity to move away from omap and the challenges associated with trimming. Yehuda wrote cls_fifo in https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and that could be a good fit for these bucket replication logs as well.