The bucket index logs used for multisite replication are currently
stored in omap on the bucket index shards, along with the rest of the
bucket index entries. Storing them in the index was a natural choice,
because cls_rgw can write these log entries atomically when completing a
bucket index transaction.
To replicate a bucket, other zones process each of its bucket index
shard logs independently, and store sync status markers with their
position in each shard. This tight coupling between the replication
strategy and the bucket's sharding scheme is the main challenge to
supporting bucket resharding in multisite, because shuffling these log
entries to a new set of shards would invalidate the sync status markers
stored in other zones.
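To make that coupling concrete, here is a toy Python sketch of per-shard sync markers losing their meaning across a reshard. All names are hypothetical stand-ins; RGW's actual shard hashing, log format, and marker format differ.

```python
import zlib

def shard_of(key: str, num_shards: int) -> int:
    # stand-in for RGW's bucket index shard hash
    return zlib.crc32(key.encode()) % num_shards

# each index shard carries an append-only replication log
logs = {s: [] for s in range(2)}
for key in ("a.obj", "b.obj", "c.obj", "d.obj"):
    logs[shard_of(key, 2)].append(key)

# a peer zone records its position per shard: one marker per shard id
markers = {s: len(log) for s, log in logs.items()}

# resharding to 3 shards shuffles the entries onto different shards
new_logs = {s: [] for s in range(3)}
for log in logs.values():
    for key in log:
        new_logs[shard_of(key, 3)].append(key)

# the old markers are keyed by shard ids and positions that no longer
# describe the new layout, so the peer can't tell what it has processed
assert set(markers) != set(new_logs)
```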
My proposal, then, is to move the replication logs out of the bucket
index shards into a single log per bucket, and to extend the
consistency model to compensate for losing the atomic writes that
cls_rgw provides.
The existing consistency model for object writes involves a) calling
cls_rgw to prepare a bucket index transaction, b) writing the object's
head to the data pool, then c) calling cls_rgw to complete the
transaction. Since the write in b) is what makes the object visible to
GET requests, we can reply to the client without waiting for c) to
finish. If either b) or c) fails, the next bucket listing will find an
entry that was prepared but not completed, and we'll check whether the
head object exists and use the 'dir suggest' call to update the bucket
index accordingly.
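The three steps and the dir-suggest recovery path can be sketched like this. The names are hypothetical stand-ins for illustration, not the real cls_rgw or RADOS interfaces.

```python
# Toy model of the existing consistency model for object writes.
index = {}      # bucket index shard: key -> transaction state
data_pool = {}  # head objects

def prepare(key):
    # a) cls_rgw prepares a bucket index transaction
    index[key] = 'prepared'

def write_head(key, data):
    # b) the head write makes the object visible to GET requests,
    # so the client can be answered as soon as this completes
    data_pool[key] = data

def complete(key):
    # c) cls_rgw completes the transaction
    index[key] = 'completed'

def dir_suggest(key):
    # a bucket listing that finds a prepared-but-not-completed entry
    # checks for the head object and suggests a fix to the index
    if index.get(key) == 'prepared':
        if key in data_pool:
            index[key] = 'completed'   # head exists: keep the entry
        else:
            del index[key]             # no head: drop the stale entry

# crash between b) and c): the head exists, the completion never ran,
# and the next listing repairs the index entry
prepare('obj1')
write_head('obj1', b'data')
dir_suggest('obj1')
assert index['obj1'] == 'completed'
```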
If we move the replication log to a separate object, we'll need to write
to that as well before completing the transaction. And when dir suggest
finds head objects for uncompleted transactions, it can (re)write their
replication log entries before updating the bucket index. This recovery
path means that we can still reply to the client before the
replication log write finishes, so the client won't see any extra
latency.
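Extending the same toy model, the new log write and the recovery path would look roughly like this (again, hypothetical names only, not the real interfaces):

```python
# Toy model of the proposed scheme: a single replication log object
# per bucket, written before the index transaction completes.
index = {}
data_pool = {}
replication_log = []  # one log per bucket, outside the index shards

def prepare(key):
    index[key] = 'prepared'

def write_head(key, data):
    data_pool[key] = data   # client can be acked here, as before

def log_entry(key):
    # new step: record the change in the bucket's replication log
    if key not in replication_log:
        replication_log.append(key)

def complete(key):
    index[key] = 'completed'

def dir_suggest(key):
    # recovery: when the head exists for an uncompleted transaction,
    # (re)write its replication log entry before fixing the index
    if index.get(key) == 'prepared':
        if key in data_pool:
            log_entry(key)
            index[key] = 'completed'
        else:
            del index[key]

# crash right after the head write: neither the log entry nor the
# completion made it, but dir suggest restores both
prepare('obj1')
write_head('obj1', b'data')
dir_suggest('obj1')
assert index['obj1'] == 'completed'
assert 'obj1' in replication_log
```

Because log_entry is idempotent here, replaying the recovery after a partially-written log entry is safe; the real implementation would need an equivalent guard against duplicate log entries.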
This change also gives us the opportunity to move away from omap and the
challenges associated with trimming. Yehuda wrote cls_fifo in
https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and
that could be a good fit for these bucket replication logs as well.
_______________________________________________
Dev mailing list -- dev@xxxxxxx