On Mon, Dec 9, 2019 at 11:35 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> The bucket index logs used for multisite replication are currently
> stored in omap on the bucket index shards, along with the rest of the
> bucket index entries. Storing them in the index was a natural choice,
> because cls_rgw can write these log entries atomically when completing a
> bucket index transaction.
>
> To replicate a bucket, other zones process each of its bucket index
> shard logs independently, and store sync status markers with their
> position in each shard. This tight coupling between the replication
> strategy and the bucket's sharding scheme is the main challenge to
> supporting bucket resharding in multisite, because shuffling these log
> entries to a new set of shards would invalidate the sync status markers
> stored in other zones.

Note that the new bucket granularity work tackles part of the problem and
lays the foundation for solving it by managing unbalanced replication,
where the source bucket and the destination bucket have a different number
of shards. So when a (target) bucket is resharded, we could still craft
new markers for it to track its original source, even if they don't have
the same number of shards (this is not implemented, but should be
relatively easy to do). In the other direction, when the source is
resharded, the new bucket instance is currently handled as a new bucket at
the target, so it triggers a full sync, but only actual new entries are
fetched. We can probably find a better, more optimal solution where we
finish syncing the old entries from the old instance and have the new one
set to incremental sync from the start.

That being said, I'm not against decoupling the logs and the index.

>
> My proposal, then, is to move the replication logs out of bucket index
> shards into a single log per bucket, and extend the consistency model to
> make up for the lack of atomic writes that we get from cls_rgw.

Sharded log? If it's not sharded, then you're going to introduce an object
that all IO for that bucket will serialize over under high enough pressure
(as I assume writes to it will be async).

>
> The existing consistency model for object writes involves a) calling
> cls_rgw to prepare a bucket index transaction, b) writing the object's
> head to the data pool, then c) calling cls_rgw to complete the
> transaction. Since the write in b) is what makes the object visible to
> GET requests, we can reply to the client without waiting for c) to
> finish. If either b) or c) fails, the next bucket listing will find an
> entry that was prepared but not completed, and we'll check whether the
> head object exists and use the 'dir suggest' call to update the bucket
> index accordingly.
>
> If we move the replication log to a separate object, we'll need to write
> to that as well before completing the transaction. And when dir suggest
> finds head objects for uncompleted transactions, it can (re)write their
> replication log entries before updating the bucket index. This recovery
> means that we can still reply to the client before writing to the
> replication log, so the client won't see any extra latency.

We'd need to write to that log after the head has been written, otherwise
clients reading it will race with the head creation. In which case I'm not
sure how we avoid introducing extra latency, since we can complete the
transaction only after the log entry has been written (or we'd potentially
lose it).
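
To make the ordering concrete, here's a rough in-memory sketch of the put
path with the log in its own per-bucket object (all of the names below are
made up for illustration, not the actual cls_rgw/RGW interfaces):

import uuid

bucket_index = {}      # key -> {"tag": ..., "state": "prepared"|"complete"}
head_objects = {}      # key -> data, standing in for the data pool
replication_log = []   # standing in for the separate per-bucket log object

def put_object(key, data):
    # a) prepare the bucket index transaction on the index shard
    tag = uuid.uuid4().hex
    bucket_index[key] = {"tag": tag, "state": "prepared"}

    # b) write the head object; this is what makes the object visible,
    #    so the reply to the client can go out at this point
    head_objects[key] = data

    # the replication log entry has to land *after* the head write,
    # otherwise a peer zone reading the log races with the head creation
    replication_log.append({"key": key, "tag": tag})

    # c) complete the transaction; if the log entry must be durable
    #    before this point, completion now waits on the log write
    bucket_index[key]["state"] = "complete"

def dir_suggest(key):
    # a bucket listing found a prepared-but-uncompleted entry: check the
    # head and (re)write its replication log entry before fixing the index
    entry = bucket_index.get(key)
    if entry is None or entry["state"] == "complete":
        return
    if key in head_objects:
        replication_log.append({"key": key, "tag": entry["tag"]})
        entry["state"] = "complete"
    else:
        # the head write never happened, so drop the dangling entry
        del bucket_index[key]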
There is the option to only write to the log (async) and then have a lazy
process in a separate thread that goes over that log and completes the
transactions, or hand it over to a separate workqueue.

>
> This change also gives us the opportunity to move away from omap and the
> challenges associated with trimming. Yehuda wrote cls_fifo in
> https://github.com/ceph/ceph/pull/30797 with the datalog in mind, and
> that could be a good fit for these bucket replication logs as well.

Yeah, cls_fifo would be a good fit for it.

Yehuda