I've been looking recently at the creation of a kubernetes service broker for rgw. The idea is to be able to provision a bucket and/or a user so that it could be used by apps inside kubernetes through the service controller. What's needed from rgw is the ability to create a user/bucket and to tear everything down through the broker. While the broker could just leverage the existing s3/swift and admin apis that rgw provides in order to achieve all that, it would be a very cumbersome and fragile solution. Implementing things like data retention (e.g., remove the data, but only after a week so that we can cancel the operation if needed) is pretty complicated with the current framework, and that's probably the wrong place to do it anyway. In addition, the general issue of user tear-down has been brought up by users many times in the past, and there hasn't been a sufficient solution.

I propose we introduce a new job scheduler that will provide this kind of service. At first it will allow scheduling an immediate or future removal of users and buckets, including a purge of the user's data. Other deferred-action jobs that we could add later would be, for example, rgw data[structures] scrub, orphan search, and bucket resharding.

Such a service requires adding a new index that holds job info (indexed both by job id and by the timestamp at which the job should be executed). It will also require adding new admin commands and apis to control it. This is relatively easy; however, things get much more complicated if we need it to work correctly in a multi-site environment. For example, a user data purge in a multi-site environment that is scheduled on the master zone needs to propagate to all other zones. For the sake of simplicity, I think that scheduling of jobs, unless targeted at a specific zone, needs to happen on the master zone (of the master zonegroup).

Let's break this task into separate problems:

1. Propagate job scheduling

I suggest we add a new metadata section for scheduled jobs. Any new job that is created will have a new metadata object created to represent it. The creation of that object will also register the job in the local jobs index. A metadata object will be synced from the master zone to all other zones, thus:

- jobs will propagate to other zones
- old jobs that are already scheduled and were created before a zone was created will also be synced (via full sync)

Once a job is complete, its metadata entry will be removed. A potential race condition exists if a job is created and executed (and thus removed) on the master before it got synced to the remote zone. A solution to that would be to use the rados refcount on the job meta object. We'll hold an extra reference to the object as long as it's being referenced by the metadata log. Once that log trims the specific entry, it means that all other zones have already synced that object, so it can be removed.

A job cancellation will be done by modifying a property in the job meta object and removing the job from the local jobs index. The cancellation will spread to all other zones via meta sync.

2. Job execution

Similar to other rgw deferred work threads (such as gc, lifecycle, and resharding), the jobs thread will use rados object locks to coordinate execution with other radosgw processes in the same zone. The thread will then iterate over the jobs index, fetch the jobs that are currently due, and start executing them.
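To make the dual index a bit more concrete, here is a minimal sketch of what a job record and the two index key forms could look like, assuming the jobs index is kept as omap entries on a rados object (all names below are hypothetical, not existing rgw code):

  // Hypothetical sketch of the proposed jobs index keying; not actual rgw code.
  #include <cstdint>
  #include <cstdio>
  #include <string>

  enum class JobType { UserPurge, BucketPurge, Scrub, OrphanSearch, Reshard };

  struct JobInfo {
    std::string id;          // unique job id (e.g. a uuid string)
    JobType type;
    uint64_t exec_time;      // epoch seconds at which the job becomes due
    bool cancelled = false;  // flipped via the job meta object, synced from master
  };

  // Key for lookup by job id (cancellation, status queries).
  std::string id_index_key(const JobInfo& job) {
    return "id." + job.id;
  }

  // Key for lookup by execution time: zero-padded so that a lexicographic
  // range scan from "sched." up to "sched.<now>" returns exactly the due jobs.
  std::string sched_index_key(const JobInfo& job) {
    char buf[80];
    snprintf(buf, sizeof(buf), "sched.%016llu.%s",
             (unsigned long long)job.exec_time, job.id.c_str());
    return buf;
  }

  int main() {
    JobInfo job{"job-0001", JobType::UserPurge, 1500000000};
    printf("%s\n%s\n", id_index_key(job).c_str(), sched_index_key(job).c_str());
    return 0;
  }

The "id." keys would mainly serve cancellation and status lookups, while the zero-padded "sched." keys would let the jobs thread pick up everything that is due with a single bounded range scan.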
A job can be cancelled up until it has been executed; we'll probably want the master to lead here (so that we can know whether a job was cancelled successfully or not). We can add another flag that the master will set on the job meta object, which will signify that the job has started executing (thus allowing all other zones to start executing the job); see the rough sketch in the P.S. below.

The problem in multi-site is that operations that happen on the master affect the other zones. As an example, let's take a user removal job that is scheduled on the master. A job metadata object is generated on the master, which triggers the creation of an entry in the jobs index at the master zone. The job metadata object is synced to the other zones, and it generates job entries in the respective jobs indexes of each zone. Then, once the time comes for that job to execute, it starts executing on the master and then on all other zones. This can lead to a case where the master has already removed its local data and metadata for that user, while other zones still haven't. A user metadata object removal at the master zone will then trigger a sync operation of this object's removal (as well as removal of all the bucket info metadata objects for all the buckets that the user had) that will be sent to all other zones. This can lead to a case where we'd lose the metadata objects for the user and for all its buckets in a zone that hasn't yet removed the actual user's data. This is a problem since at that point rgw will not be able to figure out what data needs to be removed.

One way to deal with it is by modifying the metadata sync to not remove the user and bucket instance meta objects directly, but rather create a local job to do that (one that will also include data purging), or just defer to an existing job (if possible). My issue with this solution is that it complicates the state of the metadata object sync. Maybe instead of deferring the removal of the user/bucket metadata object, we could generate a mutated copy of them (under a different name) that will later be used for the data purge.

Any thoughts?

Yehuda
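P.S. Here's a rough sketch of the per-zone decision for a due job that I had in mind above, assuming the 'cancelled' and 'started' flags live on the job meta object, are only ever set by the master zone, and reach the other zones via meta sync (again, hypothetical names, not actual rgw code):

  // Hypothetical sketch of how a zone decides what to do with a due job,
  // assuming master-led cancellation and a master-set 'started' flag.
  #include <cstdio>

  struct JobFlags {
    bool cancelled = false;  // master marked the job as cancelled
    bool started = false;    // master marked the job as executing
  };

  enum class Action { Drop, Wait, Execute };

  Action decide(const JobFlags& flags, bool is_master_zone, bool is_due) {
    if (flags.cancelled)
      return Action::Drop;      // remove the entry from the local jobs index
    if (!is_due)
      return Action::Wait;      // not yet scheduled to run
    if (is_master_zone)
      return Action::Execute;   // master sets 'started' and starts first
    // non-master zones wait until the master has flagged the job as started
    return flags.started ? Action::Execute : Action::Wait;
  }

  int main() {
    JobFlags flags;
    flags.started = true;
    Action a = decide(flags, /*is_master_zone=*/false, /*is_due=*/true);
    printf("action=%d\n", static_cast<int>(a));
    return 0;
  }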