On Tue, Mar 6, 2018 at 5:04 PM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
> I've been looking recently at the creation of a kubernetes service
> broker for rgw. The idea is to be able to provision a bucket and/or a
> user so that it could be used by apps inside kubernetes through the
> service controller. What's needed from rgw is the ability to
> create a user/bucket and to tear down everything through the broker.
> While the broker can just leverage the existing s3/swift and admin
> apis that rgw provides in order to achieve all that, it would be a
> very cumbersome and fragile solution. Implementing things like data
> retention (e.g., remove the data, but only after a week so that we can
> cancel the operation if needed) is pretty complicated with the current
> framework, and that's probably the wrong place to do it anyway.
> In addition, the general issue of a user tear-down has been brought up
> by users many times in the past, and there hasn't been a sufficient
> solution.
> I propose we introduce a new job scheduler that will provide this
> kind of service. At first it will allow scheduling a current or future
> user and bucket removal that would include purging the user's data. Other
> deferred action jobs that we can add would be, for example, rgw
> data[structures] scrub, orphan search, bucket resharding.
>
> Such a service requires adding a new index that would hold job info
> (indexed by both job id and the timestamp of when the job should be
> executed). It will also require adding new admin commands and apis to
> control it. This is relatively easy; however, things get much more
> complicated if we need it to work correctly in a multi-site
> environment. For example, a user data purge in a multi-site
> environment that is scheduled on the master zone needs to propagate to
> all other zones.
> For the sake of simplicity, I think that scheduling
> of jobs, unless targeted at a specific zone, needs to happen on the
> master zone (of the master zonegroup).
>
> Let's split this task into separate problems:
>
> 1. Propagate job scheduling
>
> I suggest we add a new metadata section for scheduled jobs. Any new
> job that is created will have a new metadata object created to
> represent it. Creating that object will also register the job in
> the local jobs index.
> A metadata object will be synced from the master zone to all other zones, thus:
> - jobs will propagate to other zones
> - old jobs that are already scheduled and were created before a zone
>   was created will also be synced (via full sync)
>
> Once a job is complete, its metadata entry will be removed.
>
> A potential race condition exists if a job is created and executed
> (and thus removed) on the master before it is synced to the remote
> zone. A solution to that would be to use the rados refcount on the
> job meta object. We'll hold an extra reference to the object as long
> as it's being referenced by the metadata log. Once that log trims the
> specific entry, it means that all other zones have already synced that
> object, so it can be removed.
>
> A job cancellation will be done by modifying a property in the job
> meta object, and removing the job from the local jobs index. The
> cancellation will spread to all other zones via meta sync.
>
> 2. Job execution
>
> Similar to other rgw deferred work threads (such as gc, lifecycle, and
> resharding), the jobs thread will use rados object locks to coordinate
> execution with other radosgw processes in the same zone. The thread
> will then iterate over the jobs index, fetch the jobs that are
> current, and start executing them.
>
> A job can be cancelled until it is executed; we'll probably want the
> master to lead here (so that we can know whether a job was cancelled
> successfully or not).
> We can add another flag that the master will set
> on the job meta object, which will signify that a job has started
> executing (thus allowing all other zones to start executing the job).
>
> The problem in multi-site is that operations that happen on the master
> affect the other zones. As an example, let's take a user removal job
> that is scheduled on the master. A job metadata object is generated on
> the master, which triggers the creation of an entry in the jobs index
> at the master zone. The job metadata object is synced to other zones,
> and it generates jobs entries in the respective jobs indexes of each
> zone. Then, once the time comes for that job to execute, it starts executing
> on the master and then on all other zones. This can lead to a case
> where the master already removed its local data and metadata for that
> user, while other zones still haven't. A user metadata object removal
> at the master zone will then trigger a sync operation of this object's
> removal (as well as removal of all the bucket info metadata objects
> for all the buckets that the user had) that will be sent to all other
> zones. This can lead to a case where we'd lose the metadata objects
> for the user and for all its buckets in a zone that hasn't removed the
> actual user's data. This is a problem since at that point rgw will not
> be able to figure out what data needs to be removed.
> One way to deal with it is by modifying the metadata sync to not
> remove the user and bucket instance meta objects directly, but rather
> create a local job to do that (that will also include data purging),
> or just defer to an existing job (if possible). My issue with this
> solution is that it complicates the state of the metadata objects
> sync.
> Maybe, instead of deferring the removal of the user/bucket metadata
> objects, we could generate a mutated copy of them (using a different
> name) that will later be used for data purge.
>
> Any thoughts?
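
For illustration only, the jobs index described in the quoted proposal (keyed by job id and by execution time, with cancellation allowed until a job runs) could be sketched roughly as below. This is a toy in-memory model, not rgw code: the names JobIndex, schedule, cancel and pop_due are hypothetical, and a real implementation would presumably keep the index in omap and coordinate through rados object locks.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class JobIndex:
    """Toy in-memory stand-in for the proposed jobs index (hypothetical)."""
    by_id: dict = field(default_factory=dict)    # job_id -> (run_at, action)
    by_time: list = field(default_factory=list)  # min-heap of (run_at, job_id)

    def schedule(self, job_id, run_at, action):
        # Register the job under both keys: its id and its execution time.
        self.by_id[job_id] = (run_at, action)
        heapq.heappush(self.by_time, (run_at, job_id))

    def cancel(self, job_id):
        # Returns True only if the job was still pending (not yet executed).
        return self.by_id.pop(job_id, None) is not None

    def pop_due(self, now):
        # Fetch and remove every job whose execution time has arrived.
        due = []
        while self.by_time and self.by_time[0][0] <= now:
            run_at, job_id = heapq.heappop(self.by_time)
            entry = self.by_id.get(job_id)
            # Skip heap entries left behind by a cancel or a reschedule.
            if entry is not None and entry[0] == run_at:
                del self.by_id[job_id]
                due.append((job_id, entry[1]))
        return due


index = JobIndex()
index.schedule("purge-user-a", run_at=100, action="user-purge")
index.schedule("purge-user-b", run_at=200, action="user-purge")
index.cancel("purge-user-b")
print(index.pop_due(now=250))  # [('purge-user-a', 'user-purge')]
```

Cancelled jobs are dropped lazily from the time-ordered heap here, which keeps cancel cheap. The sketch deliberately ignores the multi-site ordering problem raised in the proposal.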
You've got a very specific need you're trying to fulfill, and your
solution sounds very general. This is good, but it's still deeply tied
in to RGW itself. Did you consider doing something built more deeply
into the common Ceph/RADOS components and consumed by others? Or
integrating some existing job scheduler into a Ceph environment? (I
don't know about the specifics, but you could for instance run
beanstalk backed by an RBD volume or something.)

That may just be ridiculous based on the scope of work here, but a more
purpose-built tool could be a lot more efficient, and see a lot more
development, than shoving more data into omap and having another thread
each RGW runs that periodically polls things. Or maybe I'm
overestimating how many jobs you expect this to handle and a simple
built-to-purpose approach is the right answer.
-Greg