On Fri, Mar 9, 2018 at 7:30 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Tue, Mar 6, 2018 at 5:04 PM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>> I've been looking recently at the creation of a kubernetes service broker for rgw. The idea is to be able to provision a bucket and/or a user so that it could be used by apps inside kubernetes through the service controller. What's needed from rgw is the ability to create a user/bucket and to tear down everything through the broker. While the broker could just leverage the existing s3/swift and admin apis that rgw provides to achieve all that, it would be a very cumbersome and fragile solution. Implementing things like data retention (e.g., remove the data, but only after a week so that we can cancel the operation if needed) is pretty complicated within the current framework, and that's probably the wrong place to do it anyway. In addition, the general issue of user tear-down has been brought up by users many times in the past, and there hasn't been a sufficient solution.
>> I propose we introduce a new job scheduler that will provide this kind of service. At first it will allow scheduling a current or future removal of users and buckets, including a purge of the user's data. Other deferred-action jobs that we could add would be, for example, rgw data[structures] scrub, orphan search, and bucket resharding.
>>
>> Such a service requires adding a new index that would hold job info (indexed both by job id and by the timestamp of when the job should be executed). It will also require adding new admin commands and apis to control it. This is relatively easy; however, things get much more complicated if we need it to work correctly in a multi-site environment. For example, a user data purge in a multi-site environment that is scheduled on the master zone needs to propagate to all other zones. For the sake of simplicity, I think that scheduling of jobs, unless targeted at a specific zone, needs to happen on the master zone (of the master zonegroup).
>>
>> Let's separate this task into separate problems:
>>
>> 1. Propagate job scheduling
>>
>> I suggest we add a new metadata section for scheduled jobs. Any new job that is created will have a new metadata object created to represent it. The creation of that object will also register the job in the local jobs index.
>> A metadata object will be synced from the master zone to all other zones, thus:
>> - jobs will propagate to other zones
>> - old jobs that are already scheduled and were created before a zone was created will also be synced (via full sync)
>>
>> Once a job is complete, its metadata entry will be removed.
>>
>> A potential race condition exists if a job is created and executed (and thus removed) on the master before it has been synced to a remote zone. A solution would be to use the rados refcount on the job meta object. We'll hold an extra reference to the object as long as it's being referenced by the metadata log. Once that log trims the specific entry, it means that all other zones have already synced that object, so it can be removed.
>>
>> A job cancellation will be done by modifying a property in the job meta object and removing the job from the local jobs index. The cancellation will spread to all other zones via meta sync.
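
To make (1) a bit more concrete, here is a rough python sketch of what the job meta object and the dual-keyed jobs index could look like. The pool name, the key scheme, and the schedule_job() helper are made up for illustration only; they are not existing rgw code.

import json
import time
import uuid

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('default.rgw.log')  # placeholder pool name

def schedule_job(job_type, target, execute_at):
    """Create a job meta object and register it in the local jobs index."""
    job_id = str(uuid.uuid4())
    meta = {
        'id': job_id,
        'type': job_type,          # e.g. 'user-purge'
        'target': target,          # e.g. a user id
        'execute_at': execute_at,  # unix timestamp
        'cancelled': False,
        'started': False,
    }
    blob = json.dumps(meta).encode()

    # The job meta object: the piece that metadata sync would ship from
    # the master zone to all other zones.
    ioctx.write_full('job.meta.' + job_id, blob)

    # The local jobs index: omap entries on a well-known object, written
    # under two keys so jobs can be scanned in due-time order and also
    # looked up by id (e.g. for cancellation).
    time_key = '1_%016x_%s' % (int(execute_at), job_id)
    id_key = '2_' + job_id
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, (time_key, id_key), (blob, blob))
        ioctx.operate_write_op(op, 'jobs.index')
    return job_id

# e.g. purge a user's data a week from now (cancellable until it runs)
schedule_job('user-purge', 'some-user-id', time.time() + 7 * 24 * 3600)
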
>> 2. Job execution
>>
>> Similar to other rgw deferred work threads (such as gc, lifecycle, and resharding), the jobs thread will use rados object locks to coordinate execution with other radosgw processes in the same zone. The thread will then iterate over the jobs index, fetch the jobs that are currently due, and start executing them.
>>
>> A job can be cancelled until it is executed; we'll probably want the master to lead here (so that we can know whether a job was cancelled successfully or not). We can add another flag that the master will set on the job meta object, which will signify that a job has started executing (thus allowing all other zones to start executing the job).
>>
>> The problem in multi-site is that operations that happen on the master affect the other zones. As an example, let's take a user removal job that is scheduled on the master. A job metadata object is generated on the master, which triggers the creation of an entry in the jobs index at the master zone. The job metadata object is synced to other zones, and it generates job entries in the respective jobs indexes of each zone. Then, once the time comes for that job to execute, it starts executing on the master and then on all other zones. This can lead to a case where the master has already removed its local data and metadata for that user, while other zones still haven't. A user metadata object removal at the master zone will then trigger a sync operation of this object's removal (as well as removal of all the bucket info metadata objects for all the buckets that the user had) that will be sent to all other zones. This can lead to a case where we'd lose the metadata objects for the user and for all its buckets in a zone that hasn't removed the actual user's data. This is a problem, since at that point rgw will not be able to figure out what data needs to be removed.
>> One way to deal with it is to modify the metadata sync so that it does not remove the user and bucket instance meta objects directly, but rather creates a local job to do that (one that will also include data purging), or defers to an existing job (if possible). My issue with this solution is that it complicates the state of the metadata objects sync.
>> Maybe, instead of deferring the removal of the user/bucket metadata objects, we could generate a mutated copy of them (under a different name) that will later be used for the data purge.
>>
>> Any thoughts?
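
Before getting to your comments: to make the execution model in (2) a bit more concrete, here is a rough sketch of one cycle of the per-zone jobs thread. Again, the object/pool names and execute_job() are placeholders for illustration, not existing rgw code; the lock cookie and scan window are arbitrary.

import json
import time

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('default.rgw.log')  # placeholder pool name

def execute_job(job):
    # placeholder for the actual work (purge data, remove user, ...)
    print('executing', job['type'], job['target'])

def run_due_jobs():
    now = time.time()
    try:
        # rados object lock: only one radosgw instance in the zone runs
        # this cycle, the rest simply back off until the next interval.
        ioctx.lock_exclusive('jobs.index', 'jobs_scheduler', 'cookie',
                             desc='deferred jobs run', duration=60)
    except rados.ObjectBusy:
        return
    try:
        with rados.ReadOpCtx() as op:
            # '1_' is the time-keyed half of the index from the previous
            # sketch, so entries come back ordered by execution time.
            it, _ = ioctx.get_omap_vals(op, '', '1_', 128)
            ioctx.operate_read_op(op, 'jobs.index')
            for key, val in it:
                job = json.loads(val)
                if job['execute_at'] > now:
                    break          # nothing later in the index is due yet
                if job['cancelled']:
                    continue
                # A non-master zone would also wait here for the 'started'
                # flag that the master sets on the job meta object.
                execute_job(job)
    finally:
        ioctx.unlock('jobs.index', 'jobs_scheduler', 'cookie')
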
> You've got a very specific need you're trying to fulfill, and your solution sounds very general. This is good, but it's still deeply tied into RGW itself.
>
> Did you consider doing something built more deeply into the common Ceph/RADOS components and consumed by others? Or integrating some existing job scheduler into a Ceph environment? (I don't know about the specifics, but you could for instance run beanstalk backed by an RBD volume or something.)

The idea of a generic system that can use an actual message broker sounds great. I would be concerned to even consider using beanstalk; if we go down that route, let's please use a robust/scalable solution like RabbitMQ.

> That may just be ridiculous based on the scope of work here, but a more purpose-built tool could be a lot more efficient, and see a lot more development, than shoving more data into omap and having another thread in each RGW that periodically polls things?

This is a great point too. A message broker handles all of these with ease (no need for polling), and can accommodate all kinds of rules (expiration, scheduling, etc.), permissions, and propagation of jobs. (A rough sketch of the delayed-delivery pattern I have in mind is at the bottom of this message.)

> Or maybe I'm overestimating how many jobs you expect this to handle and a simple built-to-purpose approach is the right answer.
> -Greg
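
For reference, the kind of RabbitMQ pattern I'd have in mind for deferred jobs is per-message TTL plus a dead-letter exchange, something along these lines (queue names and the one-week delay are placeholders):

import json

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

# Work queue that the actual job executors consume from.
ch.queue_declare(queue='rgw.jobs', durable=True)

# Delay queue: messages sit here until their TTL expires, then get
# dead-lettered into the work queue via the default exchange.
ch.queue_declare(queue='rgw.jobs.delay', durable=True, arguments={
    'x-dead-letter-exchange': '',
    'x-dead-letter-routing-key': 'rgw.jobs',
})

job = {'type': 'user-purge', 'target': 'some-user-id'}
week_ms = str(7 * 24 * 3600 * 1000)  # per-message TTL is in milliseconds
ch.basic_publish(exchange='', routing_key='rgw.jobs.delay',
                 body=json.dumps(job),
                 properties=pika.BasicProperties(expiration=week_ms,
                                                 delivery_mode=2))
conn.close()

One caveat with per-message TTLs is that RabbitMQ only expires messages from the head of a queue, so mixed delays would need either one delay queue per retention period or the delayed-message exchange plugin.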