Re: rgw job scheduler

On Fri, Mar 9, 2018 at 7:30 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Tue, Mar 6, 2018 at 5:04 PM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>> I've been looking recently at the creation of a kubernetes service
>> broker for rgw. The idea is to be able to provision a bucket and/or a
>> user so that it could be used by apps inside kubernetes through the
>> service controller. As far as rgw is concerned, what's needed is the
>> ability to create a user/bucket and to tear everything down through the
>> broker.
>> While the broker can just leverage the existing s3/swift and admin
>> apis that rgw provides in order to achieve all that, it would be a
>> very cumbersome and fragile solution. Implementing things like data
>> retention (e.g., remove the data, but only after a week so that we can
>> cancel the operation if needed) is pretty complicated with the current
>> framework, and that's probably the wrong place to do it anyway.
>> In addition, the general issue of user tear-down has been brought up
>> by users many times in the past, and there hasn't been a sufficient
>> solution.
>> I propose we introduce a new job scheduler that will provide this
>> kind of service. At first it will allow scheduling immediate or future
>> user and bucket removal, including purging the user's data. Other
>> deferred-action jobs that we could add later would be, for example, rgw
>> data[structures] scrub, orphan search, and bucket resharding.
>>
>> Such a service requires adding a new index that would hold job info
>> (indexed both by job id and by the timestamp at which the job should be
>> executed). It will also require adding new admin commands and apis to
>> control it. This is relatively easy; however, things get much more
>> complicated if we need it to work correctly in a multi-site
>> environment. For example, a user data purge in a multi-site
>> environment that is scheduled on the master zone needs to propagate to
>> all other zones. For the sake of simplicity, I think that scheduling
>> of jobs, unless targeted at a specific zone, needs to happen on the
>> master zone (of the master zonegroup).
>>
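
(For illustration, a rough sketch of how such a jobs index entry might be
keyed, assuming a hypothetical omap-backed index object; the object name,
key prefixes and record fields below are made up, not actual rgw
structures:)

  import json
  import uuid

  import rados

  JOBS_INDEX_OBJ = 'jobs.index'   # hypothetical per-zone index object

  def schedule_job(ioctx, job_type, target, execute_at):
      """Register a job under two omap keys: one ordered by execution
      time (for the scheduler scan) and one by job id (for lookups and
      cancellation)."""
      job_id = str(uuid.uuid4())
      record = json.dumps({
          'id': job_id,
          'type': job_type,          # e.g. 'user-purge', 'bucket-purge'
          'target': target,
          'execute_at': execute_at,  # unix timestamp
          'state': 'scheduled',
      }).encode()

      time_key = '1_%020d_%s' % (execute_at, job_id)  # timestamp-ordered
      id_key = '2_%s' % job_id                         # lookup by id

      with rados.WriteOpCtx() as op:
          ioctx.set_omap(op, (time_key, id_key), (record, record))
          ioctx.operate_write_op(op, JOBS_INDEX_OBJ)
      return job_id
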
>> Let's break this task into separate problems:
>>
>> 1. Propagate job scheduling
>>
>> I suggest we add a new metadata section for scheduled jobs. Any new
>> job that is created will have a new metadata object created to
>> represent it. The creation of that object will also register the job in
>> the local jobs index.
>> A metadata object will be synced from the master zone to all other zones, thus:
>>  - jobs will propagate to other zones
>>  - old jobs that were already scheduled before a zone was created
>> will also be synced (via full sync)
>>
>> Once a job is complete, its metadata entry will be removed.
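
(Continuing the sketch above, the meta-object-plus-local-index flow might
look roughly like this; the object names are again illustrative, and the
job metadata object is what meta sync would replicate:)

  def create_job_meta(ioctx, job_id, record):
      """Write the job metadata object (which metadata sync would then
      replicate to the other zones) and register it in the local jobs
      index (JOBS_INDEX_OBJ from the sketch above)."""
      ioctx.write_full('meta.job.%s' % job_id, record)
      with rados.WriteOpCtx() as op:
          ioctx.set_omap(op, ('2_%s' % job_id,), (record,))
          ioctx.operate_write_op(op, JOBS_INDEX_OBJ)

  def complete_job(ioctx, job_id):
      """Once the job finishes, remove its metadata object; the removal
      itself propagates to the other zones via metadata sync."""
      ioctx.remove_object('meta.job.%s' % job_id)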
>>
>> A potential race condition exists if a job is created and executed
>> (and thus removed) on the master before it has been synced to the remote
>> zones. A solution would be to use the rados refcount on the
>> job meta object. We'll hold an extra reference to the object as long
>> as it's being referenced by the metadata log. Once that log trims the
>> specific entry, it means that all other zones have already synced that
>> object, thus it can be removed.
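
(A sketch of the refcount idea; 'refcount_op' below is only a stand-in for
the cls_refcount get/put object-class calls rgw already uses for tail
objects, since those aren't exposed directly through the plain librados
bindings:)

  def hold_job_ref(ioctx, job_id, tag):
      # Extra reference held on behalf of the metadata log entry, so the
      # meta object survives even if the job completes on the master first.
      refcount_op(ioctx, 'meta.job.%s' % job_id, 'get', tag)

  def on_mdlog_trim(ioctx, job_id, tag):
      # The mdlog only trims an entry once every zone has synced it, so
      # it's now safe to drop the extra reference; the object goes away
      # when the last reference does.
      refcount_op(ioctx, 'meta.job.%s' % job_id, 'put', tag)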
>>
>> A job cancellation will be done by modifying a property in the job
>> meta object, and removing the job from the local jobs index. The
>> cancellation will spread to all other zones via meta sync.
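
(Roughly, in the same sketch, cancellation could look like this:)

  def cancel_job(ioctx, job_id):
      """Flip the job state in the meta object (metadata sync spreads the
      change to the other zones) and drop the local index entries."""
      meta_oid = 'meta.job.%s' % job_id
      rec = json.loads(ioctx.read(meta_oid, length=1 << 20))
      rec['state'] = 'cancelled'
      ioctx.write_full(meta_oid, json.dumps(rec).encode())
      keys = ('1_%020d_%s' % (rec['execute_at'], job_id), '2_%s' % job_id)
      with rados.WriteOpCtx() as op:
          ioctx.remove_omap_keys(op, keys)
          ioctx.operate_write_op(op, JOBS_INDEX_OBJ)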
>>
>> 2. Job execution
>>
>> Similar to other rgw deferred work threads (such as gc, lifecycle, and
>> resharding), the jobs thread will use rados object locks to coordinate
>> execution with other radosgw processes in the same zone. The thread
>> will then iterate over the jobs index, fetch the jobs that are
>> currently due, and start executing them.
>>
>> A job can be cancelled up until it is executed; we'll probably want the
>> master to lead here (so that we can know whether a job was cancelled
>> successfully or not). We can add another flag that the master will set
>> on the job meta object to signify that a job has started
>> executing (thus allowing all other zones to start executing the job).
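
(A rough sketch of one pass of such a jobs thread, using the librados
exclusive-lock call for coordination and continuing the index sketch
above; 'execute_job' is hypothetical and error handling is elided:)

  import time

  def run_pending_jobs(ioctx):
      cookie = 'jobs-worker'
      try:
          # Only one radosgw in the zone gets to run jobs at a time.
          ioctx.lock_exclusive(JOBS_INDEX_OBJ, 'jobs_lock', cookie,
                               desc='job scheduler', duration=60)
      except rados.ObjectBusy:
          return  # another radosgw currently holds the lock
      try:
          now = int(time.time())
          with rados.ReadOpCtx() as op:
              it, _ = ioctx.get_omap_vals(op, '', '1_', 1000)
              ioctx.operate_read_op(op, JOBS_INDEX_OBJ)
              for key, val in it:
                  # Keys are '1_<timestamp>_<job id>', so they come back
                  # in execution-time order.
                  if int(key.split('_')[1]) > now:
                      break
                  execute_job(json.loads(val))  # hypothetical executor
      finally:
          ioctx.unlock(JOBS_INDEX_OBJ, 'jobs_lock', cookie)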
>>
>> The problem in multi-site is that operations that happen on the master
>> affect the other zones. As an example, let's take a user removal job
>> that is scheduled on the master. A job metadata object is generated on
>> the master, which triggers the creation of an entry in the jobs index
>> at the master zone. The job metadata object is synced to other zones,
>> and it generates jobs entries in the respective jobs indexes of each
>> zone. Then, once the time comes for the job to execute, it starts executing
>> on the master and then on all other zones. This can lead to a case
>> where the master already removed its local data and metadata for that
>> user, while other zones still haven't. A user metadata object removal
>> at the master zone will then trigger a sync operation of this object's
>> removal (as well as removal of all the bucket info metadata objects
>> for all the buckets that the user had) that will be sent to all other
>> zones. This can lead to a case where we'd lose the metadata objects
>> for the user and for all its buckets in a zone that hasn't removed the
>> actual user's data. This is a problem since at that point rgw will not
>> be able to figure out what data needs to be removed.
>> One way to deal with it is by modifying the metadata sync to not
>> remove the user and bucket instance meta objects directly, but rather
>> create a local job to do that (that will also include data purging),
>> or just defer to an existing job (if possible). My issue with this
>> solution is that it complicates the state of the metadata object
>> sync.
>> Maybe instead of deferring the removal of the user/bucket metadata
>> objects, we could generate mutated copies of them (under different
>> names) that will later be used for the data purge.
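
(Something along these lines, i.e. the sync path stashes a renamed copy
for the local purge job instead of applying the removal directly; names
are again made up:)

  def stash_purge_copy(ioctx, section, name):
      """Keep a renamed copy of the user/bucket-instance metadata object
      so the local purge job can still find what it needs to delete,
      then apply the removal that came in over metadata sync."""
      src = 'meta.%s.%s' % (section, name)
      data = ioctx.read(src, length=1 << 20)
      ioctx.write_full('purge.%s.%s' % (section, name), data)
      ioctx.remove_object(src)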
>>
>> Any thoughts?
>
> You've got a very specific need you're trying to fulfill, and your
> solution sounds very general. This is good, but it's still deeply tied
> in to RGW itself.
>
> Did you consider doing something built more deeply into the common
> Ceph/RADOS components and consumed by others? Or integrating some
> existing job scheduler into a Ceph environment? (I don't know about
> the specifics, but you could for instance run beanstalk backed by an
> RBD volume or something.)

The idea of a generic system that can use an actual message broker
sounds great. I would be wary of even considering beanstalk, though; if
we go down that route, let's please use a robust/scalable solution like
RabbitMQ.
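
For example, with RabbitMQ a deferred job could be scheduled by publishing
into a dead-lettered "wait" queue with a per-message TTL (a rough pika
sketch; the queue names and job payload are made up):

  import json

  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
  ch = conn.channel()

  # Queue the workers actually consume from.
  ch.queue_declare(queue='rgw-jobs', durable=True)
  # Parking queue: expired messages get dead-lettered into rgw-jobs.
  ch.queue_declare(queue='rgw-jobs-wait', durable=True, arguments={
      'x-dead-letter-exchange': '',
      'x-dead-letter-routing-key': 'rgw-jobs',
  })

  def schedule(job, delay_seconds):
      """Publish a job that only becomes visible to workers after the delay."""
      ch.basic_publish(
          exchange='',
          routing_key='rgw-jobs-wait',
          body=json.dumps(job),
          properties=pika.BasicProperties(
              delivery_mode=2,                       # persistent
              expiration=str(delay_seconds * 1000),  # per-message TTL, in ms
          ))

  # e.g. purge a user a week from now
  schedule({'type': 'user-purge', 'uid': 'someuser'}, 7 * 24 * 3600)

(The usual caveat applies that per-message TTLs only expire at the head of
the queue, so in practice you'd use one wait queue per delay class or the
delayed-message exchange plugin.)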

>
> That may just be ridiculous based on the scope of work here but a more
> purpose-built tool could be a lot more efficient, and see a lot more
> development, than shoving more data into omap and having another
> thread each RGW runs that periodically polls things?

This is a great point too. A message broker handles all of this with
ease (no need for polling), and can accommodate all kinds of rules
(expiration, scheduling, etc.), permissions, and propagation of jobs.
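
(The consumer side would then be push-based rather than polled, e.g. a
minimal worker along these lines, with 'execute_job' again hypothetical:)

  import json

  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
  ch = conn.channel()
  ch.queue_declare(queue='rgw-jobs', durable=True)

  def on_job(channel, method, properties, body):
      execute_job(json.loads(body))                  # hypothetical executor
      channel.basic_ack(delivery_tag=method.delivery_tag)

  ch.basic_qos(prefetch_count=1)                     # one job at a time per worker
  ch.basic_consume(queue='rgw-jobs', on_message_callback=on_job)
  ch.start_consuming()                               # broker pushes jobs as they come due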

> Or maybe I'm overestimating how many jobs you expect this to handle
> and a simple built-to-purpose approach is the right answer.
> -Greg