Re: rgw job scheduler

On Wed, Mar 7, 2018 at 7:41 AM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> On 03/06/2018 08:04 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> I've been looking recently at the creation of a kubernetes service
>> broker for rgw. The idea is to be able to provision a bucket and/or a
>> user so that it could be used by apps inside kubernetes through the
>> service controller. All that's needed from rgw is the ability to
>> create a user/bucket and to tear everything down through the broker.
>> While the broker can just leverage the existing s3/swift and admin
>> apis that rgw provides in order to achieve all that, it would be a
>> very cumbersome and fragile solution. Implementing things like data
>> retention (e.g., remove the data, but only after a week so that we can
>> cancel the operation if needed) is pretty complicated with the current
>> framework, and that's probably the wrong place to do it anyway.
>> In addition, the general issue of a user tear-down has been brought up
>> by users many times in the past, and there hasn't been a satisfactory
>> solution.
>> I propose we introduce a new job scheduler that will provide this
>> kind of service. At first it will allow scheduling an immediate or
>> future removal of users and buckets, including a purge of the user's
>> data. Other deferred action jobs that we could add later include rgw
>> data[structures] scrub, orphan search, and bucket resharding.
>>
>> Such a service requires adding a new index that would hold job info
>> (indexed both by job id and by the timestamp of when the job should be
>> executed). It will also require adding new admin commands and APIs to
>> control it. This is relatively easy; however, things get much more
>> complicated if we need it to work correctly in a multi-site
>> environment. For example, a user data purge in a multi-site
>> environment that is scheduled on the master zone needs to propagate to
>> all other zones. For the sake of simplicity, I think that scheduling
>> of jobs, unless targeted at a specific zone, needs to happen on the
>> master zone (of the master zonegroup).
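
To make the index idea concrete, here's a rough standalone sketch
(hypothetical names, not actual rgw code) of a key scheme that supports
both lookup by job id and a time-ordered scan of due jobs:

  // Rough sketch of a jobs-index key scheme: one omap entry keyed by job
  // id for lookup/cancellation, plus one keyed by a zero-padded timestamp
  // so that a lexicographic scan returns jobs in execution order.
  // All names here are hypothetical.
  #include <cstdint>
  #include <cstdio>
  #include <string>

  struct job_info {
    std::string job_id;       // unique id for the job
    std::string job_type;     // e.g. "user-purge", "bucket-purge"
    std::string target;       // the user or bucket the job acts on
    uint64_t execute_at = 0;  // unix time at which the job becomes due
  };

  // key used to find the job by id (e.g. when cancelling it)
  static std::string id_key(const job_info& j) {
    return "id." + j.job_id;
  }

  // key used to list due jobs; zero padding makes lexicographic omap
  // ordering match chronological ordering
  static std::string time_key(const job_info& j) {
    char buf[80];
    snprintf(buf, sizeof(buf), "time.%016llu.%s",
             (unsigned long long)j.execute_at, j.job_id.c_str());
    return buf;
  }

  int main() {
    job_info j{"job-0001", "user-purge", "user123", 1520500000};
    printf("%s\n%s\n", id_key(j).c_str(), time_key(j).c_str());
  }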
>>
>> Let's break this task down into separate problems:
>>
>> 1. Propagate job scheduling
>>
>> I suggest we add a new metadata section for scheduled jobs. Any new
>> job that is created will have a new metadata object created to
>> represent it. The creation of that object will also register the job in
>> the local jobs index.
>> A metadata object will be synced from the master zone to all other zones,
>> thus:
>>   - jobs will propagate to other zones
>>   - old jobs that are already scheduled and were created before a zone
>> was created will also be synced (via full sync)
>>
>> Once a job is complete, its metadata entry will be removed.
>>
>> A potential race condition exists if a job is created and executed
>> (and thus removed) on the master before it has synced to the remote
>> zone. A solution would be to use the rados refcount on the
>> job meta object. We'll hold an extra reference to the object as long
>> as it's being referenced by the metadata log. Once that log trims the
>> specific entry, it means that all other zones have already synced that
>> object, thus it can be removed.
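
To illustrate the lifetime rule, here is a toy model (made-up names, an
in-memory counter standing in for the rados refcount machinery the real
implementation would presumably use):

  // Minimal model: the job meta object holds one reference for the job
  // itself and one extra reference on behalf of the metadata log entry
  // that points at it. The object is only really removed once the job has
  // completed AND the mdlog has trimmed its entry (i.e. every other zone
  // has already synced the object).
  #include <cstdio>

  struct job_meta_object {
    int refs = 0;
    bool removed = false;

    void get() { ++refs; }
    void put() {
      if (--refs == 0) {
        removed = true;   // safe: nothing references the object anymore
        printf("job meta object removed\n");
      }
    }
  };

  int main() {
    job_meta_object meta;
    meta.get();   // reference held on behalf of the scheduled job
    meta.get();   // extra reference held on behalf of the mdlog entry

    meta.put();   // job completes and is removed on the master...
    // ...but the object survives until the mdlog trims the entry, which
    // only happens after every other zone has synced it.
    meta.put();   // mdlog trim drops the last reference
  }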
>>
>> A job cancellation will be done by modifying a property in the job
>> meta object, and removing the job from the local jobs index. The
>> cancellation will spread to all other zones via meta sync.
>>
>> 2. Job execution
>>
>> Similar to other rgw deferred work threads (such as gc, lifecycle, and
>> resharding), the jobs thread will use rados object locks to coordinate
>> execution with other radosgw processes in the same zone. The thread
>> will then iterate over the jobs index, fetch the jobs that are
>> currently due, and start executing them.
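
As a rough librados sketch of that coordination (made-up pool, object,
and lock names, not rgw code; the real jobs thread would renew the lock
and handle errors properly):

  // Grab an exclusive lock on a control object so only one radosgw in the
  // zone runs the pass, then list entries from the jobs index omap.
  #include <rados/librados.hpp>
  #include <sys/time.h>
  #include <iostream>
  #include <map>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                // connect as client.admin
    cluster.conf_read_file(nullptr);      // read the default ceph.conf
    if (cluster.connect() < 0) {
      std::cerr << "failed to connect to cluster" << std::endl;
      return 1;
    }

    librados::IoCtx ioctx;
    cluster.ioctx_create("default.rgw.log", ioctx);  // hypothetical pool

    // try to become this zone's job runner for the next 60 seconds
    struct timeval duration = {60, 0};
    int r = ioctx.lock_exclusive("jobs.control", "job_runner", "cookie",
                                 "rgw job scheduler pass", &duration, 0);
    if (r < 0) {
      std::cout << "another radosgw holds the lock, skipping this pass\n";
      return 0;
    }

    // list entries from the (hypothetical) jobs index object; with
    // time-ordered keys the earliest scheduled jobs come back first
    std::map<std::string, librados::bufferlist> entries;
    ioctx.omap_get_vals("jobs.index.0", "", 1000, &entries);
    for (const auto& entry : entries) {
      std::cout << "job index entry: " << entry.first << std::endl;
      // real code would decode the entry, compare execute_at with the
      // current time, and kick off execution for jobs that are due
    }

    ioctx.unlock("jobs.control", "job_runner", "cookie");
    return 0;
  }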
>>
>> A job can be cancelled until it starts executing; we'll probably want the
>> master to lead here (so that we can know whether a job was cancelled
>> successfully or not). We can add another flag that the master will set
>> on the job meta object, which will signify that a job has started
>> executing (thus allowing all other zones to start executing the job).
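
A tiny model of the states implied here (hypothetical names, just to
make the cancellation window explicit):

  // A job can only be cancelled while it is still 'scheduled'; only the
  // master flips it to 'started' (the flag other zones wait for before
  // they begin executing the job themselves).
  #include <cstdio>

  enum class job_state { scheduled, started, complete, cancelled };

  struct job {
    job_state state = job_state::scheduled;

    // only meaningful on the master; other zones learn of the transition
    // when the updated job meta object syncs to them
    bool mark_started(bool is_master) {
      if (!is_master || state != job_state::scheduled)
        return false;
      state = job_state::started;
      return true;
    }

    bool cancel() {
      if (state != job_state::scheduled)
        return false;   // too late, the job already started
      state = job_state::cancelled;
      return true;
    }
  };

  int main() {
    job j;
    printf("cancel before start: %d\n", j.cancel());   // succeeds
    job k;
    k.mark_started(/*is_master=*/true);
    printf("cancel after start: %d\n", k.cancel());    // refused
  }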
>>
>> The problem in multi-site is that operations that happen on the master
>> affect the other zones. As an example, let's take a user removal job
>> that is scheduled on the master. A job metadata object is generated on
>> the master, which triggers the creation of an entry in the jobs index
>> at the master zone. The job metadata object is synced to other zones,
>> and it generates jobs entries in the respective jobs indexes of each
>> zone. Then, once it is time for that job to execute, it starts executing
>> on the master and then on all other zones. This can lead to a case
>> where the master already removed its local data and metadata for that
>> user, while other zones still haven't. A user metadata object removal
>> at the master zone will then trigger a sync operation of this object's
>> removal (as well as removal of all the bucket info metadata objects
>> for all the buckets that the user had) that will be sent to all other
>> zones. This can lead to a case where we'd lose the metadata objects
>> for the user and for all its buckets in a zone that hasn't removed the
>> actual user's data. This is a problem since at that point rgw will not
>> be able to figure out what data needs to be removed.
>> One way to deal with it is by modifying the metadata sync to not
>> remove the user and bucket instance meta objects directly, but rather
>> create a local job to do that (that will also include data purging),
>> or just defer to an existing job (if possible). My issue with this
>> solution is that it complicates the state of the metadata objects
>> sync.
>> Maybe instead of deferring the removal of the user/bucket metadata
>> object, generate a mutated copy of it (using a different name) that
>> will later be used for data purge.
>>
>> Any thoughts?
>>
>> Yehuda
>
>
> Hi Yehuda,
>
> If we're only looking at the removal of users and their buckets/data, I
> think the simplest model would be one where jobs only execute on the master.
> Removal of the user and bucket metadata generates all of the mdlog entries
> necessary for other zones to do the same via the existing meta sync. The
> only missing piece there is removing the objects in those deleted buckets,
> which is an issue that we already face with bucket deletion in multisite
> (http://tracker.ceph.com/issues/20802). We still need to distribute the job
> metadata to all zones so we can handle failover to a different metadata
> master, but wouldn't need this extra 'started executing' state on the job or
> the renaming strategy to keep user/bucket metadata around after deletion.
>

My worry here is that if we rely solely on mdlog entries we lose the
higher-level job information, and it'll be too easy to lose a user's
data accidentally (e.g., removal of the meta object on the master will
immediately trigger a purge of the user's data on all the other
zones). It may be the simplest way to get it working, but I'm not sure
I want to take this path. I'd rather have something explicit that
triggers the job.
One option might be that when a user's (or bucket's) metadata object
is removed, we check whether there's any pending job that corresponds
to that specific entry, and if so allow it to run. I don't really like
that, though.
Maybe the following (a variant of the previous suggestion; a rough
sketch of the copy step appears after the outline):

on metadata object deletion:
 - copy meta object to a different name (e.g., same pool, same name,
different namespace)
 - provide a mechanism to access the copy of the metadata objects
(shouldn't be that hard)
 - remove original meta object

on bucket/user purge:
 - try to access entity's metadata object
  - if not found, try using the removed copy
 - purge everything

periodically:
 - trim old objects in the removed meta pool (this either needs
indexing, or we can additionally create, in a second namespace,
'symlinks' that embed a timestamp in their names)
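
Here's a rough librados illustration of the copy step (not rgw code;
pool, namespace, and object names are made up, and real rgw metadata
objects would be copied with their versions and attrs):

  // Copy the metadata object under the same name into a 'removed'
  // namespace, then remove the original from the live namespace.
  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");
    cluster.conf_read_file(nullptr);
    if (cluster.connect() < 0)
      return 1;

    librados::IoCtx ioctx;
    cluster.ioctx_create("default.rgw.meta", ioctx);  // hypothetical pool

    const std::string oid = "user123";  // hypothetical user meta object

    // read the original metadata object (up to 4 MiB, plenty for a small
    // metadata object in this sketch)
    librados::bufferlist bl;
    if (ioctx.read(oid, bl, 4 << 20, 0) < 0) {
      std::cerr << "meta object not found" << std::endl;
      return 1;
    }

    // write a copy under the same name into a 'removed' namespace, so a
    // later purge job can still recover the list of things to delete
    ioctx.set_namespace("removed");
    ioctx.write_full(oid, bl);

    // now it's safe to drop the original from the live (default) namespace
    ioctx.set_namespace("");
    ioctx.remove(oid);
    return 0;
  }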


> Of the other use cases listed, it sounds like the ones for scrub and orphan
> search should be zone-local only, and not require any interaction with
> multisite sync. And bucket resharding stands out as a harder problem, and
> I'm not sure there's much utility in scheduling a deferred bucket reshard -
> especially if dynamic resharding (once it works in multisite) could happen
> at any point before that deferred job starts.

Dynamic resharding is actually a good example of why coordination
might be needed. Such a system could be used to fix dynamic resharding
in multisite (a job would be scheduled, and the master would then
coordinate its execution).

>
> Are there other use cases that do require multisite coordination, but don't
> fit a model where only the master needs to execute the jobs?

In general, scheduling some kind of metadata mutation. E.g., suspend a
user at a specific time (because the user didn't pay their monthly bill).
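
For illustration, a generic metadata-mutation job record could look
something like this (all names hypothetical):

  // The op and its arguments ride in the job entry and are applied when
  // the scheduled time arrives.
  #include <cstdint>
  #include <string>

  struct metadata_mutation_job {
    std::string job_id;
    uint64_t execute_at;      // unix time at which to apply the mutation
    std::string op;           // e.g. "suspend-user"
    std::string target_user;  // e.g. "user123"
  };

  int main() {
    // suspend user123 thirty days after a placeholder "now"
    metadata_mutation_job job{"job-0042", 1520500000ULL + 30 * 24 * 3600,
                              "suspend-user", "user123"};
    (void)job;  // a real scheduler would persist this in the jobs index
  }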

Yehuda


