Re: rgw job scheduler

Casey Bodley <cbodley@xxxxxxxxxx> · Wed, 7 Mar 2018 10:41:58 -0500

On 03/06/2018 08:04 PM, Yehuda Sadeh-Weinraub wrote:
I've been looking recently at the creation of a kubernetes service
broker for rgw. The idea is to be able to provision a bucket and/or a
user so that it could be used by apps inside kubernetes through the
service controller. What's needed from rgw is the ability to
create a user/bucket and to tear down everything through the broker.
While the broker can just leverage the existing s3/swift and admin
apis that rgw provides in order to achieve all that, it would be a
very cumbersome and fragile solution. Implementing things like data
retention (e.g., remove the data, but only after a week so that we can
cancel the operation if needed) is pretty complicated with the current
framework, and that's probably the wrong place to do it anyway.
In addition, the general issue of a user tear-down has been brought up
by users many times in the past, and there hasn't been a sufficient
solution.
I propose we introduce a new job scheduler that will provide this
kind of service. At first it will allow scheduling an immediate or
future removal of a user and its buckets, including a purge of the
user's data. Other deferred action jobs that we could add include, for
example, rgw data[structures] scrub, orphan search, and bucket resharding.

Such a service requires adding a new index that would hold job info
(indexed by both job id and the timestamp of when the job should be
executed). It will also require adding new admin commands and apis to
control it. This is relatively easy; however, things get much more
complicated if we need it to work correctly in a multi-site
environment. For example, a user data purge in a multi-site
environment that is scheduled on the master zone needs to propagate to
all other zones. For the sake of simplicity, I think that scheduling
of jobs, unless targeted at a specific zone, needs to happen on the
master zone (of the master zonegroup).
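To make the index idea concrete, here is a minimal in-memory sketch of a jobs index that can be looked up by job id and scanned by execution time. It is only an illustration: the class and method names (JobsIndex, schedule, cancel, due) are hypothetical, and a real index would live in rados rather than in process memory.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class JobsIndex:
    """Hypothetical in-memory stand-in for the proposed jobs index."""
    by_id: dict = field(default_factory=dict)    # job_id -> (run_at, job_type)
    by_time: list = field(default_factory=list)  # min-heap of (run_at, job_id)

    def schedule(self, job_id, run_at, job_type):
        # Register the job under both keys: its id and its execution time.
        self.by_id[job_id] = (run_at, job_type)
        heapq.heappush(self.by_time, (run_at, job_id))

    def cancel(self, job_id):
        # Lazy deletion: the stale heap entry is skipped later in due().
        return self.by_id.pop(job_id, None) is not None

    def due(self, now):
        """Pop and return the ids of jobs with run_at <= now, oldest first."""
        ready = []
        while self.by_time and self.by_time[0][0] <= now:
            run_at, job_id = heapq.heappop(self.by_time)
            entry = self.by_id.get(job_id)
            # Skip entries that were cancelled or rescheduled in the meantime.
            if entry is not None and entry[0] == run_at:
                del self.by_id[job_id]
                ready.append(job_id)
        return ready
```

The timestamp ordering is what lets a worker fetch only the jobs that are current, while the id lookup is what the admin commands would use to inspect or cancel a job.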

Let's split this task into separate problems:

1. Propagate job scheduling

I suggest we add a new metadata section for scheduled jobs. Any new
job that is created will have a new metadata object created to
represent it. Creation of that object will also register the job in
the local jobs index.
A metadata object will be synced from the master zone to all other zones, thus:
  - jobs will propagate to other zones
  - old jobs that are already scheduled and were created before a zone
was created will also be synced (via full sync)

Once a job is complete, its metadata entry will be removed.

A potential race condition exists if a job is created and executed
(and thus removed) on the master before it gets synced to the remote
zone. A solution would be to use the rados refcount on the
job meta object. We'll hold an extra reference to the object as long
as it's being referenced by the metadata log. Once that log trims the
specific entry, it means that all other zones have already synced that
object, thus it can be removed.
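The refcount idea can be sketched as follows. This is a toy model under stated assumptions, not the rados refcount API: the JobMetaStore class and its methods are hypothetical, and the count is kept in a plain dict. Each job meta object starts with two references, one for the job itself and one for its metadata log entry, and is only removed once both are dropped.

```python
class JobMetaStore:
    """Hypothetical model of refcounted job meta objects."""

    def __init__(self):
        self.refs = {}  # job_id -> reference count

    def create(self, job_id):
        # One reference for the job itself, one for its metadata log entry.
        self.refs[job_id] = 2

    def _put(self, job_id):
        # Drop one reference; remove the object when the last one is gone.
        self.refs[job_id] -= 1
        if self.refs[job_id] == 0:
            del self.refs[job_id]

    def job_completed(self, job_id):
        self._put(job_id)

    def mdlog_trimmed(self, job_id):
        self._put(job_id)

    def exists(self, job_id):
        return job_id in self.refs
```

With this scheme a job that completes on the master before the log is trimmed still leaves its meta object in place for the remote zones to sync.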

A job cancellation will be done by modifying a property in the job
meta object, and removing the job from the local jobs index. The
cancellation will spread to all other zones via meta sync.

2. Job execution

Similar to other rgw deferred work threads (such as gc, lifecycle, and
resharding), the jobs thread will use rados object locks to coordinate
execution with other radosgw processes in the same zone. The thread
then will iterate over the jobs index and fetch the jobs that are
current and start executing these.
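A rough sketch of one pass of such a jobs thread, assuming a non-blocking threading.Lock as a stand-in for the rados object lock and a plain dict as the jobs index. The function name run_due_jobs and its parameters are hypothetical.

```python
import threading


def run_due_jobs(lock, jobs, now, execute):
    """Run one pass over the jobs index.

    jobs maps job_id -> run_at. Returns the list of executed job ids,
    or None if another worker currently holds the lock.
    """
    if not lock.acquire(blocking=False):
        return None  # another radosgw process is handling jobs; skip
    try:
        # Collect the jobs that are current, oldest first.
        due = sorted(
            (run_at, job_id) for job_id, run_at in jobs.items() if run_at <= now
        )
        done = []
        for _, job_id in due:
            execute(job_id)
            del jobs[job_id]
            done.append(job_id)
        return done
    finally:
        lock.release()
```

A worker that fails to take the lock simply skips the pass and tries again on its next wakeup, which is roughly how the other deferred work threads avoid duplicating effort.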

A job can be cancelled until it is executed; we'll probably want the
master to lead here (so that we can know whether a job was cancelled
successfully or not). We can add another flag that the master will set
on the job meta object, which will signify that a job has started
executing (thus allowing all other zones to start executing the job).
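The interaction between cancellation and the proposed flag could look roughly like this. The JobMeta fields and the three helper functions are hypothetical names used for illustration; the assumption is that only the master ever calls try_cancel and try_start, and that other zones only read the synced object.

```python
from dataclasses import dataclass


@dataclass
class JobMeta:
    """Hypothetical job meta object with the two flags discussed above."""
    cancelled: bool = False
    started: bool = False


def try_cancel(job):
    # Master only: cancellation succeeds only if execution has not started.
    if job.started:
        return False
    job.cancelled = True
    return True


def try_start(job):
    # Master only: mark the job as started unless it was cancelled.
    if job.cancelled:
        return False
    job.started = True
    return True


def may_execute_on_secondary(job):
    # Other zones wait until the master's 'started' flag has synced.
    return job.started and not job.cancelled
```

Because a single zone decides both transitions, the caller of try_cancel gets a definite answer about whether the cancellation took effect.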

The problem in multi-site is that operations that happen on the master
affect the other zones. As an example, let's take a user removal job
that is scheduled on the master. A job metadata object is generated on
the master, which triggers the creation of an entry in the jobs index
at the master zone. The job metadata object is synced to other zones,
and it generates jobs entries in the respective jobs indexes of each
zone. Then, once the time comes for that job to execute, it starts executing
on the master and then on all other zones. This can lead to a case
where the master already removed its local data and metadata for that
user, while other zones still haven't. A user metadata object removal
at the master zone will then trigger a sync operation of this object's
removal (as well as removal of all the bucket info metadata objects
for all the buckets that the user had) that will be sent to all other
zones. This can lead to a case where we'd lose the metadata objects
for the user and for all its buckets in a zone that hasn't removed the
actual user's data. This is a problem since at that point rgw will not
be able to figure out what data needs to be removed.
One way to deal with it is by modifying the metadata sync to not
remove the user and bucket instance meta objects directly, but rather
create a local job to do that (that will also include data purging),
or just defer to an existing job (if possible). My issue with this
solution is that it complicates the state of the metadata objects
sync.
Maybe instead of deferring the removal of the user/bucket metadata
objects, we could generate a mutated copy of them (using a different
name) that will later be used for the data purge.
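As a sketch of that last idea, assuming the zone's metadata is modelled as a plain dict and that the copy is stored under a "purge." prefix (both the prefix and the function name defer_removal are made up for illustration):

```python
def defer_removal(meta, key, prefix="purge."):
    """Handle a synced removal by renaming the entry instead of dropping it.

    meta is a dict standing in for the zone's metadata store. Returns the
    new key of the retained copy, or None if the entry did not exist.
    """
    entry = meta.pop(key, None)
    if entry is None:
        return None
    purge_key = prefix + key
    # Keep the copy under a different name so a later purge job can still
    # find out which data belonged to the removed user or bucket.
    meta[purge_key] = entry
    return purge_key
```

The original name disappears as meta sync expects, while the purge job later reads the renamed copy and deletes it once the data is gone.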

Any thoughts?

Yehuda

Hi Yehuda,

If we're only looking at the removal of users and their buckets/data, I 
think the simplest model would be one where jobs only execute on the 
master. Removal of the user and bucket metadata generates all of the 
mdlog entries necessary for other zones to do the same via the existing 
meta sync. The only missing piece there is removing the objects in those 
deleted buckets, which is an issue that we already face with bucket 
deletion in multisite (http://tracker.ceph.com/issues/20802). We still 
need to distribute the job metadata to all zones so we can handle 
failover to a different metadata master, but wouldn't need this extra 
'started executing' state on the job or the renaming strategy to keep 
user/bucket metadata around after deletion.

Of the other use cases listed, it sounds like the ones for scrub and 
orphan search should be zone-local only, and not require any interaction 
with multisite sync. And bucket resharding stands out as a harder 
problem, and I'm not sure there's much utility in scheduling a deferred 
bucket reshard - especially if dynamic resharding (once it works in 
multisite) could happen at any point before that deferred job starts.

Are there other use cases that do require multisite coordination, but 
don't fit a model where only the master needs to execute the jobs?

Casey