I've been looking recently at the creation of a kubernetes service broker for rgw. The idea is to be able to provision a bucket and/or a user so that it could be used by apps inside kubernetes through the service controller. What's needed from rgw is the ability to create a user/bucket and to tear everything down through the broker. While the broker could just leverage the existing s3/swift and admin apis that rgw provides in order to achieve all that, it would be a very cumbersome and fragile solution. Implementing things like data retention (e.g., remove the data, but only after a week so that we can cancel the operation if needed) is pretty complicated with the current framework, and that's probably the wrong place to do it anyway. In addition, the general issue of user tear-down has been brought up by users many times in the past, and there hasn't been a sufficient solution.

I propose we introduce a new job scheduler that will provide this kind of service. At first it will allow scheduling an immediate or future removal of users and buckets, including a purge of the user's data. Other deferred-action jobs that we could add later would be, for example, rgw data[structures] scrub, orphan search, and bucket resharding.

Such a service requires adding a new index that holds job info (indexed both by job id and by the timestamp at which the job should be executed). It will also require adding new admin commands and apis to control it. This is relatively easy; however, things get much more complicated if we need it to work correctly in a multi-site environment. For example, a user data purge in a multi-site environment that is scheduled on the master zone needs to propagate to all other zones. For the sake of simplicity, I think that scheduling of jobs, unless targeted at a specific zone, needs to happen on the master zone (of the master zonegroup).

Let's break this task into separate problems:

1. Propagate job scheduling

I suggest we add a new metadata section for scheduled jobs. Any new job that is created will have a new metadata object created to represent it. The creation of that object will also register the job in the local jobs index. A metadata object will be synced from the master zone to all other zones, thus:

- jobs will propagate to other zones
- old jobs that are already scheduled and were created before a zone was created will also be synced (via full sync)

Once a job is complete, its metadata entry will be removed. A potential race condition exists if a job is created and executed (and thus removed) on the master before it got synced to the remote zone. A solution to that would be to use the rados refcount on the job meta object. We'll hold an extra reference to the object as long as it's being referenced by the metadata log. Once that log trims the specific entry, it means that all other zones have already synced that object, so it can be removed.

A job cancellation will be done by modifying a property in the job meta object and removing the job from the local jobs index. The cancellation will spread to all other zones via meta sync.

2. Job execution

Similar to other rgw deferred work threads (such as gc, lifecycle, and resharding), the jobs thread will use rados object locks to coordinate execution with other radosgw processes in the same zone. The thread will then iterate over the jobs index, fetch the jobs that are currently due, and start executing them.
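To make the dual index a bit more concrete, here is a minimal sketch of what a job record and the two index key forms could look like, assuming the jobs index is kept as omap entries on a rados object (all names below are hypothetical, not existing rgw code):

  // Hypothetical sketch of the proposed jobs index keying; not actual rgw code.
  #include <cstdint>
  #include <cstdio>
  #include <string>

  enum class JobType { UserPurge, BucketPurge, Scrub, OrphanSearch, Reshard };

  struct JobInfo {
    std::string id;          // unique job id (e.g. a uuid string)
    JobType type;
    uint64_t exec_time;      // epoch seconds at which the job becomes due
    bool cancelled = false;  // flipped via the job meta object, synced from master
  };

  // Key for lookup by job id (cancellation, status queries).
  std::string id_index_key(const JobInfo& job) {
    return "id." + job.id;
  }

  // Key for lookup by execution time: zero-padded so that a lexicographic
  // range scan from "sched." up to "sched.<now>" returns exactly the due jobs.
  std::string sched_index_key(const JobInfo& job) {
    char buf[80];
    snprintf(buf, sizeof(buf), "sched.%016llu.%s",
             (unsigned long long)job.exec_time, job.id.c_str());
    return buf;
  }

  int main() {
    JobInfo job{"job-0001", JobType::UserPurge, 1500000000};
    printf("%s\n%s\n", id_index_key(job).c_str(), sched_index_key(job).c_str());
    return 0;
  }

The "id." keys would mainly serve cancellation and status lookups, while the zero-padded "sched." keys would let the jobs thread pick up everything that is due with a single bounded range scan.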
A job can be cancelled up until it has been executed; we'll probably want the master to lead here (so that we can know whether a job was cancelled successfully or not). We can add another flag that the master will set on the job meta object, which will signify that the job has started executing (thus allowing all other zones to start executing the job); see the rough sketch in the P.S. below.

The problem in multi-site is that operations that happen on the master affect the other zones. As an example, let's take a user removal job that is scheduled on the master. A job metadata object is generated on the master, which triggers the creation of an entry in the jobs index at the master zone. The job metadata object is synced to the other zones, and it generates job entries in the respective jobs indexes of each zone. Then, once the time comes for that job to execute, it starts executing on the master and then on all other zones. This can lead to a case where the master has already removed its local data and metadata for that user, while other zones still haven't. A user metadata object removal at the master zone will then trigger a sync operation of this object's removal (as well as removal of all the bucket info metadata objects for all the buckets that the user had) that will be sent to all other zones. This can lead to a case where we'd lose the metadata objects for the user and for all its buckets in a zone that hasn't yet removed the actual user's data. This is a problem since at that point rgw will not be able to figure out what data needs to be removed.

One way to deal with it is by modifying the metadata sync to not remove the user and bucket instance meta objects directly, but rather create a local job to do that (one that will also include data purging), or just defer to an existing job (if possible). My issue with this solution is that it complicates the state of the metadata object sync. Maybe instead of deferring the removal of the user/bucket metadata object, we could generate a mutated copy of them (under a different name) that will later be used for the data purge.

Any thoughts?

Yehuda
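P.S. Here's a rough sketch of the per-zone decision for a due job that I had in mind above, assuming the 'cancelled' and 'started' flags live on the job meta object, are only ever set by the master zone, and reach the other zones via meta sync (again, hypothetical names, not actual rgw code):

  // Hypothetical sketch of how a zone decides what to do with a due job,
  // assuming master-led cancellation and a master-set 'started' flag.
  #include <cstdio>

  struct JobFlags {
    bool cancelled = false;  // master marked the job as cancelled
    bool started = false;    // master marked the job as executing
  };

  enum class Action { Drop, Wait, Execute };

  Action decide(const JobFlags& flags, bool is_master_zone, bool is_due) {
    if (flags.cancelled)
      return Action::Drop;      // remove the entry from the local jobs index
    if (!is_due)
      return Action::Wait;      // not yet scheduled to run
    if (is_master_zone)
      return Action::Execute;   // master sets 'started' and starts first
    // non-master zones wait until the master has flagged the job as started
    return flags.started ? Action::Execute : Action::Wait;
  }

  int main() {
    JobFlags flags;
    flags.started = true;
    Action a = decide(flags, /*is_master_zone=*/false, /*is_due=*/true);
    printf("action=%d\n", static_cast<int>(a));
    return 0;
  }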