Re: Ceph and management tools interaction proposal

On 19-02-2018 17:32, John Spray wrote:
On Mon, Feb 19, 2018 at 8:35 AM, Ricardo Dias <rdias@xxxxxxxx> wrote:
Hi,

I would like to start a discussion about the interaction between Ceph
and the various management/deployment tools (MTs from here on), like
kubernetes, salt, or ansible. (I couldn't find such a discussion in the
ceph-devel mailing list, so I'm sorry if this has already been
discussed.)

Since this project started looking at containerized environments using
a kubernetes/rook based solution for deploying and managing a Ceph
cluster, I believe that one of the main ideas is to make ceph-mgr able
to "talk" to k8s/rook in order to give Ceph the power to control its
own services, like adding/removing OSDs and other similar kinds of
operations.

                        +----------------------+
                        |  Cluster             |
+----------+            |                      |
|   k8s    |            |  +----------+        |
|          |<-----------+--|   Ceph   |        |
|          |            |  +----------+        |
|   rook   |----------->|                      |
+----------+            |  +----------------+  |
                        |  | other services |  |
                        |  +----------------+  |
                        +----------------------+

When I first saw this architecture proposal (in a presentation made by
Blaine Gardner), the first thing I noticed was that this architecture
could be generalized to use not only k8s/rook but also other MTs. For
instance, instead of the k8s/rook pair, we could use salt/deepsea or
ansible/ceph-ansible to perform the same deployment/management tasks.

                                 +----------------------+
                                 |  Cluster             |
+-------------------+            |                      |
|  k8s/salt/ansible |            |  +----------+        |
|                   |<-----------+--| Ceph     |        |
|                   |            |  +----------+        |
| rook/deepsea/     |----------->|                      |
|    ceph-ansible   |            |                      |
+-------------------+            |  +----------------+  |
                                 |  | other services |  |
                                 |  +----------------+  |
                                 +----------------------+

Until now, we have been using MTs in one direction only: the user uses
an MT to deploy/manage Ceph. But with the introduction of ceph-mgr, and
more recently the development of a WebUI dashboard, performing and
controlling deployment/management operations from ceph-mgr, or from any
module running inside ceph-mgr, requires that Ceph be able to
communicate with these MTs.

This communication between Ceph and MTs can be achieved using two
approaches:

a) MTs provide their own APIs (different for each kind of MT) and Ceph
implements support for talking to each of those APIs;

b) Ceph designs/controls an API schema (let's call it the Ceph-Mgt API)
that allows Ceph to perform deployment/management tasks, and each MT
provides an implementation of that API schema.

This is a very relevant topic; thank you for starting the discussion.

IMO approach b) would be the way to go for several reasons:

* Ceph would not depend on any particular MT kind/version/functionality
(either the MT provides a Ceph-Mgt API implementation or it doesn't)
* Development of the high-level deployment/management operations would
be faster and more reliable because only one API exists
* Deployment heuristics already implemented in MTs (like automatic
allocation of OSDs to disks) could be implemented inside the Ceph code
base and become centralized.

The cons I see for approach b) are:
* it's hard to design such an API schema beforehand (but that is why
I'm starting this discussion)
* if no MT provides an implementation of the API, ceph-mgr will not be
able to do anything (this is highly unlikely, since we can always
contribute to the MT projects, as we already do)

Now some concrete aspects of approach b):

Usually MTs already provide a (non-pure, for the purists) REST
interface that can be extended. I know that salt has one (it is what is
used for communication between openATTIC and DeepSea), and k8s has one
as well. I'm not aware of such a thing for ansible, but a simple
webserver could be placed in front of ansible to provide the same kind
of REST interface.
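
Just to make that concrete, here is a rough sketch of what driving Salt
through its REST interface (salt-api / rest_cherrypy) could look like;
the URL, credentials and eauth backend below are just placeholders:

# Rough sketch of calling salt-api (rest_cherrypy); all values are placeholders.
import requests

API = "https://salt-master:8000"

# Authenticate and obtain a token.
login = requests.post(API + "/login", json={
    "username": "admin",
    "password": "secret",
    "eauth": "pam",
}, verify=False)
token = login.json()["return"][0]["token"]

# Run a module function on all minions (equivalent to `salt '*' grains.items`).
result = requests.post(API + "/", headers={"X-Auth-Token": token}, json={
    "client": "local",
    "tgt": "*",
    "fun": "grains.items",
}, verify=False)
print(result.json()["return"][0])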

We definitely do need to define ways to plug new functionality into
diverse backends (i.e. things other than k8s), but I'm not sure the
(b) remote API is the right place to do it, vs. the (a) piece of local
code in Ceph.

Having a common REST API is a neat idea *if* you can get all the
platforms of interest to implement it identically.  However, I'm
really doubtful about that assumption:
  - Even if e.g. Rook and DeepSea both had the same logical API, the
details would be different (Salt/k8s API frameworks presumably have
different auth, different conventions on pagination, etc)
  - When a tool like Ansible Tower has a REST API, that does not mean
that playbooks can expose an arbitrary API: whatever drives the Tower
API has to understand that it is operating in terms of playbooks etc.
  - If we had a "fallback" backend that just SSH'd out to baremetal
hosts to do the basics, it would need a whole API service built around
it.

If I'm right about that point, and getting identical APIs across tools
is unrealistic, then you end up with intermediary services everywhere.
I don't think we can reasonably suggest to a user that they should
have Rook, Ceph, *and* a third service that connects Rook and Ceph.
The reasonable thing to do in that situation is rather to have code
inside Ceph that knows how to talk to Rook -- that's the (a) option
rather than the (b) option.

The situation today is that the tools already do have their own APIs
(we're halfway to the (a) solution already), so the question is
whether the code for driving those MT APIs in a uniform way should
live inside Ceph, or whether it should be in another service with a
uniform remote API layer on top of it.  In that context, I strongly
prefer not to add another remote API layer.

I agree with your points. Having all MTs provide the same API is very hard to achieve, or even *unrealistic*. Therefore, as you stated, if we need an intermediate layer to translate the Ceph-Mgt API into each MT's API, then that layer should live inside Ceph.


To be clear, I am very much in favour of having an abstraction layer:
I just think the abstraction layer would be best implemented on the
Ceph side, as some python code, rather than as a collection of
externally run proxy/translator services.  If Rook and DeepSea end up
with extremely similar APIs to the point that they can be converged,
then that would be awesome, and I'm happy to be proven wrong.

Right, having an abstraction layer between ceph-mgr modules and MTs is something we should aim for, and for the reasons you state above, this layer should be implemented in Ceph.
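
Just to illustrate what such a layer could look like (this is only a sketch, the names are invented and not a concrete proposal), it could be a small Python interface inside ceph-mgr with one driver per MT:

# Illustrative sketch of an in-Ceph abstraction layer with one driver per MT.
class MgmtBackend(object):
    """Operations the ceph-mgr modules need; each MT gets its own driver."""

    def list_hosts(self):
        raise NotImplementedError

    def create_osds(self, host, devices):
        raise NotImplementedError


class RookBackend(MgmtBackend):
    def list_hosts(self):
        # would query the kubernetes API (list nodes) here
        pass

    def create_osds(self, host, devices):
        # would create/update the corresponding Rook cluster objects here
        pass


class DeepSeaBackend(MgmtBackend):
    def list_hosts(self):
        # would call salt-api (e.g. list the minions) here
        pass

    def create_osds(self, host, devices):
        # would trigger the corresponding DeepSea stage/orchestration here
        pass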


MTs could register in the ceph-mgr service map with the URL that
provides the Ceph-Mgt API implementation. Then any ceph-mgr module
could check the availability of such an implementation and use it.
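
As a rough sketch of how a module could check for that (the service
name "ceph-mgt", the metadata key and the exact shape of the service
map below are just assumptions for the example):

# Hypothetical sketch: look up a registered Ceph-Mgt API URL in the service map.
from mgr_module import MgrModule

class Module(MgrModule):
    def find_mgt_api_url(self):
        service_map = self.get("service_map")
        svc = service_map.get("services", {}).get("ceph-mgt", {})
        for daemon in svc.get("daemons", {}).values():
            if not isinstance(daemon, dict):
                continue
            url = daemon.get("metadata", {}).get("url")
            if url:
                return url
        return None  # no MT has registered a Ceph-Mgt API implementation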

The Ceph-Mgt API should enforce a high-level abstract representation of
the cluster, in order to work in both containerized and bare-metal
environments.

I see two possible approaches here:
  - build an interface designed for containerized environments, and
enable baremetal environments to do their best to implement it
  - or, have an interface with some container-only bits and some
baremetal-only bits.

My default is for the first approach, but I'm sympathetic that some
people will have functionality in non-container backends that they
want to expose, so that would lead to the second.

Yes, I think we will start with the first approach, but we may evolve into the second approach.


For instance, the Ceph-Mgt API should specify methods for listing the
available resources of a cluster, where by resources I mean
computational devices (hosts), disks, and networks. Each resource may
have several properties, like the amount of RAM a computational device
has, or the size of a disk. Resources can also be related to each
other; for instance, a set of disk resources can be related to a single
computational device.

The Ceph-Mgt API could also specify the notion of a service, where a
service could be an OSD/MON/MGR daemon, an NFS-ganesha daemon, or any
other software component that runs inside the cluster.
The API should provide methods to start/configure/stop services, and
these services should be associated with the cluster resources described
in the previous paragraph.
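
Purely to illustrate the kind of schema I have in mind (the names and
fields below are invented for the example), the resources and services
could be modelled roughly like this:

# Invented, illustrative model of cluster resources and services.
class Host(object):
    def __init__(self, name, ram_bytes, cpus, disks, networks):
        self.name = name
        self.ram_bytes = ram_bytes    # property of the computational device
        self.cpus = cpus
        self.disks = disks            # Disk resources related to this host
        self.networks = networks      # networks this host is attached to


class Disk(object):
    def __init__(self, path, size_bytes, rotational):
        self.path = path              # e.g. "/dev/sdb"
        self.size_bytes = size_bytes
        self.rotational = rotational


class Service(object):
    # kind would be one of: "osd", "mon", "mgr", "mds", "rgw", "nfs-ganesha", ...
    def __init__(self, kind, host, config):
        self.kind = kind
        self.host = host              # resource the service is associated with
        self.config = config


class CephMgtAPI(object):
    """Methods every MT implementation would have to provide (sketch only)."""
    def list_hosts(self):
        raise NotImplementedError
    def start_service(self, service):
        raise NotImplementedError
    def stop_service(self, service):
        raise NotImplementedError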

The individual service level may not be the right abstraction for
driving the stateless daemons: perhaps something more like a
kubernetes Deployment object that requests a certain number of
services (pods) running with the same configuration.

This is the kind of problem that we will only know how to solve after implementing and experimenting. It's hard to know what the right abstraction is beforehand. And yes, we should start with something similar to what k8s does (since it is our first supported MT), and then we can always refine.
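
For example (again just an illustration, not a proposal), instead of addressing each stateless daemon individually, a backend could accept a declarative spec similar to a k8s Deployment:

# Illustrative: describe "N daemons of this kind with this config" and let the
# backend reconcile, instead of starting/stopping each stateless daemon by hand.
mds_spec = {
    "kind": "mds",
    "count": 3,                              # desired number of running daemons
    "config": {"mds_cache_memory_limit": "4294967296"},
}
# backend.apply(mds_spec) would then create or scale the daemons to match.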


Another important feature of the Ceph-Mgt API would be to provide a
stream-based connection between the MT and Ceph, so that Ceph could
listen for events from the MT and also support asynchronous operations
more efficiently.

We do need this stream of notifications from the MT to Ceph (my
favourite simple example is knowing when a k8s MDS pod fails, so that
I can call "ceph mds fail" on the rank).  This would be an area where
using existing remote APIs is nice, because k8s already has its
"watch" API.

Salt also provides a "watch"-like API where we can listen for all events that pass through the Salt event bus.
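
A rough sketch of consuming it through salt-api's /events endpoint (server-sent events; the token handling and the exact event format follow the rest_cherrypy conventions, so treat the details as approximate):

# Sketch: tail the Salt event bus via salt-api's SSE /events endpoint.
import json
import requests

API = "https://salt-master:8000"
token = "..."   # obtained from POST /login, as in the earlier example

stream = requests.get(API + "/events", headers={"X-Auth-Token": token},
                      stream=True, verify=False)
for line in stream.iter_lines():
    if line.startswith(b"data: "):
        event = json.loads(line[len(b"data: "):].decode("utf-8"))
        # e.g. react to minion up/down events or custom DeepSea tags
        print(event.get("tag"), event.get("data", {}))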


I think that with this approach the techniques used for the deployment
and management of Ceph would live mostly under the Ceph
ecosystem/project, and we could avoid the deviations that currently
exist between the different MT projects.

What do you all think about this?

The most important part currently is defining what this set of common
operations actually is.  I hope to circulate some prototype
Rook-driving code soon, which would be a good opportunity to discuss
which bits would map to other backends and how.  Ultimately, if we
have some in-Ceph python code that we later decide to kick out into
remote services with remote APIs, that wouldn't be too bad of a
situation.

I'm happy to help write, or review, some code. I have experience with Salt, so it will be nice to check how the Rook-driving code could be extended to support Salt (or to see what Salt modules need to be written to match the code's expectations).

Thanks for all your comments.


Cheers,
John



--
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284
(AG Nürnberg)