On Wed, Apr 24, 2019 at 4:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Apr 24, 2019 at 11:11 AM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > Hello Travis, all,
> > > I’ve been looking at the interfaces our ceph-qa-suite tasks expect
> > > from the underlying teuthology and Ceph deployment tasks to try and
> > > 1) narrow them down into something we can implement against other
> > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > 2) see how those interfaces need to be adapted to suit the differences
> > > between physical hosts and kubernetes pods.
> >
> > I would like to see that not coupled at all. Why does teuthology need
> > to know about these? It would be really interesting to see a framework
> > that can test against a cluster - regardless of how that cluster got
> > there (or if it's based on containers or baremetal)
>
> Taking over an existing cluster without any knowledge of its setup
> isn't practical because we need to manipulate it pretty intrusively in
> order to perform our testing.

You are assuming that by decoupling teuthology from interfacing with
the deployment type directly (ceph-ansible/rook/deepsea) it will not
know anything about the cluster. What I mean is that teuthology *does
not need to know how to deploy with N systems*. A typical test run
would be:

1. Something (a Jenkins job, Travis CI, Zuul) deploys a ceph cluster to
   achieve a certain state using the deployment type it wants (e.g.
   ceph-ansible)
2. The test framework is called with the test suite that is meant to
   run against the state achieved in #1, and interacts with the cluster
   directly (a rough sketch of what this could look like is further
   down in this mail)

There is no reason that 1 and 2 need to be in the same "framework",
just like Python's unittest doesn't need to know how to set up
PostgreSQL to run tests against it.

Today these things *are not* decoupled, and I believe that is one of
the main reasons teuthology is so complicated to deal with, extend,
fix, and maintain.

> I believe some of our tests involve
> turning off an OSD, deliberately breaking its on-disk store, and
> turning it back on again. We certainly do plenty of SIGABRT and other
> things that require knowledge of the startup process/init system.

Those are things that need to be reviewed so that they can be applied
to *a ceph cluster*, not a *ceph cluster that teuthology deployed and
controls with internal mechanisms*.

> That said, I am indeed trying to get to the point where the
> ceph-qa-suite tasks do not have any idea how the cluster came to exist
> and are just working through a defined interface. The existing
> "install" and "ceph" tasks are the pieces that set up that interface,
> and the ssh-based run() function is the most widespread "bad" part of
> the existing interface, so those are the parts I'm focused on fixing
> or working around right now. Plus we've learned from experience we
> want to include testing those installers and init systems within our
> tests...

I disagree here. These do not need fixing, they need to be ripped out.

> > > So I’d like to know how this all sounds. In particular, how
> > > implausible is it that we can ssh into Ceph containers and execute
> > > arbitrary shell commands?
> >
> > That is just not going to work in the way teuthology operates. Poking
> > at things inside a container depends on the deployment type; for
> > example, docker would do something like `docker exec`, while
> > kubernetes (and openshift) does it a bit differently.
> >
> > You can't just ssh.
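Here is the rough sketch I mentioned above of what #2 could look like
when the suite treats the cluster as a given. To be clear, this is not
an existing teuthology interface: the pytest-style tests, the
`ceph -s --format json` call, and the exact JSON field names are
assumptions on my part (field names are from memory of recent releases
and may need adjusting).

# test_cluster_health.py -- hypothetical "step 2" suite. It assumes a cluster
# already exists (deployed by ceph-ansible, Rook, DeepSea, whatever) and only
# talks to it through the `ceph` CLI, never through the deployment tool.
import json
import subprocess


def ceph(*args):
    """Run a ceph command against the already-deployed cluster, parse JSON."""
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out)


def test_cluster_is_healthy():
    # Field names assumed from the `ceph -s` JSON output of recent releases.
    status = ceph("-s")
    assert status["health"]["status"] in ("HEALTH_OK", "HEALTH_WARN")


def test_mons_have_quorum():
    status = ceph("-s")
    assert len(status["quorum_names"]) >= 1

Whether the cluster in #1 came from ceph-ansible or rook makes no
difference to a suite like that; only the thing that provisioned it
changes.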
>
> Yes, and for things like invoking Ceph or samba daemons we have good
> interfaces to abstract that out. But for things like "run this python
> script I've defined in-line to scrape up a piece of data I care about"
> there aren't any practical replacements.

There are practical replacements for this - it will take some effort to
reason about tests differently from how they exist today (tightly
coupled).

For ceph-medic, we can execute Python on a remote node, do some
processing, and return results back - regardless of the type of cluster
(containers, rook, openshift, kubernetes, baremetal).

> We can move away from doing
> that, but I'd like to explore what our options are before I commit to
> either 1) re-writing all of that code or 2) turning off every one of
> those tests, as a precondition of testing in Rook.

I am afraid that re-writing or extending the framework to *explicitly*
tie it into an implementation like rook is going to end up causing the
same problems you are trying to solve today.

> Anyway I really haven't done much with Kubernetes and I didn't realize
> you could just get a shell out of it (I thought it fought pretty hard
> to *prevent* that...) so I'll spend some more time looking at it.
> -Greg
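P.S. on the "get a shell out of it" part: `kubectl exec` is to a pod
what `docker exec` is to a container, and it is enough to run arbitrary
commands (interactively, `kubectl exec -it <pod> -- bash`). Below is a
minimal sketch of the kind of shim I have in mind - the function, its
names, and the `rook-ceph` namespace default are illustrative
assumptions, not code from teuthology or ceph-medic:

# Hypothetical transport shim: run one command "on a node", whether that node
# is a bare-metal host (ssh), a docker container, or a kubernetes pod.
import shlex
import subprocess


def run_on_node(transport, target, command, namespace="rook-ceph"):
    """Return the stdout of `command` executed on `target` via `transport`."""
    cmd = shlex.split(command)
    if transport == "ssh":
        argv = ["ssh", target] + cmd
    elif transport == "docker":
        argv = ["docker", "exec", target] + cmd
    elif transport == "kubectl":
        argv = ["kubectl", "exec", "-n", namespace, target, "--"] + cmd
    else:
        raise ValueError("unknown transport: %s" % transport)
    return subprocess.check_output(argv).decode("utf-8")


# The check itself never changes, only the transport does, e.g.:
#   run_on_node("ssh", "node1", "ceph --version")
#   run_on_node("kubectl", "rook-ceph-tools-<id>", "ceph --version")

That is the point I was making about ceph-medic: the checks are written
against "a node", and how we reach that node is an implementation
detail.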