Re: Teuthology & Rook (& DeepSea, ceph-ansible, ...)

On Wed, Apr 24, 2019 at 11:35 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Wed, 24 Apr 2019, Alfredo Deza wrote:
> > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > Hello Travis, all,
> > > I’ve been looking at the interfaces our ceph-qa-suite tasks expect
> > > from the underlying teuthology and Ceph deployment tasks to try and
> > > 1) narrow them down into something we can implement against other
> > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > 2) see how those interfaces need to be adapted to suit the differences
> > > between physical hosts and kubernetes pods.
> >
> > I would like to see that not coupled at all. Why does teuthology need to
> > know about these? It would be really interesting to see a framework that
> > can test against a cluster - regardless of how that cluster got there
> > (or if its based on containers or baremetal)
>
> 100% agree.  I think trying to couple teuthology with kubernetes and rook
> is a great way to waste 6+ months of time discussing interfaces and
> debating approaches without actually delivering any tests.
>
> IMO we should test rook(+ceph) with tools meant for kubernetes, e.g. with
> prow
>
> https://github.com/kubernetes/test-infra/tree/master/prow
I have looked at prow before; it is basically a GitHub integration
tool: "Prow provides GitHub automation in the form of policy
enforcement, chat-ops via /foo style commands, and automatic PR
merging". If it is nicely decoupled, even Ceph should be able to use
it without worrying about what the underlying test-infrastructure code
is. In some of the videos it is mentioned that GCE is a *must* for
using it, since it stores test artifacts on an S3-like interface.

All the Kubernetes tests are in Go, so if one has to contribute to the
same test-infra as Kubernetes, it has to be in *Go*, and one has to
learn which libraries exist that can help speed up Ceph testing (prow
is definitely an upper layer and not useful here; it runs on
Kubernetes, so it gets the advantages of HA). For us, teuthology can
still serve some testing of Rook with Kubernetes because the existing
libraries can be consumed, but it is already complex. It boils down to
the scope of Rook testing: go deep into Ceph, test deployment only and
rely on Ceph binary testing outside Rook, or some combination of
those.

>
> sage
>
>
>
> >
> > >
> > > Some very brief background about teuthology: it expects you to select
> > > a group of hosts (eg smithi001, smithi002), to map those hosts to
> > > specific roles (eg a host with osd.1, mon.a, client.0 and another with
> > > osd.2, mon.b, client.1, client.2), and to then run specific tasks
> > > against those configurations (eg install, ceph, kclient, fio).  (Those
> > > following along at home who want more details may wish to view one of
> > > the talks I’ve given on teuthology, eg
> > > https://www.youtube.com/watch?v=gj1OXrKdSrs .)
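
To make the role mapping concrete, here is a rough sketch, in plain
Python, of the host-to-roles structure teuthology builds from a job's
"roles" section; the dict and helper below are illustrative, not
teuthology internals:

    # Illustrative role-to-host mapping, mirroring the example above.
    roles_by_host = {
        'smithi001': ['mon.a', 'osd.1', 'client.0'],
        'smithi002': ['mon.b', 'osd.2', 'client.1', 'client.2'],
    }

    def hosts_with_role(kind):
        """Yield (host, role) pairs whose role type matches, e.g. 'osd'."""
        for host, roles in sorted(roles_by_host.items()):
            for role in roles:
                if role.split('.')[0] == kind:
                    yield host, role

    # list(hosts_with_role('osd'))
    #   -> [('smithi001', 'osd.1'), ('smithi002', 'osd.2')]
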
> > >
> > > The touch points between a ceph-qa-suite task and the remote hardware
> > > are actually not a very large interface in direct function terms, but
> > > some of the functions are very large themselves so we’ll need to
> > > rework them a bit. I’ve taken pretty extensive notes at
> > > https://pad.ceph.com/p/teuthology-rook, but I’ll summarize here.
> > >
> > > The important touch points are 1) the “install” task, 2) the “ceph”
> > > task, and 3) the “RemoteProcess” abstraction.
> > >
> > > The install task
> > > (https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py)
> > > is actually not too hard in terms of follow-on tasks. Its job is
> > > simply to get the system ready for any following tasks. In raw
> > > teuthology/ceph-qa-suite this includes installing the Ceph packages
> > > from shaman, plus any other special pieces we need from our own builds
> > > or the default distribution (Samba, python3, etc). Presumably for Rook
> > > this would mean setting up Kubernetes (Vasu has a PR enabling that in
> > > teuthology at https://github.com/ceph/teuthology/pull/1262) — or
> > > perhaps pointing at an existing cluster — and setting configurations
> > > so that Rook would install container images reflecting the Ceph build
> > > we want to test instead of its defaults. (I’m sure these are all very
> > > big tasks that I’m skipping over, but I want to focus on the
> > > teuthology/qa-suite interfaces for now.)
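
As a point of reference for the discussion, a minimal sketch of what a
Rook-flavored "install" task could look like, using the contextmanager
shape that teuthology tasks follow; the 'image' config key and the
cleanup step are assumptions, not an existing API:

    import contextlib
    import logging

    log = logging.getLogger(__name__)

    @contextlib.contextmanager
    def task(ctx, config):
        # Hypothetical: point Rook at the container image built for the
        # Ceph branch under test, instead of installing packages.
        image = (config or {}).get('image', 'ceph/ceph:latest-master')
        log.info('using Ceph container image %s', image)
        # setup would patch the Rook CephCluster spec to use `image`
        try:
            yield
        finally:
            # teardown undoes whatever setup created
            log.info('install task cleanup done')
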
> > >
> > > The ceph task itself
> > > (https://github.com/ceph/ceph/blob/master/qa/tasks/ceph.py) is pretty
> > > large and supports a big set of functionality. It’s responsible for
> > > actually turning on the Ceph cluster, cleaning up when the test is
> > > over, and providing some validation. This includes stuff like running
> > > with valgrind, options to make sure the cluster goes healthy or scrubs
> > > at the end of a test, checking for issues in the logs, etc. However,
> > > most of that stuff can be common code once we have the right
> > > interfaces. The parts that get shared out to other tasks are 1)
> > > functions to stop and restart specific daemons, 2) functions to check
> > > if a cluster is healthy and to wait for failures, 3) the “task”
> > > function that serves to actually start up the Ceph cluster, and most
> > > importantly 4) exposing a “DaemonGroup” that links to the
> > > “RemoteProcess” representing each Ceph daemon in the system. I presume
> > > 1-3 are again not too complicated to map onto Rook commands we can get
> > > at programmatically.
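
Items 1-3 could plausibly hide behind a small backend-neutral surface
that either systemctl-over-SSH or kubectl could implement; a rough
sketch, with illustrative names only:

    from abc import ABC, abstractmethod

    class DaemonControl(ABC):
        """Sketch of the daemon-control surface the ceph task exposes."""

        @abstractmethod
        def stop(self, role):
            """Stop one daemon, e.g. role='osd.2'."""

        @abstractmethod
        def restart(self, role):
            """Restart one daemon."""

        @abstractmethod
        def wait_for_healthy(self, timeout=300):
            """Block until 'ceph health' is HEALTH_OK or timeout expires."""
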
> > >
> > > The most interesting part of this interface, and of the teuthology
> > > model more generally, is the RemoteProcess. Teuthology was created to
> > > interface with machines via a module called “orchestra”
> > > (https://github.com/ceph/teuthology/tree/master/teuthology/orchestra)
> > > that wraps SSH connections to remote nodes. That means you can invoke
> > > “remote.run” on host objects, passing a literal shell command, and
> > > get back a RemoteProcess object
> > > (https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/run.py#L21)
> > > representing it. On that RemoteProcess you can wait() until it’s done
> > > and/or look at the exitstatus(), you can query if it’s finished()
> > > running. And you can access the stdin, stdout, and stderr channels!
> > > Most of this usage tends to fall into a few patterns: stdout is used
> > > to get output, stderr is mostly used for prettier error output in the
> > > logs, and stdin is used in a few places for input but is mostly used
> > > as a signal to tasks to shut down when the channel closes.
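
The pattern described above looks roughly like this in practice; treat
it as a sketch, since exact signatures vary across teuthology versions
and a Remote is normally obtained from ctx.cluster rather than built
directly:

    from io import StringIO
    from teuthology.orchestra.remote import Remote

    # Constructed directly only for illustration (details approximate).
    remote = Remote('ubuntu@smithi001')

    out = StringIO()
    proc = remote.run(
        args=['ceph', 'health'],
        stdout=out,
        wait=False,           # return a RemoteProcess instead of blocking
    )
    proc.wait()               # block until the remote command exits
    status = proc.exitstatus  # exit code of the remote command
    health = out.getvalue()   # whatever the command wrote to stdout
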
> > >
> > > It’s definitely possible to define all those options as higher-level
> > > interfaces and that’s probably the eventual end goal, but it’ll be a
> > > hassle to convert all the existing tests up front.
> > >
> > > So I’d like to know how this all sounds. In particular, how
> > > implausible is it that we can ssh into Ceph containers and execute
> > > arbitrary shell commands?
> >
> > That is just not going to work in the way teuthology operates. Poking
> > at things inside a container depends on the deployment type, for
> > example, docker would do something like
> > `docker exec` while kubernetes (and openshift) does it a bit differently.
> >
> > You can't just ssh.
> >
> > Libraries like remoto [0] have all those backends implemented to
> > interact with nodes (regardless of what they are)
> >
> > [0] https://github.com/alfredodeza/remoto/tree/master/remoto/backends
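
To make that concrete: the same RemoteProcess-ish surface can sit on
top of `kubectl exec` instead of SSH. A rough sketch (the class and
names are illustrative, not remoto's API):

    import subprocess

    class PodProcess:
        """RemoteProcess-like wrapper around `kubectl exec`."""

        def __init__(self, pod, args, namespace='rook-ceph'):
            self.proc = subprocess.Popen(
                ['kubectl', 'exec', '-n', namespace, pod, '--'] + args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        def finished(self):
            return self.proc.poll() is not None

        def wait(self):
            return self.proc.wait()   # returns the exit code

        @property
        def exitstatus(self):
            return self.proc.returncode

    # e.g. p = PodProcess('rook-ceph-tools', ['ceph', 'health'])
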
> >
> >
> > > Is there a good replacement interface for
> > > most of what I’ve described above? While a lot of the role-to-host
> > > mapping doesn’t matter, in a few test cases it is critical — is there
> > > a good way to deal with that (are tags flexible enough for us to force
> > > this model through)?
> >
> > I don't know how most of those tests that have a tight dependency on
> > SSH work, but a shift in focus has to happen in how they are
> > implemented, with containers in mind. For example,
> > it is just not going to be a good idea to attempt to manage daemons
> > in the foreground, controlling stdin/stdout/stderr.
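
On the role-to-host question above: the closest Kubernetes analogue to
forcing that mapping is labeling nodes and selecting on the label from
the pod spec (nodeSelector). A rough sketch, with an illustrative
label name:

    import subprocess

    def pin_role_to_node(node, role):
        """Label a node so a pod spec can pin a role to it via
        nodeSelector, e.g. nodeSelector: {ceph-role: osd.2}."""
        subprocess.check_call([
            'kubectl', 'label', 'node', node,
            'ceph-role={}'.format(role), '--overwrite',
        ])

    # e.g. pin_role_to_node('worker-1', 'osd.2')
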
> >
> > Again, I would really like a better separation of concerns; it seems
> > like you are proposing a bit of that already, but I would like to see
> > a fully decoupled framework that doesn't need to understand how to
> > pass arguments to ceph-deploy or create files for ceph-ansible.
> >
> >
> > >
> > > Anybody have any other thoughts I’ve missed out on?
> > > -Greg
> >
> >


