Re: Teuthology & Rook (& DeepSea, ceph-ansible, ...)

Vasu Kulkarni <vakulkar@xxxxxxxxxx> · Wed, 24 Apr 2019 14:37:17 -0700



On Wed, Apr 24, 2019 at 2:28 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Wed, 24 Apr 2019, Vasu Kulkarni wrote:
> > On Wed, Apr 24, 2019 at 11:35 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 24 Apr 2019, Alfredo Deza wrote:
> > > > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hello Travis, all,
> > > > > I’ve been looking at the interfaces our ceph-qa-suite tasks expect
> > > > > from the underlying teuthology and Ceph deployment tasks to try and
> > > > > 1) narrow them down into something we can implement against other
> > > > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > > > 2) see how those interfaces need to be adapted to suit the differences
> > > > > between physical hosts and kubernetes pods.
> > > >
> > > > I would like to see that not coupled at all. Why does teuthology need to
> > > > know about these? It would be really interesting to see a framework that
> > > > > can test against a cluster - regardless of how that cluster got there
> > > > > (or whether it's based on containers or bare metal).
> > >
> > > > 100% agree.  I think trying to couple teuthology with kubernetes and rook
> > > > is a great way to waste 6+ months of time discussing interfaces and
> > > > debating approaches without delivering any actual tests.
> > >
> > > IMO we should test rook(+ceph) with tools meant for kubernetes, e.g. with
> > > prow
> > >
> > > https://github.com/kubernetes/test-infra/tree/master/prow
> > I have looked at prow before and it is basically a GitHub
> > integration tool: "Prow provides GitHub automation in the form of
> > policy enforcement, chat-ops via /foo style commands, and automatic PR
> > merging". If it's nicely decoupled, even ceph should be able to use it
> > without worrying about what the underlying test infrastructure code
> > is. Some of the videos mention that GCE is a *must* for using it,
> > since it stores test artifacts on an S3-like interface.
> >
> > All the kubernetes tests are in Go, so if one has to contribute
> > to the same test-infra as kubernetes then it has to be in *Go*,
> > and one has to learn what libraries exist that can help speed up ceph
> > testing (prow is definitely an upper layer and not useful here; it runs
> > on kube so it gets the advantages of HA). For us, teuthology can still
> > serve some testing of rook with K8s because the existing libraries can
> > be consumed, but it is already complex. It boils down to what the
> > scope of rook testing is (go deep into ceph, or test deployment only and
> > rely on ceph binary testing outside rook, or some combination of
> > those).
>
> My assumption is that the rook tests should focus on interaction with
> kubernetes and openshift.  Superficially, prow sounds like the right tool
> to test various kubernetes features, a range of kubernetes versions, CSI,
> manipulation of the rook CRDs and verification that they are correctly
> expressed as changes in the controlled ceph cluster, etc.
>
> I don't think there's any reason to reimplement the deep ceph testing that
> we do in teuthology using a different framework.  Teuthology is great at
> testing core ceph, and the rook tests really don't need to worry about
> that.
Cool, Thanks for the confirmation.

> So far we have a terrible track record making teuthology test
> anything else (even systemd units!).
True :)

> This seems like a clear case where
> there are diverging testing goals and the same tool need not be used for
> both of them.
>
> sage
>
>
>
>
> >
> > >
> > > sage
> > >
> > >
> > >
> > > >
> > > > >
> > > > > Some very brief background about teuthology: it expects you to select
> > > > > > a group of hosts (eg smithi001, smithi002), to map those hosts to
> > > > > specific roles (eg a host with osd.1, mon.a, client.0 and another with
> > > > > osd.2, mon.b, client.1, client.2), and to then run specific tasks
> > > > > against those configurations (eg install, ceph, kclient, fio).  (Those
> > > > > following along at home who want more details may wish to view one of
> > > > > the talks I’ve given on teuthology, eg
> > > > > https://www.youtube.com/watch?v=gj1OXrKdSrs .)
> > > > >
> > > > > The touch points between a ceph-qa-suite task and the remote hardware
> > > > > are actually not a very large interface in direct function terms, but
> > > > > some of the functions are very large themselves so we’ll need to
> > > > > rework them a bit. I’ve taken pretty extensive notes at
> > > > > https://pad.ceph.com/p/teuthology-rook, but I’ll summarize here.
> > > > >
> > > > > The important touch points are 1) the “install” task, 2) the “ceph”
> > > > > task, and 3) the “RemoteProcess” abstraction.
> > > > >
> > > > > The install task
> > > > > (https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py)
> > > > > is actually not too hard in terms of follow-on tasks. Its job is
> > > > > simply to get the system ready for any following tasks. In raw
> > > > > teuthology/ceph-qa-suite this includes installing the Ceph packages
> > > > > from shaman, plus any other special pieces we need from our own builds
> > > > > or the default distribution (Samba, python3, etc). Presumably for Rook
> > > > > this would mean setting up Kubernetes (Vasu has a PR enabling that in
> > > > > teuthology at https://github.com/ceph/teuthology/pull/1262) — or
> > > > > perhaps pointing at an existing cluster — and setting configurations
> > > > > so that Rook would install container images reflecting the Ceph build
> > > > > we want to test instead of its defaults. (I’m sure these are all very
> > > > > big tasks that I’m skipping over, but I want to focus on the
> > > > > teuthology/qa-suite interfaces for now.)
> > > > >
> > > > > The ceph task itself
> > > > > (https://github.com/ceph/ceph/blob/master/qa/tasks/ceph.py) is pretty
> > > > > large and supports a big set of functionality. It’s responsible for
> > > > > actually turning on the Ceph cluster, cleaning up when the test is
> > > > > over, and providing some validation. This includes stuff like running
> > > > > with valgrind, options to make sure the cluster goes healthy or scrubs
> > > > > at the end of a test, checking for issues in the logs, etc. However,
> > > > > most of that stuff can be common code once we have the right
> > > > > interfaces. The parts that get shared out to other tasks are 1)
> > > > > functions to stop and restart specific daemons, 2) functions to check
> > > > > if a cluster is healthy and to wait for failures, 3) the “task”
> > > > > function that serves to actually start up the Ceph cluster, and most
> > > > > importantly 4) exposing a “DaemonGroup” that links to the
> > > > > “RemoteProcess” representing each Ceph daemon in the system. I presume
> > > > > 1-3 are again not too complicated to map onto Rook commands we can get
> > > > > at programmatically.
> > > > >
> > > > > The most interesting part of this interface, and of the teuthology
> > > > > model more generally, is the RemoteProcess. Teuthology was created to
> > > > > interface with machines via a module called “orchestra”
> > > > > (https://github.com/ceph/teuthology/tree/master/teuthology/orchestra)
> > > > > that wraps SSH connections to remote nodes. That means you can invoke
> > > > > “remote.run” on host objects that passes a literal shell command and
> > > > > get back a RemoteProcess object
> > > > > (https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/run.py#L21)
> > > > > representing it. On that RemoteProcess you can wait() until it’s done
> > > > > and/or look at the exitstatus(), you can query if it’s finished()
> > > > > running. And you can access the stdin, stdout, and stderr channels!
> > > > > Most of this usage tends to fall into a few patterns: stdout is used
> > > > > to get output, stderr is mostly used for prettier error output in the
> > > > > logs, and stdin is used in a few places for input but is mostly used
> > > > > as a signal to tasks to shut down when the channel closes.
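A minimal sketch of the stdin-as-shutdown-signal pattern described above. It uses a local subprocess as a stand-in for a remote command, so nothing here is teuthology's own code: the function name and the child script are hypothetical, and only the Python standard library is assumed.

```python
import subprocess
import sys

# Hypothetical child: blocks reading stdin until the channel closes (EOF),
# then prints a message and exits cleanly.
CHILD_CODE = "import sys; sys.stdin.read(); print('shutting down')"


def run_until_stdin_closed():
    """Start the child, then signal shutdown by closing its stdin."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # The child is blocked on stdin, so it has not finished yet.
    was_running = proc.poll() is None
    # Closing the channel is the shutdown signal.
    proc.stdin.close()
    # Wait for exit, then collect the (small) output.
    exitstatus = proc.wait()
    output = proc.stdout.read()
    proc.stdout.close()
    return was_running, exitstatus, output


if __name__ == "__main__":
    print(run_until_stdin_closed())
```

Whatever replaces SSH for containers would need some equivalent of "close this channel and the process notices", which is the part of the interface that does not map onto a plain one-shot exec.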
> > > > >
> > > > > It’s definitely possible to define all those options as higher-level
> > > > > interfaces and that’s probably the eventual end goal, but it’ll be a
> > > > > hassle to convert all the existing tests up front.
> > > > >
> > > > > So I’d like to know how this all sounds. In particular, how
> > > > > implausible is it that we can ssh into Ceph containers and execute
> > > > > arbitrary shell commands?
> > > >
> > > > That is just not going to work in the way teuthology operates. Poking
> > > > at things inside a container depends on the deployment type, for
> > > > example, docker would do something like
> > > > `docker exec` while kubernetes (and openshift) does it a bit differently.
> > > >
> > > > You can't just ssh.
> > > >
> > > > Libraries like remoto [0] have all those backends implemented to
> > > > interact with nodes (regardless of what they are)
> > > >
> > > > [0] https://github.com/alfredodeza/remoto/tree/master/remoto/backends
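To illustrate the point about backends, here is a rough sketch of how the same shell command ends up as a different invocation depending on where the daemon lives. This is not remoto's API; the function names, the target names and the "rook-ceph" namespace default are assumptions for illustration, and the container variants assume the image ships `sh`.

```python
def ssh_argv(host, command):
    # ssh hands the command string to the remote user's shell.
    return ["ssh", host, command]


def docker_argv(container, command):
    # docker exec needs an explicit shell to interpret a command string.
    return ["docker", "exec", container, "sh", "-c", command]


def kubectl_argv(pod, command, namespace="rook-ceph"):
    # The namespace default is an assumption, not something the thread states.
    return ["kubectl", "exec", "--namespace", namespace, pod,
            "--", "sh", "-c", command]


# Hypothetical registry mapping a backend name to its argv builder.
BACKENDS = {
    "ssh": ssh_argv,
    "docker": docker_argv,
    "kubernetes": kubectl_argv,
}


def build_argv(backend, target, command):
    """Return the argv list that would run `command` on `target`."""
    try:
        builder = BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown backend: %s" % backend)
    return builder(target, command)


if __name__ == "__main__":
    for name, target in [("ssh", "smithi001"),
                         ("docker", "ceph-mon-a"),
                         ("kubernetes", "rook-ceph-mon-a")]:
        print(build_argv(name, target, "ceph -s"))
```

A test that only ever calls something like build_argv() does not need to know which deployment tool created the cluster, which is the kind of decoupling being asked for.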
> > > >
> > > >
> > > > > > Is there a good replacement interface for
> > > > > most of what I’ve described above? While a lot of the role-to-host
> > > > > mapping doesn’t matter, in a few test cases it is critical — is there
> > > > > a good way to deal with that (are tags flexible enough for us to force
> > > > > this model through)?
> > > >
> > > > I don't know how most of those tests that have a tight dependency on
> > > > SSH work, but a shift in focus has to happen on how they are
> > > > implemented having containers in mind. For example,
> > > > > it is just not going to be a good idea to attempt to manage daemons
> > > > in the foreground controlling stdin/stdout/stderr.
> > > >
> > > > > Again, I would really like a better separation of items. It seems like
> > > > you are proposing a bit of that already, but I would like to see a
> > > > fully decoupled framework that doesn't need to understand how to pass
> > > > arguments to ceph-deploy
> > > > or create files for ceph-ansible.
> > > >
> > > >
> > > > >
> > > > > Anybody have any other thoughts I’ve missed out on?
> > > > > -Greg
> > > >
> > > >
> >
> >
> >