On Thu, Apr 25, 2019 at 6:55 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Thu, 25 Apr 2019, Alfredo Deza wrote:
> > On Wed, Apr 24, 2019 at 4:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 24, 2019 at 11:11 AM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hello Travis, all,
> > > > > I've been looking at the interfaces our ceph-qa-suite tasks expect
> > > > > from the underlying teuthology and Ceph deployment tasks to try and
> > > > > 1) narrow them down into something we can implement against other
> > > > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > > > 2) see how those interfaces need to be adapted to suit the differences
> > > > > between physical hosts and kubernetes pods.
> > > >
> > > > I would like to see that not coupled at all. Why does teuthology need
> > > > to know about these? It would be really interesting to see a framework
> > > > that can test against a cluster - regardless of how that cluster got
> > > > there (or if it's based on containers or baremetal).
> > >
> > > Taking over an existing cluster without any knowledge of its setup
> > > isn't practical because we need to manipulate it pretty intrusively in
> > > order to perform our testing.
> >
> > You are assuming that by decoupling teuthology from interfacing with
> > the deployment type directly (ceph-ansible/rook/deepsea) it will not
> > know about the cluster.
> >
> > What I mean is that teuthology *does not need to know how to deploy
> > with N systems*. A typical test would be:
> >
> > 1. Something (a Jenkins job, Travis CI, Zuul) deploys a ceph cluster to
> >    achieve a certain state, using the deployment type it wants (e.g.
> >    ceph-ansible).
> > 2. The test framework, with a given test suite that is supposed to run
> >    against the state achieved in #1, is called to interact with the
> >    cluster.
> >
> > There is no reason that 1 and 2 need to be in the same "framework",
> > just like Python's unittest doesn't need to know how to set up
> > PostgreSQL to run tests against it.
> >
> > Today these things *are not* decoupled, and I believe that to be one of
> > the main reasons why it is so complicated to deal with, extend, fix,
> > and maintain overall.
>
> That's true, but I don't think fixing this is practical without
> rewriting most of the tests as well. Test cases are generally carefully
> crafted to install the cluster in a particular way, with a particular
> set of settings, number of daemons, and so on, in order to test a
> particular behavior.
>
> IMO it is a general failing that teuthology wasn't crafted in a
> deployer-agnostic way. However, it would be a huge investment of effort
> to fix that. I don't think it's the best time investment right now. I
> would focus instead on testing the provisioner(s) we care about with the
> tools that are most appropriate for those provisioners, and leave
> teuthology to do what it has become quite good at--testing core ceph.
>
> IMO eventually we should clean this up. First, by merging Zach's ~2 year
> old systemd patch, and then rewriting ceph.py to provision using the

We are not that horrible :). It was merged on Oct 18, 2017, including
fixes for ceph.py (it took time though):
https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/daemon/systemd.py
and
https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/daemon/group.py#L15-L17

Teuthology has ceph-deploy/ceph-ansible and systemd routines for tests to
utilize, but the leads should also take steps to utilize and fix the
suites here - blaming dev leads too here :) There might still be some
minor issues, but they can all be fixed easily.
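[Editorial sketch] The decoupled two-step flow described above - something else deploys the cluster, and the test framework only interacts with the result - could look roughly like this. This is a hypothetical illustration, not actual teuthology or ceph-medic code; the function names and the injected `run_command` runner are assumptions:

```python
import json


def get_status(run_command):
    """Fetch cluster status through an injected command runner.

    `run_command` hides the transport entirely: it could wrap ssh,
    `docker exec`, or `kubectl exec`. The test logic never knows (or
    cares) how the cluster was deployed - that happened in step 1.
    """
    out = run_command(["ceph", "status", "--format", "json"])
    return json.loads(out)


def check_health(status):
    """Step-2 test logic: assert on cluster state, nothing more."""
    return status.get("health", {}).get("status") == "HEALTH_OK"


# A CI job (step 1) would already have deployed the cluster; the test
# (step 2) is handed a runner for whatever transport reaches it, e.g.:
#   ssh_runner = lambda argv: subprocess.check_output(["ssh", "mon0"] + argv)
```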
> emerging ceph bootstrap process + orchestration API. But I'm still not
> convinced adding a broad set of abstractions so that we can plug arbitrary
> deployment tools in is worth the effort.
>
> sage
>
> > > I believe some of our tests involve
> > > turning off an OSD, deliberately breaking its on-disk store, and
> > > turning it back on again. We certainly do plenty of SIGABRT and other
> > > things that require knowledge of the startup process/init system.
> >
> > Those are things that need to be reviewed so that they can be applied
> > to *a ceph cluster*, not a *ceph cluster that teuthology deployed and
> > controls with internal mechanisms*.
> >
> > > That said, I am indeed trying to get to the point where the
> > > ceph-qa-suite tasks do not have any idea how the cluster came to exist
> > > and are just working through a defined interface. The existing
> > > "install" and "ceph" tasks are the pieces that set up that interface,
> > > and the ssh-based run() function is the most widespread "bad" part of
> > > the existing interface, so those are the parts I'm focused on fixing
> > > or working around right now. Plus we've learned from experience that we
> > > want to include testing those installers and init systems within our
> > > tests...
> >
> > I disagree here. These do not need fixing, they need to be ripped out.
> >
> > > > > So I'd like to know how this all sounds. In particular, how
> > > > > implausible is it that we can ssh into Ceph containers and execute
> > > > > arbitrary shell commands?
> > > >
> > > > That is just not going to work in the way teuthology operates. Poking
> > > > at things inside a container depends on the deployment type; for
> > > > example, docker would do something like `docker exec`, while
> > > > kubernetes (and openshift) do it a bit differently.
> > > >
> > > > You can't just ssh.
> > >
> > > Yes, and for things like invoking Ceph or samba daemons we have good
> > > interfaces to abstract that out.
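[Editorial sketch] The point that the exec mechanism differs per deployment type can be made concrete with a tiny dispatch helper. This is a hypothetical sketch - the `target` dict shape is invented for illustration - but it only builds an argv using the real CLI forms (`ssh <host> <cmd>`, `docker exec <container> <cmd>`, `kubectl exec -n <ns> <pod> -- <cmd>`):

```python
def wrap_command(cmd, target):
    """Prefix `cmd` with the exec mechanism appropriate for `target`.

    Test code calls this once and stays transport-agnostic; only this
    helper knows whether the daemon lives on baremetal, in a docker
    container, or in a kubernetes pod.
    """
    kind = target["kind"]
    if kind == "ssh":
        return ["ssh", target["host"]] + cmd
    if kind == "docker":
        return ["docker", "exec", target["container"]] + cmd
    if kind == "kubernetes":
        return ["kubectl", "exec", "-n", target["namespace"],
                target["pod"], "--"] + cmd
    raise ValueError(f"unknown target kind: {kind}")
```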
> > > But for things like "run this python
> > > script I've defined in-line to scrape up a piece of data I care about"
> > > there aren't any practical replacements.
> >
> > There are practical replacements for this - it will take some effort
> > to reason about tests in a different way from how they exist today
> > (tightly coupled).
> >
> > For ceph-medic, we can execute Python on a remote node, do some
> > processing, and return results back - regardless of the type of
> > cluster (containers, rook, openshift, kubernetes, baremetal).
> >
> > > We can move away from doing
> > > that, but I'd like to explore what our options are before I commit to
> > > either 1) re-writing all of that code or 2) turning off every one of
> > > those tests, as a precondition of testing in Rook.
> >
> > I am hoping that re-writing or extending to *explicitly* tie the
> > framework into an implementation like rook is going to end up causing
> > the same problems you are trying to solve today.
> >
> > > Anyway, I really haven't done much with Kubernetes and I didn't realize
> > > you could just get a shell out of it (I thought it fought pretty hard
> > > to *prevent* that...) so I'll spend some more time looking at it.
> > > -Greg
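[Editorial sketch] The ceph-medic-style approach mentioned above - push a small Python snippet to the node, run it, and get structured results back - could look roughly like this. This is an illustrative sketch, not ceph-medic's actual API; the `runner` callable stands in for whatever transport reaches the node:

```python
import json
import subprocess
import sys

# A small snippet to execute on the target node. It prints JSON so the
# caller gets structured data back, whatever the transport was.
SNIPPET = (
    "import json, platform; "
    "print(json.dumps({'hostname': platform.node()}))"
)


def run_remote_python(runner, snippet):
    """Run a Python snippet via `runner` and decode its JSON stdout.

    `runner` is any callable taking an argv list and returning stdout;
    it could wrap ssh, `docker exec`, or `kubectl exec` - the caller
    does not care which, so the same check works against containers,
    rook, openshift, kubernetes, or baremetal.
    """
    return json.loads(runner([sys.executable, "-c", snippet]))


# For local experimentation the "runner" can simply be subprocess:
local_runner = lambda argv: subprocess.check_output(argv)
```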