On Wed, Apr 24, 2019 at 4:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Apr 24, 2019 at 11:11 AM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > Hello Travis, all,
> > > I’ve been looking at the interfaces our ceph-qa-suite tasks expect
> > > from the underlying teuthology and Ceph deployment tasks to try and
> > > 1) narrow them down into something we can implement against other
> > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > 2) see how those interfaces need to be adapted to suit the differences
> > > between physical hosts and kubernetes pods.
> >
> > I would like to see that not coupled at all. Why does teuthology need
> > to know about these? It would be really interesting to see a framework
> > that can test against a cluster - regardless of how that cluster got
> > there (or if it's based on containers or baremetal)
>
> Taking over an existing cluster without any knowledge of its setup
> isn't practical because we need to manipulate it pretty intrusively in
> order to perform our testing.

You are assuming that by decoupling teuthology from interfacing with
the deployment type directly (ceph-ansible/rook/deepsea) it will not
know anything about the cluster. What I mean is that teuthology *does
not need to know how to deploy with N systems*. A typical test run
would be:

1. Something (a Jenkins job, Travis CI, Zuul) deploys a ceph cluster to
   achieve a certain state using the deployment type it wants (e.g.
   ceph-ansible)
2. The test framework is called with the test suite that is meant to
   run against the state achieved in #1, and interacts with the cluster
   directly (a rough sketch of what this could look like is further
   down in this mail)

There is no reason that 1 and 2 need to be in the same "framework",
just like Python's unittest doesn't need to know how to set up
PostgreSQL to run tests against it.

Today these things *are not* decoupled, and I believe that is one of
the main reasons teuthology is so complicated to deal with, extend,
fix, and maintain.

> I believe some of our tests involve
> turning off an OSD, deliberately breaking its on-disk store, and
> turning it back on again. We certainly do plenty of SIGABRT and other
> things that require knowledge of the startup process/init system.

Those are things that need to be reviewed so that they can be applied
to *a ceph cluster*, not a *ceph cluster that teuthology deployed and
controls with internal mechanisms*.

> That said, I am indeed trying to get to the point where the
> ceph-qa-suite tasks do not have any idea how the cluster came to exist
> and are just working through a defined interface. The existing
> "install" and "ceph" tasks are the pieces that set up that interface,
> and the ssh-based run() function is the most widespread "bad" part of
> the existing interface, so those are the parts I'm focused on fixing
> or working around right now. Plus we've learned from experience we
> want to include testing those installers and init systems within our
> tests...

I disagree here. These do not need fixing, they need to be ripped out.

> > > So I’d like to know how this all sounds. In particular, how
> > > implausible is it that we can ssh into Ceph containers and execute
> > > arbitrary shell commands?
> >
> > That is just not going to work in the way teuthology operates. Poking
> > at things inside a container depends on the deployment type; for
> > example, docker would do something like `docker exec`, while
> > kubernetes (and openshift) does it a bit differently.
> >
> > You can't just ssh.
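Here is the rough sketch I mentioned above of what #2 could look like
when the suite treats the cluster as a given. To be clear, this is not
an existing teuthology interface: the pytest-style tests, the
`ceph -s --format json` call, and the exact JSON field names are
assumptions on my part (field names are from memory of recent releases
and may need adjusting).

# test_cluster_health.py -- hypothetical "step 2" suite. It assumes a cluster
# already exists (deployed by ceph-ansible, Rook, DeepSea, whatever) and only
# talks to it through the `ceph` CLI, never through the deployment tool.
import json
import subprocess


def ceph(*args):
    """Run a ceph command against the already-deployed cluster, parse JSON."""
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out)


def test_cluster_is_healthy():
    # Field names assumed from the `ceph -s` JSON output of recent releases.
    status = ceph("-s")
    assert status["health"]["status"] in ("HEALTH_OK", "HEALTH_WARN")


def test_mons_have_quorum():
    status = ceph("-s")
    assert len(status["quorum_names"]) >= 1

Whether the cluster in #1 came from ceph-ansible or rook makes no
difference to a suite like that; only the thing that provisioned it
changes.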
>
> Yes, and for things like invoking Ceph or samba daemons we have good
> interfaces to abstract that out. But for things like "run this python
> script I've defined in-line to scrape up a piece of data I care about"
> there aren't any practical replacements.

There are practical replacements for this - it will take some effort to
reason about tests differently from how they exist today (tightly
coupled).

For ceph-medic, we can execute Python on a remote node, do some
processing, and return results back - regardless of the type of cluster
(containers, rook, openshift, kubernetes, baremetal).

> We can move away from doing
> that, but I'd like to explore what our options are before I commit to
> either 1) re-writing all of that code or 2) turning off every one of
> those tests, as a precondition of testing in Rook.

I am afraid that re-writing or extending the framework to *explicitly*
tie it into an implementation like rook is going to end up causing the
same problems you are trying to solve today.

> Anyway I really haven't done much with Kubernetes and I didn't realize
> you could just get a shell out of it (I thought it fought pretty hard
> to *prevent* that...) so I'll spend some more time looking at it.
> -Greg
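P.S. on the "get a shell out of it" part: `kubectl exec` is to a pod
what `docker exec` is to a container, and it is enough to run arbitrary
commands (interactively, `kubectl exec -it <pod> -- bash`). Below is a
minimal sketch of the kind of shim I have in mind - the function, its
names, and the `rook-ceph` namespace default are illustrative
assumptions, not code from teuthology or ceph-medic:

# Hypothetical transport shim: run one command "on a node", whether that node
# is a bare-metal host (ssh), a docker container, or a kubernetes pod.
import shlex
import subprocess


def run_on_node(transport, target, command, namespace="rook-ceph"):
    """Return the stdout of `command` executed on `target` via `transport`."""
    cmd = shlex.split(command)
    if transport == "ssh":
        argv = ["ssh", target] + cmd
    elif transport == "docker":
        argv = ["docker", "exec", target] + cmd
    elif transport == "kubectl":
        argv = ["kubectl", "exec", "-n", namespace, target, "--"] + cmd
    else:
        raise ValueError("unknown transport: %s" % transport)
    return subprocess.check_output(argv).decode("utf-8")


# The check itself never changes, only the transport does, e.g.:
#   run_on_node("ssh", "node1", "ceph --version")
#   run_on_node("kubectl", "rook-ceph-tools-<id>", "ceph --version")

That is the point I was making about ceph-medic: the checks are written
against "a node", and how we reach that node is an implementation
detail.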