On Thu, Apr 25, 2019 at 6:55 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Thu, 25 Apr 2019, Alfredo Deza wrote:
> > On Wed, Apr 24, 2019 at 4:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 24, 2019 at 11:11 AM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hello Travis, all,
> > > > > I've been looking at the interfaces our ceph-qa-suite tasks expect
> > > > > from the underlying teuthology and Ceph deployment tasks to try and
> > > > > 1) narrow them down into something we can implement against other
> > > > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > > > 2) see how those interfaces need to be adapted to suit the differences
> > > > > between physical hosts and kubernetes pods.
> > > >
> > > > I would like to see that not coupled at all. Why does teuthology need
> > > > to know about these? It would be really interesting to see a framework
> > > > that can test against a cluster - regardless of how that cluster got
> > > > there (or if it's based on containers or baremetal).
> > >
> > > Taking over an existing cluster without any knowledge of its setup
> > > isn't practical because we need to manipulate it pretty intrusively in
> > > order to perform our testing.
> >
> > You are assuming that by decoupling teuthology from interfacing with
> > the deployment type directly (ceph-ansible/rook/deepsea) it will not
> > know about the cluster.
> >
> > What I mean is that teuthology *does not need to know how to deploy
> > with N systems*. A typical test would be:
> >
> > 1. Something (a Jenkins job, Travis CI, Zuul) deploys a ceph cluster to
> >    achieve a certain state, using the deployment type it wants (e.g.
> >    ceph-ansible).
> > 2. The test framework, with a given test suite that is supposed to run
> >    against the state achieved in #1, is called to interact with the
> >    cluster.
> >
> > There is no reason that 1 and 2 need to be in the same "framework",
> > just like Python's unittest doesn't need to know how to set up
> > PostgreSQL to run tests against it.
> >
> > Today these things *are not* decoupled, and I believe that to be one of
> > the main reasons why it is so complicated to deal with, extend, fix,
> > and maintain overall.
>
> That's true, but I don't think fixing this is practical without
> rewriting most of the tests as well. Test cases are generally carefully
> crafted to install the cluster in a particular way, with a particular
> set of settings, number of daemons, and so on, in order to test a
> particular behavior.
>
> IMO it is a general failing that teuthology wasn't crafted in a
> deployer-agnostic way. However, it would be a huge investment of effort
> to fix that. I don't think it's the best time investment right now. I
> would focus instead on testing the provisioner(s) we care about with the
> tools that are most appropriate for those provisioners, and leave
> teuthology to do what it has become quite good at--testing core ceph.
>
> IMO eventually we should clean this up. First, by merging Zach's ~2 year
> old systemd patch, and then rewriting ceph.py to provision using the

We are not that horrible :). It was merged on Oct 18, 2017, including
fixes for ceph.py (it took time though):
https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/daemon/systemd.py
and
https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/daemon/group.py#L15-L17

Teuthology has ceph-deploy/ceph-ansible and systemd routines for tests to
utilize, but the leads should also take steps to utilize and fix the
suites here - blaming dev leads too here :) There might still be some
minor issues, but they can all be fixed easily.
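[Editorial sketch] The decoupled two-step flow described above - something else deploys the cluster, and the test framework only interacts with the result - could look roughly like this. This is a hypothetical illustration, not actual teuthology or ceph-medic code; the function names and the injected `run_command` runner are assumptions:

```python
import json


def get_status(run_command):
    """Fetch cluster status through an injected command runner.

    `run_command` hides the transport entirely: it could wrap ssh,
    `docker exec`, or `kubectl exec`. The test logic never knows (or
    cares) how the cluster was deployed - that happened in step 1.
    """
    out = run_command(["ceph", "status", "--format", "json"])
    return json.loads(out)


def check_health(status):
    """Step-2 test logic: assert on cluster state, nothing more."""
    return status.get("health", {}).get("status") == "HEALTH_OK"


# A CI job (step 1) would already have deployed the cluster; the test
# (step 2) is handed a runner for whatever transport reaches it, e.g.:
#   ssh_runner = lambda argv: subprocess.check_output(["ssh", "mon0"] + argv)
```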
> emerging ceph bootstrap process + orchestration API. But I'm still not
> convinced adding a broad set of abstractions so that we can plug arbitrary
> deployment tools in is worth the effort.
>
> sage
>
> > > I believe some of our tests involve
> > > turning off an OSD, deliberately breaking its on-disk store, and
> > > turning it back on again. We certainly do plenty of SIGABRT and other
> > > things that require knowledge of the startup process/init system.
> >
> > Those are things that need to be reviewed so that they can be applied
> > to *a ceph cluster*, not a *ceph cluster that teuthology deployed and
> > controls with internal mechanisms*.
> >
> > > That said, I am indeed trying to get to the point where the
> > > ceph-qa-suite tasks do not have any idea how the cluster came to exist
> > > and are just working through a defined interface. The existing
> > > "install" and "ceph" tasks are the pieces that set up that interface,
> > > and the ssh-based run() function is the most widespread "bad" part of
> > > the existing interface, so those are the parts I'm focused on fixing
> > > or working around right now. Plus we've learned from experience that we
> > > want to include testing those installers and init systems within our
> > > tests...
> >
> > I disagree here. These do not need fixing, they need to be ripped out.
> >
> > > > > So I'd like to know how this all sounds. In particular, how
> > > > > implausible is it that we can ssh into Ceph containers and execute
> > > > > arbitrary shell commands?
> > > >
> > > > That is just not going to work in the way teuthology operates. Poking
> > > > at things inside a container depends on the deployment type; for
> > > > example, docker would do something like `docker exec`, while
> > > > kubernetes (and openshift) do it a bit differently.
> > > >
> > > > You can't just ssh.
> > >
> > > Yes, and for things like invoking Ceph or samba daemons we have good
> > > interfaces to abstract that out.
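[Editorial sketch] The point that the exec mechanism differs per deployment type can be made concrete with a tiny dispatch helper. This is a hypothetical sketch - the `target` dict shape is invented for illustration - but it only builds an argv using the real CLI forms (`ssh <host> <cmd>`, `docker exec <container> <cmd>`, `kubectl exec -n <ns> <pod> -- <cmd>`):

```python
def wrap_command(cmd, target):
    """Prefix `cmd` with the exec mechanism appropriate for `target`.

    Test code calls this once and stays transport-agnostic; only this
    helper knows whether the daemon lives on baremetal, in a docker
    container, or in a kubernetes pod.
    """
    kind = target["kind"]
    if kind == "ssh":
        return ["ssh", target["host"]] + cmd
    if kind == "docker":
        return ["docker", "exec", target["container"]] + cmd
    if kind == "kubernetes":
        return ["kubectl", "exec", "-n", target["namespace"],
                target["pod"], "--"] + cmd
    raise ValueError(f"unknown target kind: {kind}")
```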
> > > But for things like "run this python
> > > script I've defined in-line to scrape up a piece of data I care about"
> > > there aren't any practical replacements.
> >
> > There are practical replacements for this - it will take some effort
> > to reason about tests in a different way from how they exist today
> > (tightly coupled).
> >
> > For ceph-medic, we can execute Python on a remote node, do some
> > processing, and return results back - regardless of the type of
> > cluster (containers, rook, openshift, kubernetes, baremetal).
> >
> > > We can move away from doing
> > > that, but I'd like to explore what our options are before I commit to
> > > either 1) re-writing all of that code or 2) turning off every one of
> > > those tests, as a precondition of testing in Rook.
> >
> > I am hoping that re-writing or extending to *explicitly* tie the
> > framework into an implementation like rook is going to end up causing
> > the same problems you are trying to solve today.
> >
> > > Anyway, I really haven't done much with Kubernetes and I didn't realize
> > > you could just get a shell out of it (I thought it fought pretty hard
> > > to *prevent* that...) so I'll spend some more time looking at it.
> > > -Greg
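[Editorial sketch] The ceph-medic-style approach mentioned above - push a small Python snippet to the node, run it, and get structured results back - could look roughly like this. This is an illustrative sketch, not ceph-medic's actual API; the `runner` callable stands in for whatever transport reaches the node:

```python
import json
import subprocess
import sys

# A small snippet to execute on the target node. It prints JSON so the
# caller gets structured data back, whatever the transport was.
SNIPPET = (
    "import json, platform; "
    "print(json.dumps({'hostname': platform.node()}))"
)


def run_remote_python(runner, snippet):
    """Run a Python snippet via `runner` and decode its JSON stdout.

    `runner` is any callable taking an argv list and returning stdout;
    it could wrap ssh, `docker exec`, or `kubectl exec` - the caller
    does not care which, so the same check works against containers,
    rook, openshift, kubernetes, or baremetal.
    """
    return json.loads(runner([sys.executable, "-c", snippet]))


# For local experimentation the "runner" can simply be subprocess:
local_runner = lambda argv: subprocess.check_output(argv)
```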