Re: Teuthology & Rook (& DeepSea, ceph-ansible, ...)

On Thu, 25 Apr 2019, Alfredo Deza wrote:
> On Wed, Apr 24, 2019 at 4:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 24, 2019 at 11:11 AM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 24, 2019 at 10:50 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > > >
> > > > Hello Travis, all,
> > > > I’ve been looking at the interfaces our ceph-qa-suite tasks expect
> > > > from the underlying teuthology and Ceph deployment tasks to try and
> > > > 1) narrow them down into something we can implement against other
> > > > backends (ceph-ansible, Rook, DeepSea, etc)
> > > > 2) see how those interfaces need to be adapted to suit the differences
> > > > between physical hosts and kubernetes pods.
> > >
> > > I would like to see that not coupled at all. Why does teuthology need
> > > to know about these? It would be really interesting to see a framework
> > > that can
> > > test against a cluster - regardless of how that cluster got there (or
> > > if it's based on containers or bare metal)
> >
> > Taking over an existing cluster without any knowledge of its setup
> > isn't practical because we need to manipulate it pretty intrusively in
> > order to perform our testing.
> 
> You are assuming that by decoupling teuthology from interfacing with
> the deployment type directly (ceph-ansible/rook/deepsea) it will not
> know about the cluster.
> 
> What I mean is that teuthology *does not need to know how to deploy
> with N systems*. A typical test would be:
> 
> 1. Something (Jenkins job, Travis CI, Zuul) deploys a ceph cluster to
> achieve a certain state using the deployment type it wants (e.g.
> ceph-ansible)
> 2. The test framework is invoked with a test suite meant to run
> against the state achieved in #1, and interacts with that
> cluster
> 
> There is no reason that 1 and 2 need to be in the same "framework",
> just like Python's unittest doesn't need to know how to set up
> PostgreSQL to run tests against it.
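
To make the unittest/PostgreSQL analogy concrete, here is a minimal
sketch of a suite that only receives connection details for a cluster
that something else already deployed (the CEPH_CONF variable and the
health check are illustrative assumptions, not an existing suite):

    import json
    import os
    import subprocess
    import unittest

    class ExistingClusterTest(unittest.TestCase):
        """Runs against whatever cluster CEPH_CONF points at; how that
        cluster got there (ceph-ansible, Rook, DeepSea, ...) is not
        this suite's concern."""

        def ceph(self, *args):
            # Assumes a local `ceph` CLI and keyring that can reach the cluster.
            cmd = ["ceph", "--conf", os.environ["CEPH_CONF"],
                   "--format", "json", *args]
            return json.loads(subprocess.check_output(cmd))

        def test_cluster_is_healthy(self):
            self.assertEqual(self.ceph("status")["health"]["status"],
                             "HEALTH_OK")

    if __name__ == "__main__":
        unittest.main()
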
> 
> Today these things *are not* decoupled and I believe that to be one of
> the main reasons why it is so complicated to deal with, extend, fix,
> and maintain overall.

That's true, but I don't think fixing this is practical without 
rewriting most of the tests as well.  Test cases are generally carefully 
crafted to install the cluster in a particular way, with a 
particular set of settings, number of daemons, and so on, in order 
to test a particular behavior.

IMO it is a general failing that teuthology wasn't crafted in a 
deployer-agnostic way.  However, it would be a huge investment of effort 
to fix that.  I don't think it's the best time investment right now.  I 
would focus instead on testing the provisioner(s) we care about with the 
tools that are most appropriate for those provisioners, and leave 
teuthology to do what it has become quite good at--testing core Ceph.

IMO eventually we should clean this up.  First, by merging Zach's ~2-year-old 
systemd patch, and then rewriting ceph.py to provision using the 
emerging ceph bootstrap process + orchestration API.  But I'm still not 
convinced adding a broad set of abstractions so that we can plug arbitrary 
deployment tools in is worth the effort.

sage


> 
> > I believe some of our tests involve
> > turning off an OSD, deliberately breaking its on-disk store, and
> > turning it back on again. We certainly do plenty of SIGABRT and other
> > things that require knowledge of the startup process/init system.
> 
> Those are things that need to be reviewed so that they can be applied
> to *a ceph cluster*, not a *ceph cluster that teuthology deployed and
> controls with internal mechanisms*
> 
> > That said, I am indeed trying to get to the point where the
> > ceph-qa-suite tasks do not have any idea how the cluster came to exist
> > and are just working through a defined interface. The existing
> > "install" and "ceph" tasks are the pieces that set up that interface,
> > and the ssh-based run() function is the most widespread "bad" part of
> > the existing interface, so those are the parts I'm focused on fixing
> > or working around right now. Plus we've learned from experience we
> > want to include testing those installers and init systems within our
> > tests...
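
A rough sketch of what such a defined interface could look like
(hypothetical names, not the actual teuthology API): the tests would
only see methods like these, and each backend - bare metal over ssh,
Rook over the kubernetes API - would supply its own implementation:

    from abc import ABC, abstractmethod

    class ClusterHandle(ABC):
        """What a ceph-qa-suite task is allowed to assume about the
        cluster, independent of how it was deployed."""

        @abstractmethod
        def run(self, node, args):
            """Run a command on a node, returning (exit status, stdout)."""

        @abstractmethod
        def restart_daemon(self, daemon_id):
            """Restart e.g. 'osd.3', whether it is a systemd unit or a pod."""

        @abstractmethod
        def signal_daemon(self, daemon_id, sig):
            """Deliver a signal (e.g. SIGABRT) to a running daemon."""
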
> 
> I disagree here. These do not need fixing, they need to be ripped out.
> 
> >
> > > > So I’d like to know how this all sounds. In particular, how
> > > > implausible is it that we can ssh into Ceph containers and execute
> > > > arbitrary shell commands?
> > >
> > > That is just not going to work in the way teuthology operates. Poking
> > > at things inside a container depends on the deployment type; for
> > > example, docker would do something like
> > > `docker exec` while kubernetes (and openshift) does it a bit differently.
> > >
> > > You can't just ssh.
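
As a small illustration of that difference (made-up helper, nothing
that exists today), the same "run this on that node" intent takes a
different argv depending on the backend:

    def remote_argv(backend, target, command):
        # Sketch only: target is a hostname, a container name, or "namespace/pod".
        if backend == "ssh":                     # bare metal / VMs
            return ["ssh", target] + command
        if backend == "docker":                  # plain docker
            return ["docker", "exec", target] + command
        if backend == "kubernetes":              # kubernetes / openshift
            namespace, pod = target.split("/")
            return ["kubectl", "exec", "-n", namespace, pod, "--"] + command
        raise ValueError("unknown backend: %s" % backend)
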
> >
> > Yes, and for things like invoking Ceph or samba daemons we have good
> > interfaces to abstract that out. But for things like "run this python
> > script I've defined in-line to scrape up a piece of data I care about"
> > there aren't any practical replacements.
> 
> There are practical replacements for this - it will take some effort
> to reason about tests in a different way from how they exist today
> (tightly coupled).
> 
> For ceph-medic, we can execute Python on a remote node, do some
> processing, and return the results - regardless of the type of
> cluster (containers, rook, openshift, kubernetes, bare metal).
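
A minimal sketch of that idea (hypothetical helper, not ceph-medic's
actual interface): pipe an inline script to `python3 -` over whichever
transport reaches the node, and read the result back:

    import subprocess

    def run_python(transport_argv, script):
        # transport_argv might be ["ssh", "osd-host-1"] or
        # ["kubectl", "exec", "-i", "-n", "rook-ceph", "some-osd-pod", "--"];
        # the calling test logic stays the same either way.
        proc = subprocess.run(transport_argv + ["python3", "-"],
                              input=script.encode(),
                              capture_output=True, check=True)
        return proc.stdout.decode()

    print(run_python(["ssh", "osd-host-1"],
                     "import socket; print(socket.gethostname())"))
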
> 
> > We can move away from doing
> > that, but I'd like to explore what our options are before I commit to
> either 1) re-writing all of that code or 2) turning off every one of
> > those tests, as a precondition of testing in Rook.
> 
> My worry is that re-writing or extending the framework to *explicitly*
> tie it into an implementation like Rook is going to end up causing
> the same problems you are trying to solve today.
> 
> > Anyway I really haven't done much with Kubernetes and I didn't realize
> > you could just get a shell out of it (I thought it fought pretty hard
> > to *prevent* that...) so I'll spend some more time looking at it.
> > -Greg
> 
> 
