Hello Travis, all,

I’ve been looking at the interfaces our ceph-qa-suite tasks expect from the underlying teuthology and Ceph deployment tasks, to try and 1) narrow them down into something we can implement against other backends (ceph-ansible, Rook, DeepSea, etc) and 2) see how those interfaces need to be adapted to suit the differences between physical hosts and Kubernetes pods.

Some very brief background about teuthology: it expects you to select a group of hosts (eg smithi001, smithi002), to map those hosts to specific roles (eg a host with osd.1, mon.a, client.0 and another with osd.2, mon.b, client.1, client.2), and to then run specific tasks against those configurations (eg install, ceph, kclient, fio). (Those following along at home who want more details may wish to view one of the talks I’ve given on teuthology, eg https://www.youtube.com/watch?v=gj1OXrKdSrs .)

The touch points between a ceph-qa-suite task and the remote hardware are actually not a very large interface in direct function terms, but some of the functions are very large themselves, so we’ll need to rework them a bit. I’ve taken pretty extensive notes at https://pad.ceph.com/p/teuthology-rook, but I’ll summarize here. The important touch points are 1) the “install” task, 2) the “ceph” task, and 3) the “RemoteProcess” abstraction.

The install task (https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py) is actually not too hard in terms of follow-on tasks. Its job is simply to get the system ready for any following tasks. In raw teuthology/ceph-qa-suite this includes installing the Ceph packages from shaman, plus any other special pieces we need from our own builds or the default distribution (Samba, python3, etc). Presumably for Rook this would mean setting up Kubernetes (Vasu has a PR enabling that in teuthology at https://github.com/ceph/teuthology/pull/1262) — or perhaps pointing at an existing cluster — and setting configurations so that Rook would install container images reflecting the Ceph build we want to test instead of its defaults. (I’m sure these are all very big tasks that I’m skipping over, but I want to focus on the teuthology/qa-suite interfaces for now.)

The ceph task itself (https://github.com/ceph/ceph/blob/master/qa/tasks/ceph.py) is pretty large and supports a big set of functionality. It’s responsible for actually turning on the Ceph cluster, cleaning up when the test is over, and providing some validation. This includes stuff like running with valgrind, options to make sure the cluster goes healthy or scrubs at the end of a test, checking for issues in the logs, etc. However, most of that stuff can be common code once we have the right interfaces. The parts that get shared out to other tasks are 1) functions to stop and restart specific daemons, 2) functions to check if a cluster is healthy and to wait for failures, 3) the “task” function that serves to actually start up the Ceph cluster, and most importantly 4) exposing a “DaemonGroup” that links to the “RemoteProcess” representing each Ceph daemon in the system. I presume 1-3 are again not too complicated to map onto Rook commands we can get at programmatically.

The most interesting part of this interface, and of the teuthology model more generally, is the RemoteProcess. Teuthology was created to interface with machines via a module called “orchestra” (https://github.com/ceph/teuthology/tree/master/teuthology/orchestra) that wraps SSH connections to remote nodes.
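To give a flavor of how tasks lean on that, here’s a rough sketch of the common pattern. It’s not lifted from a real task, so treat the exact method and attribute names as approximate (run.py, linked below, has the real signatures), and note that the real run() may already raise on a nonzero exit by default; this is just to show the shape of the interface:

    # Rough sketch only; names approximate, see teuthology/orchestra/run.py
    # for the real interface.
    from io import BytesIO

    def get_ceph_version(remote):
        # 'remote' wraps an SSH connection to one of the test hosts
        out = BytesIO()
        proc = remote.run(
            args=['ceph', '--version'],  # literal shell command as an argv list
            stdout=out,                  # capture stdout for the caller
            wait=False,                  # hand back the RemoteProcess immediately
        )
        proc.wait()                      # block until the command exits
        if proc.exitstatus != 0:         # nonzero exit -> the command failed
            raise RuntimeError('ceph --version failed')
        return out.getvalue().decode()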
That means you can invoke “remote.run” on a host object, passing it a literal shell command, and get back a RemoteProcess object (https://github.com/ceph/teuthology/blob/master/teuthology/orchestra/run.py#L21) representing it. On that RemoteProcess you can wait() until it’s done and/or look at the exitstatus(), and you can query whether it has finished() running. And you can access the stdin, stdout, and stderr channels! Most of this usage tends to fall into a few patterns: stdout is used to get output, stderr is mostly used for prettier error output in the logs, and stdin is used in a few places for input but is mostly used as a signal to tasks to shut down when the channel closes. It’s definitely possible to define all those options as higher-level interfaces, and that’s probably the eventual end goal, but it’ll be a hassle to convert all the existing tests up front.

So I’d like to know how this all sounds. In particular, how implausible is it that we can ssh into Ceph containers and execute arbitrary shell commands? Is there a good replacement interface for most of what I’ve described above? While a lot of the role-to-host mapping doesn’t matter, in a few test cases it is critical — is there a good way to deal with that (are tags flexible enough for us to force this model through)? Anybody have any other thoughts I’ve missed out on?

-Greg