SimpleMessenger testing plan

We on the team decided a while ago that it's past time to start
looking seriously at how we can do proper testing of more of our core
components without spinning up a full Ceph instance. We've been trying
to sneak it in as we can on new features and modules, but after some
recent experiences debugging and fixing the SimpleMessenger I got
tasked with looking at how we can implement proper module tests for
it. I did so yesterday and came up with a design outline that I'd like
to start working on when the team gets back from FAST. We will be
meeting about it as a group later and I welcome any input from the
list over the next couple of days! :)

First, we need to decide on the approach we want to take to running
these tests. Sage suggested that we might want to spin up a bunch of
Messengers and run them through a workload while randomly failing
their network operations some random fraction of the time. This
doesn't satisfy me for two reasons. First, I like to *know* that
certain scenarios have been tested, which is difficult to check with
random failure injection; second, I want to be able to check the
post-failure state of the system to make sure that we haven't leaked
resources or otherwise failed non-catastrophically. Given that, we
need to define a testing framework with very fine-grained control over
when failures are injected.

My design consists of two parts. One is a mechanism for incrementally
designating the state of the SimpleMessenger; the second is a system
for testing based on this state, syscall fault injection, and
purpose-written testing scripts.
The interface for the state designator is simple:
debug_state.set_state("Accepter::accepting"). We want it to be
extensible, so that you can start off with coarse state labels and
then move on to deeper levels like
"Accepter::accepting::waiting" -> "Accepter::accepting::reading_other_addr",
etc. It should also allow states for more than one module, letting the
SimpleMessenger store the state of the Accepter as well as the state
of the Dispatcher, etc. This suggests to me a pretty simple
implementation: pull out the first word of the state as the module
name (Accepter, Dispatcher) and look it up in a map, then treat the
rest of the state as part of a recursing struct so it can nest
arbitrarily. We put an instance in each Messenger and in each Pipe,
and insert set_state calls throughout the SimpleMessenger code
wherever we decide we care.
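To make that concrete, here is a minimal sketch of what such an implementation might look like. All the names here (DebugState, StateNode, get_state) are hypothetical, not actual Ceph code; it just shows the map-of-modules plus recursing-struct idea:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct StateNode {
  std::string current;                        // active sub-state at this level
  std::map<std::string, StateNode> children;  // nested states, arbitrarily deep
};

class DebugState {
  std::map<std::string, StateNode> modules;   // keyed by first word: Accepter, ...
public:
  // state strings look like "Accepter::accepting::waiting"
  void set_state(const std::string& state) {
    std::vector<std::string> parts;
    std::stringstream ss(state);
    std::string tok;
    while (std::getline(ss, tok, ':'))
      if (!tok.empty()) parts.push_back(tok); // splitting on ':' leaves empty
                                              // tokens between "::"; skip them
    if (parts.empty()) return;
    StateNode* node = &modules[parts[0]];     // first word selects the module
    for (size_t i = 1; i < parts.size(); ++i) {
      node->current = parts[i];               // record state at this depth...
      node = &node->children[parts[i]];       // ...then recurse into it
    }
  }
  // top-level state of one module, e.g. "accepting"
  std::string get_state(const std::string& module) const {
    auto it = modules.find(module);
    return it == modules.end() ? "" : it->second.current;
  }
};
```

A caller would write debug_state.set_state("Accepter::accepting") exactly as described above, and deeper levels nest without any change to the interface.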

The system for testing is broken into two big pieces. One is a
MessengerDriver, which acts as the client for a single SimpleMessenger
instance. The DebugState has hooks to notify the MessengerDriver on
state changes, and we will instrument the SimpleMessenger's syscalls
to pass through the MessengerDriver (using either macros or pluggable
objects) so it can inject failures on demand.
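As a sketch of the pluggable-objects option (SyscallHook and FaultInjector are made-up names for illustration, not existing Ceph classes): the SimpleMessenger code would call through a hook object rather than the raw syscall, and the MessengerDriver's injector could arm a one-shot failure:

```cpp
#include <cerrno>
#include <unistd.h>

// Default hook: pass straight through to the OS.
struct SyscallHook {
  virtual ~SyscallHook() {}
  virtual ssize_t read(int fd, void* buf, size_t len) {
    return ::read(fd, buf, len);
  }
};

// MessengerDriver-side hook that fails the next read on demand,
// as armed by a "fail" test order.
struct FaultInjector : public SyscallHook {
  int pending_error = 0;  // e.g. ECONNRESET; 0 means no injection armed
  ssize_t read(int fd, void* buf, size_t len) override {
    if (pending_error) {
      errno = pending_error;  // inject the failure exactly once
      pending_error = 0;
      return -1;
    }
    return SyscallHook::read(fd, buf, len);
  }
};
```

The macro approach would look the same at the call site (a SYSCALL(read, ...) wrapper expanding to the hook call); the pluggable object just keeps it swappable at runtime.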
The second piece is a TestDriver, which creates MessengerDriver
objects and is responsible for feeding them test orders on how to
behave. This interface can start off pretty simply (e.g., as
hard-coded arrays of tests to run), but I expect it to evolve, perhaps
to the point where we can programmatically generate complicated
many-to-many tests in Python.
The interface between the TestDriver and the MessengerDriver should be
pretty simple, consisting largely of the function
test_orders(vector<string>& ops).
Test orders are lists of strings (for now; something richer later).
These strings can be things like:
connect <ip>: initiate a connect attempt to the given IP
send <message> <ip>: start sending the given message to the given IP
wait <module> <n> <state>: wait until you've seen the given state n
times in the given module
fail <module> <function> <error code>: the next time the given module
calls the given syscall function, return the given error code
shutdown: destroy the attached SimpleMessenger and return.
These can be expanded later to do fancier things (these examples allow
even more precise cross-Messenger synchronization):
block <cond> <module> <function>, block <cond> <module> <state> : the
next time <module> calls <function>/reaches <state>, block on <cond>
signal <cond> <module> <function>, signal <cond> <module> <state>: the
next time <module> calls <function>/reaches <state>, notify <cond>
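A sketch of how the MessengerDriver side of this might look (the class body and the handler stubs are hypothetical; only test_orders() comes from the interface above): orders are tokenized and queued, and a step function dispatches on the first word.

```cpp
#include <deque>
#include <sstream>
#include <string>
#include <vector>

class MessengerDriver {
  std::deque<std::vector<std::string>> queue;  // parsed, pending orders
public:
  // the interface suggested above: feed in a batch of string orders
  void test_orders(std::vector<std::string>& ops) {
    for (auto& op : ops) {
      std::vector<std::string> words;
      std::stringstream ss(op);
      for (std::string w; ss >> w;) words.push_back(w);
      if (!words.empty()) queue.push_back(words);
    }
  }
  // pop and dispatch one order; returns the verb handled (stubs only here)
  std::string step() {
    if (queue.empty()) return "";  // the real driver would block here,
                                   // waiting for more test_orders()
    std::vector<std::string> words = queue.front();
    queue.pop_front();
    const std::string& verb = words[0];
    if (verb == "connect")       { /* initiate connect to words[1] */ }
    else if (verb == "send")     { /* send words[1] to words[2] */ }
    else if (verb == "wait")     { /* wait for state seen n times */ }
    else if (verb == "fail")     { /* arm fault injector for a syscall */ }
    else if (verb == "shutdown") { /* destroy attached SimpleMessenger */ }
    return verb;
  }
};
```

Adding block/signal verbs later is just two more branches; the string format itself doesn't have to change.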

Then the MessengerDriver will run through the given operations until
it runs out; when its code next executes, it will block while waiting
for more instructions to come in via test_orders().

Because the MessengerDriver will be called into on every state change
and on every syscall, simple instructions like this let us write very
powerful, precisely-timed tests with a fairly small set of
instructions to interpret.

I'm not terribly interested in replacing the networking stack. Doing
so would require writing our own routing code, and I don't see much
benefit to it in terms of testability; it might even be a negative,
since by going through the normal networking paths we will hit more of
TCP's bizarre behavior patterns (not all of them, obviously, since
we're all in a single process on a single machine, but more).
We will also need some glue to do things like pass bound IP addresses
around, etc, but I don't see that being important or difficult in
terms of the interfaces described.

Thoughts? Obvious points I've failed to consider?
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

