We on the team decided a while ago that it's past time to start looking seriously at how we can properly test more of our core components without spinning up a full Ceph instance. We've been trying to sneak it in where we can on new features and modules, but after some recent experiences debugging and fixing the SimpleMessenger I got tasked with looking at how we can implement proper module tests for it. I did so yesterday and came up with a design outline that I'd like to start working on when the team gets back from FAST. We will be meeting about it as a group later, and I welcome any input from the list over the next couple of days! :)

First, we need to decide on the approach we want to take to running these tests. Sage suggested that we might want to spin up a bunch of Messengers and run them through a workload while failing their network operations at random some fraction of the time. This doesn't satisfy me, for two reasons. First, I like to *know* that certain scenarios have been tested, which is difficult to check with random failure injection; second, I want to be able to check the post-failure state of the system to make sure that we haven't leaked resources or otherwise failed non-catastrophically. Given that, we need to define a testing framework with very fine-grained control over when failures are injected.

My design consists of two parts. The first is a mechanism for incrementally designating the state of the SimpleMessenger; the second is a system for testing based on this state, syscall fault injection, and purpose-written testing scripts.

The interface for the state designator is simple: debug_state.set_state("Accepter::accepting"). We want it to be extensible, so that you can start off with simple state brackets and then move on to deeper levels like "Accepter::accepting::waiting" -> "Accepter::accepting::reading_other_addr", etc., and so that it can hold states for more than one module, allowing the SimpleMessenger to store the state of the Accepter as well as the state of the Dispatcher, etc. This suggests to me a pretty simple implementation where we grab the first word of the state as the module (Accepter, Dispatcher) and look it up in a map, then feed the rest of the state into a recursive struct so it can nest arbitrarily. We have an instance as part of each Messenger and as part of each Pipe, and we insert set_state() calls throughout the SimpleMessenger code as we decide we care about them. (There's a rough sketch of this below, after the test-order description.)

The system for testing is broken into two big pieces. One is a MessengerDriver, which acts as the client for a single SimpleMessenger instance. The DebugState has hooks to notify the MessengerDriver of state changes, and we will instrument the SimpleMessenger's syscalls to pass through the MessengerDriver (using either macros or pluggable objects) so it can inject failures on demand. The second piece is a TestDriver, which creates MessengerDriver objects and is responsible for feeding them test orders on how to behave. This interface can start off pretty simply, e.g. as hard-coded arrays of tests to run, but I expect it to evolve, perhaps to the point where we can programmatically generate complicated many-to-many tests in Python.

The interface between the TestDriver and the MessengerDriver should be pretty simple, consisting largely of the function test_orders(vector<string>& ops). Test orders are lists of strings (for now; we can do better later). These strings can be things like:

connect <ip>: initiate a connect attempt to the given IP
send <message> <ip>: start sending the given message to the given IP
wait <module> <n> <state>: wait until you've seen the given state n times in the given module
fail <module> <function> <error code>: the next time the given module calls the given syscall function, return the given error code
shutdown: destroy the attached SimpleMessenger and return

These can be expanded later to do fancier things (these examples allow even more precise cross-Messenger synchronization):

block <cond> <module> <function>, block <cond> <module> <state>: the next time <module> calls <function>/reaches <state>, block on <cond>
signal <cond> <module> <function>, signal <cond> <module> <state>: the next time <module> calls <function>/reaches <state>, notify <cond>

The MessengerDriver will then run through the given operations until it runs out, at which point it will block the next time its code executes, waiting for more instructions to come in via test_orders(). Because the MessengerDriver is called into on every state change and on every syscall, simple instructions like these let us write very powerful, precisely timed tests with a fairly small set of instructions to interpret.
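To make the state designator concrete, here's a rough sketch of the sort of thing I have in mind. Nothing here is implemented yet, and all the names are placeholders:

  #include <map>
  #include <string>

  // One node per level of the state hierarchy. The "::"-separated
  // pieces of a state string recurse down through children, so
  // states can nest arbitrarily deep.
  struct StateNode {
    std::string current;  // most recent state seen at this level
    std::map<std::string, StateNode> children;

    void set(const std::string &s) {
      size_t sep = s.find("::");
      if (sep == std::string::npos) {
        current = s;
        return;
      }
      children[s.substr(0, sep)].set(s.substr(sep + 2));
    }
  };

  // One of these lives in each Messenger and each Pipe. The first
  // word of the state string names the module (Accepter, Dispatcher,
  // ...); the remainder is handed off to that module's state tree.
  class DebugState {
  public:
    void set_state(const std::string &state) {
      size_t sep = state.find("::");
      if (sep == std::string::npos)
        return;  // expect at least "Module::state"
      modules[state.substr(0, sep)].set(state.substr(sep + 2));
      // This is also the natural hook point for notifying an
      // attached MessengerDriver of the state change.
    }
  private:
    std::map<std::string, StateNode> modules;
  };

So set_state("Accepter::accepting::waiting") records "waiting" under the Accepter's "accepting" node, and the module map gives us per-module state for free.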
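And a sketch of the driver side, with everything equally hypothetical. The idea is that the SimpleMessenger's syscalls go through a pluggable object owned by the MessengerDriver (shown here, rather than the macro option), which is what lets us inject failures on demand:

  #include <cerrno>
  #include <map>
  #include <queue>
  #include <string>
  #include <utility>
  #include <vector>
  #include <unistd.h>

  // Pluggable syscall layer: the SimpleMessenger calls these wrappers
  // instead of the raw syscalls, giving the driver a place to inject
  // failures.
  class SyscallInterceptor {
  public:
    // One pending failure per (module, function) pair, e.g.
    // ("Pipe", "write") -> EAGAIN, consumed on first use.
    typedef std::map<std::pair<std::string, std::string>, int> FailMap;
    FailMap pending_failures;

    ssize_t do_write(const std::string &module, int fd,
                     const void *buf, size_t len) {
      FailMap::iterator it = pending_failures.find(
          std::make_pair(module, std::string("write")));
      if (it != pending_failures.end()) {
        errno = it->second;
        pending_failures.erase(it);
        return -1;  // injected failure
      }
      return ::write(fd, buf, len);
    }
    // ...and likewise for connect(), accept(), read(), etc.
  };

  // Drives a single SimpleMessenger instance: consumes string orders
  // from the TestDriver, reacts to DebugState change notifications,
  // and arms the interceptor when it sees a "fail" order.
  class MessengerDriver {
  public:
    void test_orders(std::vector<std::string> &ops) {
      for (size_t i = 0; i < ops.size(); ++i)
        orders.push(ops[i]);
      // ...wake anything blocked waiting for more instructions.
    }

    // Called from DebugState::set_state(); this is where a pending
    // "wait <module> <n> <state>" order gets satisfied.
    void on_state_change(const std::string &module,
                         const std::string &state) {
      ++times_seen[std::make_pair(module, state)];
      // ...check the new count against any outstanding wait orders.
    }

  private:
    std::queue<std::string> orders;
    std::map<std::pair<std::string, std::string>, int> times_seen;
    SyscallInterceptor syscalls;
  };

A hard-coded test in the TestDriver might then be no more than:

  std::vector<std::string> ops;
  ops.push_back("connect 127.0.0.1:6800");
  ops.push_back("fail Pipe write EAGAIN");  // arm one injected failure
  ops.push_back("wait Pipe 1 writing_message");
  ops.push_back("shutdown");
  driver->test_orders(ops);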
I'm not terribly interested in replacing the networking stack. Doing so would require writing our own routing code, and I don't see much benefit to it in terms of testability; it might even be a negative, since by going through the normal networking paths we will hit more of TCP's bizarre behavior patterns (not all of them, obviously, since we're all in a single process on a single machine, but more of them).

We will also need some glue to do things like pass bound IP addresses around, but I don't see that being important or difficult in terms of the interfaces described.

Thoughts? Obvious points I've failed to consider?
-Greg