Lessons Learned from Hotplug CPU Test Case Development
======================================================

Looking back, the Hotplug CPU test case effort worked pretty well. We were
able to involve a lot of different people with different levels of skill in
the process, such that the actual coding of the test cases was fairly
straightforward. I think we would benefit from following this same model as
we start looking at memory hotplug testing, and based on the experience with
CPU I have some thoughts on how we could improve the process. The following
is divided into "design-time" and "implementation-time" thoughts.

TEST CASE DESIGN
================

1. Make the test descriptions specific.

   We found that some of the CPU test cases were easier to code than others,
   because some items were described ambiguously. For example, "Verify the
   flux capacitor works" raises four questions: At what point do we do the
   verification? What's a flux capacitor? How do we know that it works? What
   do we do if it doesn't work? A better description would say, "After
   switching it on, check that LED42 is green on the Flux Capacitor Status
   Panel, and mark the test FAILED if it isn't."

   Example command lines are extremely helpful. For instance, instead of
   saying, "Use 'foo' to check that bar works," it is even more useful to
   say, "Run something like 'foo -a -b 3 -c 10', and verify that 'bar' is
   listed in the output."

2. State even the obvious.

   With hotplug01, one of the things that really slowed us down was that we
   didn't know what an "affinity mask" was, nor how to generate the hex
   masks, verify them, and so on. It was only after seeing someone else's
   test case that we finally understood. In the test definitions, be sure to
   define new concepts or give pointers to examples; unfamiliar terminology
   can really slow down implementation. (A short sketch of generating and
   verifying an affinity mask appears after this list.)

   Also, when writing the test case description, write down a few sentences
   describing what the test case should do and what issue it addresses. In
   some cases we had to do research to work out what the test cases were
   for, and it would save time to have this up front. The person writing the
   test case probably already has this in mind at the time, so it should be
   easy to jot down early on.

3. Avoid reliance on interactive tools.

   Some of the biggest challenges we had in implementing the test cases came
   from phrases specifying monitoring with interactive tools. For example,
   "Start a and b, then watch top to make sure c happens." top is easy to
   use in interactive testing, but for automated test cases it can be
   complicated to rely on tools like these. We were usually able to find
   another approach, such as extracting the same information from a file in
   /proc. (See the /proc-based sketch after this list.)

4. Pseudocode.

   We found that after reviewing the test cases, implementing them in crude
   pseudocode was worthwhile, because it helped identify the basic logic in
   a form that could be easily translated into bash. This intermediate
   format was also simple enough that others could review and comment on it
   without having to dig through extraneous structural and error-checking
   code. The key objective of the pseudocode is to work out solutions for
   the technical details: how to operate various tools or workloads,
   algorithms for parsing output from other tools, order of operations, loop
   structures, and so forth. Things like error handling and output
   formatting can be left to the implementor.
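   To make item 2 concrete, here is a minimal sketch of building and
   verifying a CPU affinity mask without any interactive tool. It assumes
   the taskset(1) utility from util-linux is available; the CPU number and
   the use of the current shell as the target process are just placeholders.

       #!/bin/bash
       # Build a hex affinity mask that allows only CPU $cpu, apply it,
       # then read it back to verify.  Assumes taskset(1) is installed.
       cpu=1
       pid=$$                                # this shell as the example process

       mask=$(printf '%x' $((1 << cpu)))     # e.g. CPU 1 -> mask "2"

       taskset -p "$mask" "$pid" > /dev/null || exit 1

       # taskset -p prints: "pid 1234's current affinity mask: 2"
       current=$(taskset -p "$pid" | awk '{print $NF}')

       if [ "$current" = "$mask" ]; then
           echo "affinity PASS: mask is $current"
       else
           echo "affinity FAIL: expected $mask, got $current"
           exit 1
       fi

   A couple of lines like these in the test description would have answered
   most of the questions we had about affinity masks up front.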
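   And for item 3, rather than watching top, the same kind of check can
   usually be scripted against /proc. The sketch below reads the
   "processor" field (field 39) of /proc/<pid>/stat to see which CPU a task
   last ran on; the busy-loop workload is only illustrative, and the simple
   awk field count assumes the process name contains no spaces.

       #!/bin/bash
       # Non-interactive alternative to "watch top": read /proc/<pid>/stat
       # to confirm which CPU a workload is running on.

       # Illustrative workload -- replace with the real test workload.
       ( while :; do :; done ) &
       workpid=$!

       sleep 2    # give the scheduler time to place the task

       # Field 39 of /proc/<pid>/stat is the CPU the task last executed on.
       last_cpu=$(awk '{print $39}' /proc/$workpid/stat)
       echo "workload $workpid last ran on CPU $last_cpu"

       kill $workpid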
TEST CASE IMPLEMENTATION
========================

5. Keep It Short and Simple (KISS).

   We made each test case focus on a specific kind of test. We preferred
   breaking things like file parsing, workloads, etc. into separate scripts
   rather than trying to pile everything into a single script: do one thing
   and do it well. We also designed the test cases so they can be run
   through once very quickly, or optionally run in loops; this way a
   developer can use the suite as a quick sanity check, while a tester can
   configure it to run continuously under a workload.

   Think of the individual test cases as building blocks from which more
   complex tests can be assembled later. Keeping the test cases simple and
   doing the more sophisticated logic at a higher level gives everyone more
   flexibility and power for future testing.

6. Parameterize with environment variables.

   In general, each test case should run with no parameters, or with an
   absolute minimum, so it is easy to use. However, the test cases are more
   useful if you can override certain internal settings, such as the number
   of loops to run through. We found the most convenient way to do this in
   bash was with environment variables. Here is the syntax we used:

       HOTPLUG06_LOOPS=${HOTPLUG06_LOOPS:-${LOOPS}}
       loop_six=${HOTPLUG06_LOOPS:-1}

   This allows the user to override the looping for this specific test case
   by setting the environment variable $HOTPLUG06_LOOPS, or to set looping
   for _all_ test cases via the variable $LOOPS. If neither variable is set,
   it defaults to 1 loop.

   When describing test cases, also think about other ways to parameterize
   them, such as the length of time to sleep between various operations,
   temporary file names, names and paths for input or output files, commands
   that may be platform-specific, etc.

7. Clean up after yourself.

   The CPU hotplug test cases turned CPUs on and off. Obviously, it would be
   annoying if the test suite finished with some of your processors left off
   that were on before! We adopted the practice of ensuring that each test
   case left the system more or less as it found it. Thus, a test case that
   attempts to turn all CPUs on and off would keep track of which CPUs were
   on or off at the start, do its testing, and then restore the CPUs to
   their original on/off states. (A sketch of this save-and-restore pattern
   follows below.)

   We also liked Ashok Raj's approach of trapping user interrupt signals to
   perform cleanup:

       do_intr() {
           echo "HotPlug01 FAIL: User interrupt"
           do_clean
           exit 1
       }

       trap "do_intr" 1 2 15

   This is in test case 1, with plans to add it to all the other test cases.
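   As a concrete example of item 7, here is a minimal sketch of recording
   which CPUs are online before a test and restoring that state afterwards,
   using the standard sysfs hotplug files (/sys/devices/system/cpu/cpuN/online).
   The do_test function is only a placeholder for the real test body; the
   script needs root, a bash with associative arrays, and omits error
   handling for brevity.

       #!/bin/bash
       # Record each CPU's online state, run the test, then restore the
       # state, so the suite leaves the machine as it found it.

       declare -A orig_state

       restore_cpus() {
           for cpu in "${!orig_state[@]}"; do
               echo "${orig_state[$cpu]}" > "/sys/devices/system/cpu/$cpu/online"
           done
       }

       do_test() {
           :   # placeholder for the real on/off testing
       }

       # Remember which CPUs were online before we start.
       for path in /sys/devices/system/cpu/cpu[0-9]*; do
           cpu=${path##*/}
           [ -f "$path/online" ] || continue   # cpu0 usually cannot be offlined
           orig_state[$cpu]=$(cat "$path/online")
       done

       # Restore state on HUP/INT/TERM, the same signals trapped above.
       trap 'restore_cpus; exit 1' 1 2 15

       do_test
       restore_cpus

   Keeping the restore logic in one function also means the interrupt trap
   and the normal exit path share the same cleanup code.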