Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

On Tue, Dec 04 2018, SZEDER Gábor wrote:

> Unfortunately, we have a few flaky tests, whose failures tend to be
> hard to reproduce.  We've found that the best we can do to reproduce
> such a failure is to run the test repeatedly while the machine is
> under load, and wait in the hope that the load creates enough variance
> in the timing of the test's commands that a failure is eventually
> triggered.  I have a command to do that, and I noticed that two other
> contributors have rolled their own scripts to do the same, all
> choosing slightly different approaches.
>
> To help reproduce failures in flaky tests, introduce the '--stress'
> option to run a test script repeatedly in multiple parallel
> invocations until one of them fails, thereby using the test script
> itself to increase the load on the machine.
>
> The number of parallel invocations is determined by, in order of
> precedence: the number specified as '--stress=<N>', or the value of
> the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
> available processors in '/proc/cpuinfo', or 8.
>
> To prevent the several parallel invocations of the same test from
> interfering with each other:
>
>   - Include the parallel job's number in the name of the trash
>     directory and the various output files under 't/test-results/' as
>     a '.stress-<Nr>' suffix.
>
>   - Add the parallel job's number to the port number specified by the
>     user or to the test number, so even tests involving daemons
>     listening on a TCP socket can be stressed.
>
>   - Make '--stress' imply '--verbose-log' and discard the test's
>     standard output and error; dumping the output of several parallel
>     tests to the terminal would create a big ugly mess.
>
> 'wait' for all parallel jobs before exiting (either because a failure
> was found or because the user lost patience and aborted the stress
> test), allowing the still running tests to finish.  Otherwise the "OK
> X.Y" progress output from the last iteration would likely arrive after
> the user got back the shell prompt, interfering with typing in the
> next command.  OTOH, this waiting might induce a considerable delay
> between hitting ctrl-C and the test actually exiting; I'm not sure
> this is the right tradeoff.
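
(For concreteness: as I read it, the stress run described above boils
down to roughly the sketch below. This is not the patch itself, just my
reading of the commit message, reusing the existing --root option for
the per-job isolation that the patch gets from its '.stress-<Nr>'
suffix.)

    # N parallel jobs, each re-running the test until any one of them fails
    nr=0
    while test $nr -lt "$GIT_TEST_STRESS_LOAD"
    do
        (
            cnt=0
            while ! test -e stress-failed
            do
                cnt=$(($cnt + 1))
                if ./t0000-basic.sh --verbose-log \
                        --root="trash.stress-$nr" >/dev/null 2>&1
                then
                    echo "OK $nr.$cnt"
                else
                    echo "FAIL $nr.$cnt"
                    : >stress-failed   # tell the other jobs to stop looping
                fi
            done
        ) &
        nr=$(($nr + 1))
    done
    wait   # let still-running iterations finish before returning the prompt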

I think it makes sense to generalize this and split it up into two
features.

It's a frequent annoyance of mine in the test suite that I'm
e.g. running t*.sh with some parallel "prove" in one screen, and then I
run tABCD*.sh manually in another, and get unlucky because both runs
use the same trash directory and go boom.

You can fix that with --root, which is much of what this patch does. My
one-liner for doing --stress has been something like:

    perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh --root=trash-{} -v'

But it would be great if I didn't have to worry about that and could
just run two concurrent:

    ./t0000-basic.sh

So I think we could just set some env variable so that, instead of the
predictable trash directory, we get a $TRASHDIR.$N as this patch does,
except we'd pick $N by taking an flock() on some test-runs/tABCD.Nth
file for the duration of the run.
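
Something like this in test-lib.sh, say; a rough sketch only, with the
lock file layout and names made up:

    # pick the first free slot by holding an flock(1) on a per-test,
    # per-slot file for the duration of the whole test run
    mkdir -p test-runs
    N=1
    while :
    do
        exec 9>"test-runs/$(basename "$0" .sh).$N.lock"
        flock --nonblock 9 && break   # we own slot $N until this shell exits
        exec 9>&-                     # slot taken, try the next one
        N=$(($N + 1))
    done
    TRASH_DIRECTORY="$TRASH_DIRECTORY.$N"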

Then a stress mode like this would just be:

    GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh'

And sure, we could ship a --stress option too, but it wouldn't be
magical in any way, just another "spawn N in a loop" implementation.
You could also e.g. use GNU parallel to drive it, and without needing
to decide to stress test in advance, since we'd either flock() the
trash dir or just mktemp(1) it.
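
The mktemp(1) route would be even simpler, e.g. (again just a sketch):

    TRASH_DIRECTORY=$(mktemp -d "$TRASH_DIRECTORY.XXXXXX")

though then you'd give up the stable, predictable trash directory name
in the common single-run case.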


