On Mon, Dec 19, 2022 at 04:01:09PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Generate the section report between tests so that the summary report
> always reflects the outcome of the most recent test.  Two usecases are
> envisioned here -- if a cluster-based test runner anticipates that the
> testrun could crash the VM, they can set REPORT_DIR to (say) an NFS
> mount to preserve the intermediate results.  If the VM does indeed
> crash, the scheduler can examine the state of the crashed VM and move
> the tests to another VM.  The second usecase is a reporting agent that
> runs in the VM to upload live results to a test dashboard.

Leah has been working on adding crash recovery for gce-xfstests.  It'll
be interesting to see how her work dovetails with your patches.

The basic design we've worked out has the test framework recognize
whether the VM had previously been running tests.  We keep track of the
last test that was run by hooking into $LOGGER_PROG.  We then use a
python script[1] to append to the xunit file a test result for the test
that was running at the time of the crash, with the result set to
"error", and then we resume running tests from where we had left off.
(A rough sketch of that xunit append is at the end of this message.)

[1] https://github.com/lrumancik/xfstests-bld/blob/ltm-auto-resume-new/test-appliance/files/usr/local/bin/add_error_xunit

To deal with cases where the kernel has deadlocked, when the test VM is
launched by the LTM server, the LTM server monitors the test VM; if it
notices that the test VM has failed to make forward progress within a
set time, it forces the test VM to reboot, at which point the recovery
process described above kicks in.  Eventually, we'll have the LTM
server examine the serial console of the test VM, looking for
indications of kernel panics and RCU / soft lockup warnings, so we can
more quickly force a reboot when the system under test is clearly
unhappy.

The advantage of this design is that it doesn't require using NFS to
store the results, and in theory we don't even need a separate
monitoring VM; we could just use software and kernel watchdogs to
notice when the tests have stopped making forward progress.

					- Ted

P.S.  We're not using section reporting since we generally launch
separate VMs for each "section" so we can speed up the test run time by
sharding across those VMs.  And then we have the LTM server merge the
results together into a single test run report.
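
P.P.S.  For anyone curious, here is a minimal sketch of the kind of
append the recovery script does.  This is illustrative only -- it
assumes a standard JUnit-style xunit file with a <testsuite> element
carrying tests/errors counters, which is my assumption rather than the
exact logic of the real add_error_xunit script:

import xml.etree.ElementTree as ET

def append_crash_error(xunit_path, test_name,
                       message="VM crashed during test"):
    # Sketch only: record an "error" result for the test that was
    # running when the VM went down.
    tree = ET.parse(xunit_path)
    suite = tree.getroot()
    if suite.tag == "testsuites":
        # some writers wrap the suite in a <testsuites> container
        suite = suite[0]

    case = ET.SubElement(suite, "testcase", name=test_name, time="0")
    err = ET.SubElement(case, "error", message=message)
    err.text = message

    # keep the suite-level counters consistent with the appended entry
    suite.set("tests", str(int(suite.get("tests", "0")) + 1))
    suite.set("errors", str(int(suite.get("errors", "0")) + 1))

    tree.write(xunit_path, encoding="UTF-8", xml_declaration=True)

The recovery path would then call append_crash_error() with the test
name recorded via $LOGGER_PROG and resume the run from the next test.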