On Mon, Dec 19, 2022 at 04:01:09PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Generate the section report between tests so that the summary report
> always reflects the outcome of the most recent test.  Two usecases are
> envisioned here -- if a cluster-based test runner anticipates that the
> testrun could crash the VM, they can set REPORT_DIR to (say) an NFS
> mount to preserve the intermediate results.  If the VM does indeed
> crash, the scheduler can examine the state of the crashed VM and move
> the tests to another VM.  The second usecase is a reporting agent that
> runs in the VM to upload live results to a test dashboard.

Leah has been working on adding crash recovery for gce-xfstests.  It'll
be interesting to see how her work dovetails with your patches.

The basic design we've worked out has the test framework recognize
whether the VM had previously been running tests.  We keep track of the
last test that was run by hooking into $LOGGER_PROG.  We then use a
python script[1] to append to the xunit file a test result for the test
that was running at the time of the crash, with the result set to
"error", and then we resume running tests from where we had left off.
(A rough sketch of that xunit append is at the end of this message.)

[1] https://github.com/lrumancik/xfstests-bld/blob/ltm-auto-resume-new/test-appliance/files/usr/local/bin/add_error_xunit

To deal with cases where the kernel has deadlocked, when the test VM is
launched by the LTM server, the LTM server monitors the test VM; if it
notices that the test VM has failed to make forward progress within a
set time, it forces the test VM to reboot, at which point the recovery
process described above kicks in.  Eventually, we'll have the LTM
server examine the serial console of the test VM, looking for
indications of kernel panics and RCU / soft lockup warnings, so we can
more quickly force a reboot when the system under test is clearly
unhappy.

The advantage of this design is that it doesn't require using NFS to
store the results, and in theory we don't even need a separate
monitoring VM; we could just use software and kernel watchdogs to
notice when the tests have stopped making forward progress.

					- Ted

P.S.  We're not using section reporting since we generally launch
separate VMs for each "section" so we can speed up the test run time by
sharding across those VMs.  And then we have the LTM server merge the
results together into a single test run report.
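
P.P.S.  For anyone curious, here is a minimal sketch of the kind of
append the recovery script does.  This is illustrative only -- it
assumes a standard JUnit-style xunit file with a <testsuite> element
carrying tests/errors counters, which is my assumption rather than the
exact logic of the real add_error_xunit script:

import xml.etree.ElementTree as ET

def append_crash_error(xunit_path, test_name,
                       message="VM crashed during test"):
    # Sketch only: record an "error" result for the test that was
    # running when the VM went down.
    tree = ET.parse(xunit_path)
    suite = tree.getroot()
    if suite.tag == "testsuites":
        # some writers wrap the suite in a <testsuites> container
        suite = suite[0]

    case = ET.SubElement(suite, "testcase", name=test_name, time="0")
    err = ET.SubElement(case, "error", message=message)
    err.text = message

    # keep the suite-level counters consistent with the appended entry
    suite.set("tests", str(int(suite.get("tests", "0")) + 1))
    suite.set("errors", str(int(suite.get("errors", "0")) + 1))

    tree.write(xunit_path, encoding="UTF-8", xml_declaration=True)

The recovery path would then call append_crash_error() with the test
name recorded via $LOGGER_PROG and resume the run from the next test.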