On Fri, 2024-02-02 at 19:31 -0500, Willem de Bruijn wrote:
> Jakub Kicinski wrote:
> > On Thu, 1 Feb 2024 11:21:19 -0500 Willem de Bruijn wrote:
> > > This test is time sensitive. It may fail on virtual machines and for
> > > debug builds.
> > >
> > > Continue to run in these environments to get code coverage. But
> > > optionally suppress failure for timing errors (only). This is
> > > controlled with environment variable KSFT_MACHINE_SLOW.
> > >
> > > The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
> > > as previously discussed. Because making so_txtime.c return that and
> > > then making so_txtime.sh capture runs that pass that vs KSFT_FAIL
> > > and pass it on added a bunch of (fragile bash) boilerplate, while the
> > > result is interpreted the same as KSFT_PASS anyway.
> >
> > FWIW another idea that came up when talking to Matthieu -
> > isolate the VMs which run time-sensitive tests to dedicated
> > CPUs. Right now we kick off around 70 4 CPU VMs and let them
> > battle for 72 cores. The machines don't look overloaded but
> > there can be some latency spikes (CPU use diagram attached).
> >
> > So the idea would be to have a handful of special VMs running
> > on dedicated CPUs without any CPU time competition. That could help
> > with latency spikes. But we'd probably need to annotate the tests
> > which need some special treatment.
> >
> > Probably too much work both to annotate tests and set up env,
> > but I thought I'd bring it up here in case you had an opinion.
>
> I'm not sure whether the issue with timing in VMs is CPU affinity.
> Variance may just come from expensive hypercalls, even with a
> dedicated CPU. Though tests can tell.

FTR, I think the CPU affinity setup would be a bit too complex and hard
to reproduce for 3rd parties willing to investigate possible future CI
failures; the current env-variable-based approach would help with
reproducibility.

> There's still the debug builds, as well.

I understand/hope you are investigating it?

Cheers,

Paolo
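
For reference, a minimal sketch of the env-variable-based approach being
discussed, assuming a hypothetical wrapper function and an illustrative
exit-code convention (exit code 2 for "timing error"); this is not the
actual so_txtime.sh logic, just an illustration of how KSFT_MACHINE_SLOW
could gate timing-only failures while the test still runs for coverage:

    #!/bin/bash
    # Sketch only: suppress timing-only failures when KSFT_MACHINE_SLOW
    # is set, but keep running the binary so slow environments still get
    # code coverage. The exit code 2 convention and the run_case helper
    # are hypothetical, chosen for this example.

    run_case() {
        ./so_txtime "$@"
        rc=$?

        if [[ $rc -eq 2 && -n "${KSFT_MACHINE_SLOW}" ]]; then
            echo "warning: ignoring timing error on slow machine" >&2
            return 0    # reported as KSFT_PASS, not KSFT_XFAIL
        fi
        return $rc
    }

This keeps the pass/fail decision in one place in the shell wrapper, so a
3rd party reproducing a CI failure only needs to set or unset one
environment variable rather than replicate the CI's CPU pinning setup.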