Re: The "memory" test is failing in the kvm-unit-tests CI

Thomas Huth <thuth@xxxxxxxxxx> · Mon, 3 Apr 2023 10:23:51 +0200

On 30/03/2023 21.37, Sean Christopherson wrote:
On Thu, Mar 30, 2023, Thomas Huth wrote:
On 29/03/2023 21.11, Sean Christopherson wrote:
On Wed, Mar 29, 2023, Thomas Huth wrote:

   Hi,

I noticed that in recent builds, the "memory" test started failing in the
kvm-unit-tests CI. After doing some experiments, I think it is more likely
related to the environment than to a recent change in the k-u-t sources.

It used to work fine with commit 2480430a here in January:

   https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/3613156199#L2873

I've now re-run the CI with the same commit 2480430a and it is failing:

   https://gitlab.com/thuth/kvm-unit-tests/-/jobs/4022074711#L2733

Can you provide the logs from the failing test, and/or the build artifacts?  I
tried, and failed, to find them on Gitlab.

Yes, that's still missing in the CI scripts ... I'll try to come up with a
patch that provides the logs as artifacts.

Meanwhile, here's a run with a manual "cat logs/memory.log":

https://gitlab.com/thuth/kvm-unit-tests/-/jobs/4029213352#L2726

Seems like these are the failing memory tests:

FAIL: clflushopt (ABSENT)
FAIL: clwb (ABSENT)

More than likely what is happening is that the platform supports CLFLUSHOPT and
CLWB (possibly even via a ucode patch update), but the CPUID bits are not being
enumerated to the guest.  Neither VMX nor SVM has intercept controls for the
instructions, so KVM has no way to enforce the guest's CPUID model.  E.g.
the failures can be reproduced by manually hiding the features:

   rkt ./x86/run x86/memory.flat -smp 1 -cpu max,-clflushopt,-clwb

This isn't a KVM bug because of the virtualization hole.  And really, the test
itself is bogus when running on KVM precisely because of said hole (similar holes
exist for all the other instructions in the test).

The test appears to have been added for QEMU's TCG, which makes sense as there
shouldn't be any virtualization holes in a pure emulation environment.

That said, it is interesting that the test is suddenly failing, as it means
something is buggy.  If you can run commands on the host, check for host support
via /proc/cpuinfo.  If those come back negative (no support), then it would appear
that hardware or the host kernel is in a bad/unexpected state.

   grep -q clflushopt /proc/cpuinfo
   grep -q clwb /proc/cpuinfo
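
Note that grep -q prints nothing and only sets the exit status, so the two
commands above need an "echo $?" or an "&&" to be readable in a CI log. A
minimal sketch of a more verbose variant follows; has_cpu_flag and
report_cpu_flags are hypothetical helper names, not part of kvm-unit-tests,
and the sketch assumes the x86 "flags : ..." line format of /proc/cpuinfo
and a grep that supports -w:

```shell
#!/bin/sh
# Hypothetical helpers (not part of kvm-unit-tests) for checking whether a
# CPU feature flag is listed in /proc/cpuinfo or a saved copy of it.

# has_cpu_flag FLAG [FILE]: exit status 0 if FLAG appears as a whole word
# on a "flags" line of FILE (default: /proc/cpuinfo), non-zero otherwise.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # Restrict the search to "flags" lines, then match whole words only so
    # that a search for "clflush" does not match "clflushopt".
    grep -E '^flags[[:space:]]*:' "$cpuinfo" | grep -qw -- "$flag"
}

# report_cpu_flags [FILE]: print one line for each of the two flags
# discussed in this thread.
report_cpu_flags() {
    cpuinfo=${1:-/proc/cpuinfo}
    for f in clflushopt clwb; do
        if has_cpu_flag "$f" "$cpuinfo"; then
            echo "$f: present"
        else
            echo "$f: absent"
        fi
    done
}
```

Calling report_cpu_flags without arguments checks the running host; passing a
file name checks a saved cpuinfo dump instead.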

I dumped the cpuinfo here:

 https://cirrus-ci.com/task/4861043097206784?logs=main#L22

And indeed, clflushopt and clwb do not show up. It's a nested setup, so I 
guess the flags have been disabled on the L0 host already.

I guess there's not much we can do here except disable the "memory" test
on Cirrus-CI for now...

 Thomas