Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64

Christoffer Dall <christoffer.dall@xxxxxxxxxx> · Tue, 19 Dec 2017 10:06:20 +0100

On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@xxxxxxxxxx> wrote:
> > Hi Shih-Wei,
> >
> > Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> > has been on the kvm-unit-tests' TODO list since it was first introduced.
> >
> > On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> >> The patch provides support for quantifying the cost of micro level
> >> operations on arm64 hardware. The supported operations include hypercall,
> >> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> >> currently obtained using timer counters. Further modifications in KVM
> >> will be required to support timestamping using cycle counters, as KVM
> >> now disables accesses to the PMU counters from the VM.
> >
> > KVM only disables access when userspace tells it to, which it doesn't
> > do by default. Is there something else missing keeping the PMU counters
> > from being used?
> 
> Thanks for the feedback! What I meant by PMU counters here was for
> "CPU cycle counter" specifically. I'm not aware of a way to enable the
> PMU cycle counter from QEMU, did I miss something here?
> 

We always set MDCR_EL2.TPM, meaning that you cannot reliably read a
cycle counter in the guest.

If userspace tells KVM to emulate a PMU, you will get an emulated result
when reading the cycle counter from a guest, instead of an undefined
exception, but you will never access the cycle counter directly.

Here we want to measure round-trip time from the VM through the
hypervisor, and we don't currently count cycles in EL2 with the PMU
emulation, and even if we did, we'd be counting additional round-trip
times, so if the goal is to get more precision than the arch counters,
this won't help you.

What we did for the papers was to hack KVM to not set the TPM bit and
just read the cycle counter directly, but this isn't safe, as the guest
then gets full access to the PMU and can mess with the host.

If it's crucial to measure individual operations on a cycle-accurate
level, then our options are pretty much to either patch KVM when doing
so, or introduce a scary command line parameter, but I'm not thrilled
by the idea.

> >
> >>
> >> We run each of the tests millions of times and output their
> >> average, minimum and maximum cost in timer counts. Instruction barriers
> >
> > Can we reduce the number of iterations and still get valid results? The
> > test takes so long that all of the platforms I tested it on timed out
> > before it completed, except seattle. The default timeout for kvm-unit-
> > tests is 90 seconds. I'd rather a unit test execute in much shorter time
> > than that too, in order to keep people encouraged to run them frequently.
> > If these tests must run a long time, then I think we should add them to
> > the nodefault group.
> 
> I think it's possible to reduce the timeout without losing accuracy. I
> can look into this further.
> 

I think running them for 100,000, or at most 1,000,000, iterations
should be sufficient.  Alternatively, an option to run for a longer time
could be provided?
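
A rough sketch of that idea, in Python purely for illustration (the real
test is C and reads the arch timer counter).  The names measure,
DEFAULT_ITERATIONS and LONG_RUN_ITERATIONS are made up for this example,
and time.perf_counter_ns is only a stand-in for the timer counter read:

```python
import time

# Hypothetical values: a short default, and an opt-in long run.
DEFAULT_ITERATIONS = 100_000
LONG_RUN_ITERATIONS = 1_000_000


def measure(op, iterations=DEFAULT_ITERATIONS, clock=time.perf_counter_ns):
    """Run op() `iterations` times; return (avg, min, max) cost in clock units."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    total = 0
    lo = None
    hi = None
    for _ in range(iterations):
        start = clock()          # timestamp before the operation
        op()
        cost = clock() - start   # timestamp after, minus the start
        total += cost
        lo = cost if lo is None else min(lo, cost)
        hi = cost if hi is None else max(hi, cost)
    return total / iterations, lo, hi
```

A long run would then just be measure(op, LONG_RUN_ITERATIONS), selected
by whatever test parameter we end up with.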

> >
> >> were used before and after taking timestamps to avoid out-of-order
> >> execution or pipelining from skewing our measurements.
> >>
> >> To improve precision in the measurements, one should consider pinning
> >> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> >> could run on that PCPU to skew the results. This can be achieved by
> >> enabling QMP server in the QEMU command in unittest.cfg for micro test,
> >> allowing a client program to get the thread_id for each VCPU thread
> >> from the QMP server. Based on the information, the client program can
> >> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> >> interrupts and tasks from those PCPUs.
> >
> > To isolate the CPUs one would need to boot the host with the isolcpus
> > kernel command line option. Pinning the VCPUs is pretty easy though,
> > so we could provide a script that does that in kvm-unit-tests and then
> > always use it for this test. The script could also warn if we're
> > pinning to CPUs that haven't been isolated.
> >
> 
> My intention was to support VCPU pinning as an optional feature,
> so the users that care about extra precision can add qmp option in
> QEMU config and run the script to pin VCPUs. Otherwise, the test can
> be conducted in a fashion similar to what's done in vmexit on x86.
> 

If we can script VCPU pinning, I think that's preferred.  In our
experiments we never actually saw measurable differences between
isolcpus and simple vcpu pinning when using a high enough number of
iterations, except when looking at things like jitter, which we don't do
for these tests.

That notwithstanding, I think it's an optional feature that can be added
later.
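
For reference, a pinning script could look roughly like the sketch
below.  It is untested against a real guest and rests on a few
assumptions: QEMU was started with a QMP server on a UNIX socket, the
QMP command is query-cpus (newer QEMU versions replace it with
query-cpus-fast, which uses "cpu-index"/"thread-id" instead of
"CPU"/"thread_id"), and the host exposes the isolcpus set in
/sys/devices/system/cpu/isolated.  All function names are made up for
this example:

```python
import json
import os
import socket


def _read_reply(f):
    """Return the next QMP command result, skipping asynchronous events."""
    while True:
        msg = json.loads(f.readline())
        if "error" in msg:
            raise RuntimeError(msg["error"])
        if "return" in msg:
            return msg["return"]


def qmp_query_cpus(sock_path, command="query-cpus"):
    """Fetch the VCPU list from a QMP server listening on a UNIX socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile("rw")
        json.loads(f.readline())  # QMP greeting banner
        for cmd in ("qmp_capabilities", command):
            f.write(json.dumps({"execute": cmd}) + "\n")
            f.flush()
            reply = _read_reply(f)
        return reply


def plan_pinning(cpus, pcpus):
    """Map each VCPU thread id to one PCPU, in VCPU index order."""
    if len(pcpus) < len(cpus):
        raise ValueError("need one PCPU per VCPU")
    ordered = sorted(cpus, key=lambda c: c.get("CPU", c.get("cpu-index")))
    return {c.get("thread_id", c.get("thread-id")): p
            for c, p in zip(ordered, pcpus)}


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "2-3,5" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def not_isolated(plan, isolated_text):
    """Return the planned PCPUs that are missing from the isolated set."""
    return sorted(set(plan.values()) - parse_cpu_list(isolated_text))


def apply_pinning(plan):
    """Pin every VCPU thread to its PCPU (Linux only)."""
    for tid, pcpu in plan.items():
        os.sched_setaffinity(tid, {pcpu})
```

The script would call qmp_query_cpus() on the socket path, build the
plan, print a warning for anything returned by not_isolated(), and then
call apply_pinning().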

> >>
> >> The patch has been tested on arm64 hardware including AMD Seattle and
> >> ThunderX2, which have GICv2 and GICv3 respectively.
> >
> > I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> > completed, the rest timed out.
> 
> I have only tested the code by invoking test directly using make
> standalone like the following. I did notice that it took ~90 seconds
> to finish the test itself.
> ./"tests/micro-cost"
> 

Let's try to bring this down for the next iteration.

Thanks,
-Christoffer