Re: [kvm-unit-tests PATCH] Support micro operation measurement on arm64

Andrew Jones <drjones@xxxxxxxxxx> · Tue, 19 Dec 2017 13:11:09 +0100

On Tue, Dec 19, 2017 at 10:06:20AM +0100, Christoffer Dall wrote:
> On Mon, Dec 18, 2017 at 03:58:49PM -0500, Shih-Wei Li wrote:
> > On Mon, Dec 18, 2017 at 1:14 PM, Andrew Jones <drjones@xxxxxxxxxx> wrote:
> > > Hi Shih-Wei,
> > >
> > > Thanks for doing this! Porting Christoffer's selftests to kvm-unit-tests
> > > has been on the kvm-unit-tests' TODO list since it was first introduced.
> > >
> > > On Fri, Dec 15, 2017 at 04:15:38PM -0500, Shih-Wei Li wrote:
> > >> The patch provides support for quantifying the cost of micro level
> > >> operations on arm64 hardware. The supported operations include hypercall,
> > >> mmio accesses, EOI virtual interrupt, and IPI send. Measurements are
> > >> currently obtained using timer counters. Further modifications in KVM
> > >> will be required to support timestamping using cycle counters, as KVM
> > >> now disables accesses to the PMU counters from the VM.
> > >
> > > KVM only disables access when userspace tells it to, which it doesn't
> > > do by default. Is there something else missing keeping the PMU counters
> > > from being used?
> > 
> > Thanks for the feedback! What I meant by PMU counters here was for
> > "CPU cycle counter" specifically. I'm not aware of a way to enable the
> > PMU cycle counter from QEMU, did I miss something here?
> > 
> 
> We always set MDCR_EL2.TPM, meaning that you cannot reliably read a
> cycle counter in the guest.
> 
> If userspace tells KVM to emulate a PMU, you will get an emulated result
> when reading the cycle counter from a guest, instead of an undefined
> exception, but you will never access the cycle counter directly.

Ah, of course. Real vs. emulated access makes a big difference here.

> 
> Here we want to measure round-trip time from the VM through the
> hypervisor, and we don't currently count cycles in EL2 with the PMU
> emulation, and even if we did, we'd be counting additional round-trip
> times, so if the goal is to get more precision than the arch counters,
> this won't help you.
> 
> What we did for the papers was to hack KVM to not set the TPM bit and
> just read the cycle counter directly, but this isn't safe, as the guest
> then gets full access to the PMU and can mess with the host.
> 
> If it's crucial to measure individual operations on a cycle-accurate
> level, then our options are pretty much to either patch KVM when doing
> so, or introduce a scary command line parameter, but I'm not thrilled
> by the idea.
> 
> > >
> > >>
> > >> We iterate each of the tests millions of times and output their
> > >> average, minimum and maximum cost in timer counts. Instruction barriers
> > >
> > > Can we reduce the number of iterations and still get valid results? The
> > > test takes so long that all the platforms I tested it on timed out
> > > before it completed, except seattle. The default timeout for kvm-unit-
> > > tests is 90 seconds. I'd rather a unit test execute in much shorter time
> > > than that too, in order to keep people encouraged to run them frequently.
> > > If these tests must run a long time, then I think we should add them to
> > > the nodefault group.
> > 
> > I think it's possible to reduce the timeout without losing accuracy. I
> > can look into this further.
> > 
> 
> I think just running them for 100,000 or maximum 1,000,000 times should
> be sufficient.  Alternatively an option to run it for a long time could
> be provided?

Providing a number of iterations option or something, that has a
reasonable default, sounds good to me.
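
Something along these lines is what I have in mind. It's only a sketch of
the accounting, written in Python rather than the test's C. The names
(DEFAULT_ITERATIONS, the ITERATIONS variable, measure) are made up for
illustration, and time.perf_counter_ns() merely stands in for the arch
timer read:

```python
import os
import time

DEFAULT_ITERATIONS = 100_000  # hypothetical default, in the range suggested above


def iterations_from_env(env=os.environ):
    """Return the requested iteration count, falling back to the default."""
    value = env.get("ITERATIONS")  # hypothetical variable name
    if value is None:
        return DEFAULT_ITERATIONS
    count = int(value)
    if count <= 0:
        raise ValueError("ITERATIONS must be positive")
    return count


def measure(operation, iterations, read_counter=time.perf_counter_ns):
    """Time `operation` repeatedly; return (average, minimum, maximum) in ticks."""
    total = 0
    lowest = None
    highest = 0
    for _ in range(iterations):
        start = read_counter()
        operation()
        cost = read_counter() - start
        total += cost
        lowest = cost if lowest is None else min(lowest, cost)
        highest = max(highest, cost)
    return total / iterations, lowest, highest
```

With that shape the default run stays well under the timeout, and anybody
who wants the long run just asks for more iterations.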

> 
> > >
> > >> were used before and after taking timestamps to avoid out-of-order
> > >> execution or pipelining from skewing our measurements.
> > >>
> > >> To improve precision in the measurements, one should consider pinning
> > >> each VCPU to a specific physical CPU (PCPU) and ensure no other task
> > >> could run on that PCPU to skew the results. This can be achieved by
> > >> enabling the QMP server in the QEMU command in unittest.cfg for the micro test,
> > >> allowing a client program to get the thread_id for each VCPU thread
> > >> from the QMP server. Based on the information, the client program can
> > >> then pin the corresponding VCPUs to dedicated PCPUs and isolate
> > >> interrupts and tasks from those PCPUs.
> > >
> > > To isolate the CPUs one would need to boot the host with the isolcpus
> > > kernel command line option. Pinning the VCPUs is pretty easy though,
> > > so we could provide a script that does that in kvm-unit-tests and then
> > > always use it for this test. The script could also warn if we're
> > > pinning to CPUs that haven't been isolated.
> > >
> > 
> > My intention was to support VCPU pinning as an optional feature,
> > so the users that care about extra precision can add qmp option in
> > QEMU config and run the script to pin VCPUs. Otherwise, the test can
> > be conducted in a fashion similar to what's done in vmexit on x86.
> > 
> 
> If we can script VCPU pinning, I think that's preferred.  In our
> experiments we never actually saw measurable differences between
> isolcpus and simple vcpu pinning when using a high enough number of
> iterations, except when looking at things like jitter, which we don't do
> for these tests.
> 
> That notwithstanding, I think it's an optional feature that can be added
> later.

Yeah, let's do it later, but I think doing it makes enough sense that
it's worth writing more bash.
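
To make that concrete, here's roughly what the script needs to do. This is
a sketch in Python rather than bash, purely for illustration. The function
names are made up, and it assumes the QMP reply has already been fetched as
JSON text from a query-cpus style command, with the thread id under
"thread_id" (or "thread-id", as query-cpus-fast reports it):

```python
import json
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "2-5,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def vcpu_thread_ids(qmp_reply):
    """Extract the VCPU thread ids from a QMP reply given as JSON text."""
    tids = []
    for cpu in json.loads(qmp_reply)["return"]:
        # Accept either spelling of the thread id key.
        tids.append(cpu["thread_id"] if "thread_id" in cpu else cpu["thread-id"])
    return tids


def plan_pinning(tids, pcpus, isolated):
    """Pair each VCPU thread with one PCPU; also list chosen PCPUs that are not isolated."""
    if len(pcpus) < len(tids):
        raise ValueError("need at least one PCPU per VCPU")
    plan = list(zip(tids, pcpus))
    not_isolated = [pcpu for _, pcpu in plan if pcpu not in isolated]
    return plan, not_isolated


def apply_pinning(plan):
    """Pin each VCPU thread to its PCPU (Linux only)."""
    for tid, pcpu in plan:
        os.sched_setaffinity(tid, {pcpu})
```

The isolated set could come from parse_cpulist() applied to the contents of
/sys/devices/system/cpu/isolated, and a non-empty second return value from
plan_pinning() is where the warning would get printed.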

> 
> > >>
> > >> The patch has been tested on arm64 hardware including AMD Seattle and
> > >> ThunderX2, which have GICv2 and GICv3 respectively.
> > >
> > > I tried thunderx2, amberwing, mustang, and seattle. Only seattle
> > > completed, the rest timed out.
> > 
> > I have only tested the code by invoking test directly using make
> > standalone like the following. I did notice that it took ~90 seconds
> > to finish the test itself.
> > ./"tests/micro-cost"

standalone still uses timeout with 90 seconds. So your hardware was just
faster than mine, I guess :-)

> > 
> 
> Let's try to bring this down for the next iteration.
> 
> Thanks,
> -Christoffer

Thanks,
drew