Re: [PATCH v2 0/6] x86/kernel/hyper-v: xmm fast hypercall

On Mon, Oct 29, 2018 at 07:43:19PM -0700, Isaku Yamahata wrote:
> On Mon, Oct 29, 2018 at 06:22:14PM +0000,
> Roman Kagan <rkagan@xxxxxxxxxxxxx> wrote:
> > On Wed, Oct 24, 2018 at 09:48:25PM -0700, Isaku Yamahata wrote:
> > > With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> > > gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> > > HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
> > > 
> > > benchmark result:
> > > At the moment my test machine has only pcpu=4, so the ipi benchmark
> > > doesn't show any behaviour change.  So for now I measured the time of
> > > hyperv_flush_tlb_others() with ktap while running 'hardinfo -r -f text'.
> > 
> > This suggests that the guest OS was Linux with your patches 1-4.  What
> > was the hypervisor?  KVM with your patch 5 or Hyper-V proper?
> 
> For patch 1-4, it's hyper-v.
> For patch 5, it's kvm with hyper-v hypercall support.

So you have two result sets?  Which one was in your post?

I'd also be curious to run some IPI- or TLB-flush-sensitive benchmark
with a Windows guest.

> > > hyperv_flush_tlb_others() time by hardinfo -r -f text:
> > > 
> > > with patch:      9931 ns
> > > without patch:  11111 ns
> > > 
> > > 
> > > With commit 4bd06060762b, __send_ipi_mask() now uses the fast hypercall
> > > when possible, which covers the vcpu=4 case.  So I used a kernel without
> > > that patch to measure the effect of the xmm fast hypercall with ipi_benchmark.
> > > The following is the average of 100 runs.
> > > 
> > > ipi_benchmark: average of 100 runs without 4bd06060762b
> > > 
> > > with patch:
> > > Dry-run                 0        495181
> > > Self-IPI         11352737      21549999
> > > Normal IPI      499400218     575433727
> > > Broadcast IPI           0    1700692010
> > > Broadcast lock          0    1663001374
> > > 
> > > without patch:
> > > Dry-run                 0        607657
> > > Self-IPI         10915950      21217644
> > > Normal IPI      621712609     735015570
> > 
> > This is about a 122 ms difference in IPI sending time and 160 ms in
> > total time, i.e. an extra 38 ms for the acknowledge.  AFAICS the
> > acknowledge path should be exactly the same.  Any idea where these
> > additional 38 ms come from?
> > 
> > > Broadcast IPI           0    2173803373
> > 
> > This one is strange, too: the difference should only be on the sending
> > side, and there it should be basically constant with the number of cpus.
> > So I would expect the patched vs unpatched delta to be about the same as
> > for "Normal IPI".  Am I missing something?
> 
> The results seem very sensitive to host activity and so are unstable
> (pcpu=vcpu=4 in the benchmark).
> Since the benchmark should be run on a large machine (vcpu > 64) anyway,

IMO the bigger the vcpu set you want to pass in the hypercall, the less
competitive the xmm fast version is.
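
Just as a back-of-the-envelope (my numbers, from how I read the
hv_send_ipi_ex / hv_tlb_flush_ex / hv_vpset layouts in hyperv-tlfs.h, so
treat them as approximate): the EX parameter block is a small fixed header
plus 8 bytes per 64-VP bank, while the xmm fast input is capped at 112
bytes (RDX + R8 + XMM0..XMM5), so the headroom shrinks quickly as the vcpu
set grows:

/* back-of-the-envelope sizing; struct layouts approximated from
 * hyperv-tlfs.h, numbers are illustrative only */
#include <stdio.h>

#define XMM_FAST_MAX	112	/* RDX + R8 + XMM0..XMM5, per TLFS */

int main(void)
{
	unsigned int vcpus;

	for (vcpus = 64; vcpus <= 1024; vcpus *= 2) {
		unsigned int banks = (vcpus + 63) / 64;
		/* u32 vector + u32 reserved + vpset header + bank array */
		unsigned int ipi_ex   = 8 + 16 + 8 * banks;
		/* u64 address_space + u64 flags + vpset header + bank array
		 * (plus another 8 bytes per gva if a gva list is passed) */
		unsigned int flush_ex = 16 + 16 + 8 * banks;

		printf("%4u vcpus: ipi_ex=%3u flush_ex=%3u bytes -> %s\n",
		       vcpus, ipi_ex, flush_ex,
		       flush_ex <= XMM_FAST_MAX ? "fits xmm fast" : "too big");
	}
	return 0;
}

With these (assumed) layouts the EX calls stop fitting in registers at all
somewhere in the 600-700 vcpu range, and earlier still once you add a gva
list.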

I think realistically every implementation of xmm fast, both in the guest
and in the hypervisor, will actually keep the parameter block in memory
(and so does yours), so the difference between xmm fast and regular
hypercalls is the cost of loading/storing the parameter block to/from
xmm (plus preserving the task's FPU state) in the guest versus mapping the
parameter block in the hypervisor.  The latter is constant (per spec the
parameter block can't cross page boundaries, so it's always exactly one
page), while the former grows with the size of the parameter block.
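
Just to make that concrete, here's roughly what the guest side ends up
doing for an xmm fast call.  This is a sketch of mine, not your patch:
the function name, the fallback handling and the exact asm constraints
are made up, and the register layout is my reading of the TLFS (RDX, R8
and XMM0..XMM5 carry up to 112 bytes of input).  The kernel_fpu_begin()/
kernel_fpu_end() pair is a fixed overhead, the memcpy and the movdqu
loads grow with the block, while the memory-based variant only hands the
hypervisor one GPA to map:

/*
 * Hypothetical guest-side helper, sketched for illustration only -- this
 * is not the actual patch.  Register layout assumed from the TLFS: the
 * control word goes in RCX, input bytes 0-7 in RDX, 8-15 in R8, and
 * 16-111 in XMM0..XMM5.  Real code would also use CALL_NOSPEC rather
 * than a bare indirect call.
 */
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/fpu/api.h>	/* kernel_fpu_begin/end */
#include <asm/mshyperv.h>	/* hv_hypercall_pg, HV_HYPERCALL_FAST_BIT */

#define HV_XMM_FAST_MAX	112	/* RDX + R8 + XMM0..XMM5 */

static u64 hv_do_xmm_fast_hypercall(u64 control, const void *input, size_t len)
{
	u64 buf[HV_XMM_FAST_MAX / 8] = {};
	u64 ctrl = control | HV_HYPERCALL_FAST_BIT;
	u64 in0, hv_status;
	register u64 in1 asm("r8");

	if (len > sizeof(buf) || !hv_hypercall_pg)
		return U64_MAX;	/* caller falls back to the memory-based call */

	/* this copy (and the register loads below) grow with the block */
	memcpy(buf, input, len);

	kernel_fpu_begin();			/* save the task's XMM state */
	in0 = buf[0];
	in1 = buf[1];
	asm volatile("movdqu 0x10(%[p]), %%xmm0\n\t"
		     "movdqu 0x20(%[p]), %%xmm1\n\t"
		     "movdqu 0x30(%[p]), %%xmm2\n\t"
		     "movdqu 0x40(%[p]), %%xmm3\n\t"
		     "movdqu 0x50(%[p]), %%xmm4\n\t"
		     "movdqu 0x60(%[p]), %%xmm5\n\t"
		     "call *%[hcall]"
		     : "=a" (hv_status), "+c" (ctrl), "+d" (in0), "+r" (in1)
		     : [p] "r" (buf), [hcall] "m" (hv_hypercall_pg)
		     : "cc", "memory", "r9", "r10", "r11",
		       "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5");
	kernel_fpu_end();			/* restore the task's XMM state */

	return hv_status;
}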

So I think that if there's no conclusive win on a small machine, there's
no reason to expect one on a big one.

Thanks,
Roman.


