Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE

On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
> 
> On 07.10.2013, at 18:53, Gleb Natapov <gleb@xxxxxxxxxx> wrote:
> 
> > On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
> >> 
> >> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
> >> 
> >>> On 07/10/13 17:04, Alexander Graf wrote:
> >>>> 
> >>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
> >>>> 
> >>>>> On an (even slightly) oversubscribed system, spinlocks are quickly 
> >>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a 
> >>>>> lock to be released, while the vcpu holding the lock may not be 
> >>>>> running at all.
> >>>>> 
> >>>>> This creates contention, and the observed slowdown is 40x for 
> >>>>> hackbench. No, this isn't a typo.
> >>>>> 
> >>>>> The solution is to trap blocking WFEs and tell KVM that we're now
> >>>>> spinning. This ensures that other vcpus will get a scheduling boost,
> >>>>> allowing the lock to be released more quickly.
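
(A minimal sketch of the trap handler this boils down to; the function name kvm_handle_wfx and the HSR_WFI_IS_WFE bit are assumed here, not quoted from the patch itself:)

#include <linux/kvm_host.h>
#include <asm/kvm_emulate.h>

/*
 * WFI and WFE trap into the same exception class; a bit in the HSR
 * tells them apart. A trapped WFE means the guest is spinning on a
 * lock, so donate the timeslice to another vcpu instead of putting
 * this one to sleep.
 */
static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
		kvm_vcpu_on_spin(vcpu);	/* WFE: boost a runnable sibling vcpu */
	else
		kvm_vcpu_block(vcpu);	/* WFI: really wait for an interrupt */

	return 1;
}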
> >>>>> 
> >>>>> From a performance point of view: hackbench 1 process 1000
> >>>>> 
> >>>>> 2xA15 host (baseline):	1.843s
> >>>>> 
> >>>>> 2xA15 guest w/o patch:	2.083s
> >>>>> 4xA15 guest w/o patch:	80.212s
> >>>>> 
> >>>>> 2xA15 guest w/ patch:	2.072s
> >>>>> 4xA15 guest w/ patch:	3.202s
> >>>> 
> >>>> I'm confused. You go from 2.083s when not exiting on spin locks to
> >>>> 2.072s when exiting on _every_ spin lock that didn't immediately
> >>>> succeed. I would've expected the second number to be worse rather
> >>>> than better. I assume it's within jitter, but I'm still puzzled why
> >>>> you don't see any significant drop in performance.
> >>> 
> >>> The key is in the ARM ARM:
> >>> 
> >>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
> >>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
> >>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
> >>> permit the processor to suspend execution."
> >>> 
> >>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
> >>> hence not trapping. Otherwise, performance would go down the drain very
> >>> quickly.
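
(For the trap to fire at all, the hypervisor has to opt in when configuring the guest; a sketch, assuming the HCR_TWE/HCR_TWI bit definitions from asm/kvm_arm.h:)

/*
 * HCR.TWE traps WFE, HCR.TWI traps WFI. Per B1.14.9 above, a WFE
 * only traps when it would otherwise actually suspend the core,
 * i.e. precisely the "blocking spinlock" case; a WFE that finds
 * the event register already set is a NOP and never exits.
 */
vcpu->arch.hcr = HCR_GUEST_MASK | HCR_TWE | HCR_TWI;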
> >> 
> >> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
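
(What x86 does, roughly, in vmx.c: ple_gap and ple_window are kvm_intel module parameters, and a PAUSE loop only triggers an exit after spinning for about ple_window cycles. A sketch from memory, not part of this thread:)

	/*
	 * Pause-Loop Exiting: exit only once the guest has executed
	 * PAUSE in a tight loop for ~ple_window TSC cycles, with no
	 * more than ple_gap cycles between consecutive PAUSEs.
	 * ple_gap == 0 disables PLE entirely.
	 */
	if (ple_gap) {
		vmcs_write32(PLE_GAP, ple_gap);
		vmcs_write32(PLE_WINDOW, ple_window);
	}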
> >> 
> > It will hurt performance if the vcpu that holds the lock is running.
> 
> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
> 
> 
For uncontended locks it makes sense. We need to recheck whether the
x86 assumption is still true there, but x86 locks are ticket locks,
which suffer not only from lock holder preemption but also from lock
waiter preemption, which makes the overcommit problem even worse.
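
(To illustrate the waiter-preemption point, a generic ticket lock sketch, not the kernel's actual arch_spinlock_t:)

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	/*
	 * Strict FIFO: waiters are served in ticket order. If the vcpu
	 * holding the lock is preempted, everyone behind it spins (lock
	 * holder preemption). Worse, if the vcpu holding ticket owner+1
	 * is preempted, the lock can be free and still nobody behind it
	 * can take it (lock waiter preemption).
	 */
	while (atomic_load(&l->owner) != me)
		;	/* spin: this is where x86 PAUSE / ARM WFE sits */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}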

--
			Gleb.