Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx> · Mon, 15 Oct 2012 09:34:55 -0500

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
> >>>>
> >>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>
> [...]
> >>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>> has just terrible scalability to begin with.  I do not think we should
> >>> try to optimize such a bad workload.
> >>>
> >>
> >> I think my way of running dbench has some flaw, so I went to ebizzy.
> >> Could you let me know how you generally run dbench?
> >
> > I mount a tmpfs and then specify that mount for dbench to run on.  This
> > eliminates all IO.  I use a 300 second run time and number of threads is
> > equal to number of vcpus.  All of the VMs of course need to have a
> > synchronized start.
> >
> > I would also make sure you are using a recent kernel for dbench, where
> > the dcache scalability is much improved.  Without any lock-holder
> > preemption, the time in spin_lock should be very low:
> >
> >
> >>      21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>       3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
> >>       2.81%      10176         dbench  dbench              [.] child_run
> >>       2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>       2.33%       8423         dbench  dbench              [.] next_token
> >>       2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>       1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
> >>       1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
> >>       1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
> >>       1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>       1.38%       5009         dbench  libc-2.12.so        [.] memmove
> >>       1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
> >>       1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >
> 
> Hi Andrew,
> I ran the test with dbench with tmpfs. I do not see any improvements in
> dbench for 16k ple window.
> 
> So it seems apart from ebizzy no workload benefited by that. and I
> agree that, it may not be good to optimize for ebizzy.
> I shall drop changing to 16k default window and continue with other
> original patch series. Need to experiment with latest kernel.

Thanks for running this again.  I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window
[with the current ple handling code], but I also do not want to
potentially degrade >1x with a larger window.  I do, however, think
there may be another option.  I have not fully worked this out, but I
think I am on to something.

I decided to revert to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go....   Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the less we have to deal with
the preemption problem.  The other benefit is that yield() has far, far
lower overhead than yield_to().

This does assume that vcpus from the same VM do not share the same
runqueue.  Yielding to a sibling vcpu with yield() is not productive for
larger VMs in the same way that yield_to() is not.  My recent results
include restricting vcpu placement so that sibling vcpus do not get to
run on the same runqueue.  I do believe we could implement an initial
placement and load balance policy to strive for this restriction (making
it purely optional, but I bet it could also help user apps which use
spin locks).
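
To make the placement restriction concrete, here is a rough user-space
sketch in C.  It is not scheduler code and not what was used for the
results in this mail: the struct, the function names, and the modulo
layout are all hypothetical, and it assumes one runqueue per host CPU
and no more vcpus per VM than host CPUs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of where one vcpu is placed (illustration only). */
struct vcpu_placement {
    int vm_id;  /* VM the vcpu belongs to */
    int cpu;    /* host CPU, assuming one runqueue per CPU */
};

/*
 * Hypothetical initial placement: lay the vcpus of successive VMs out
 * back to back and wrap around the host CPUs.  Siblings land on distinct
 * CPUs as long as vcpus_per_vm <= ncpus.
 */
static int initial_cpu(int vm_id, int vcpu_idx, int vcpus_per_vm, int ncpus)
{
    return (vm_id * vcpus_per_vm + vcpu_idx) % ncpus;
}

/* Returns true if no two vcpus of the same VM share a host CPU. */
static bool siblings_on_distinct_runqueues(const struct vcpu_placement *p,
                                           size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (p[i].vm_id == p[j].vm_id && p[i].cpu == p[j].cpu)
                return false;
    return true;
}
```

With eight 20-way VMs on an 80-way host, this layout puts two vcpus on
each host CPU, always from different VMs.  A real policy would also have
to keep the load balancer from undoing it.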

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself.  If we use yield() most of the time, this overhead
will go away.

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE			  426 +/- 11.03%
no PLE w/ gangsched	32001 +/- .37%
PLE with yield()	29207 +/- .28%
PLE with yield_to()	 8175 +/- 1.37%

Yield() is far and away better than yield_to() here (roughly 3.6x) and
almost approaches the gang sched result (about 91% of it).  Here is a
link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU

The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang scheduling without needing to
actually implement gang scheduling.

I did test a smaller VM:

dbench with 10-way VMs, 16 of them on 80-way host:

no PLE			 6248 +/- 7.69%	  
no PLE w/ gangsched	28379 +/- .07%
PLE with yield()	29196 +/- 1.62%
PLE with yield_to()	32217 +/- 1.76%

There is some degradation going from yield_to() to yield() here (about
9%), but it is not nearly as large as the uplift we see on the larger
VMs.  Regardless, I have an idea to fix that: instead of using yield()
all the time, we could use yield_to(), but limit the rate per vcpu to
something like 1 per jiffy.  All other exits use yield().  That rate of
yield_to() should be more than enough for the smaller VMs, and the
result should hopefully be just the same as with the current code.  I
have not coded this up yet, but it's my next step.
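
As a rough illustration of that rate limit (a user-space C sketch under
stated assumptions, not the eventual patch): the per-vcpu struct, the
enum, and ple_choose_action() are hypothetical names, and "now" stands
in for the current jiffies value at the time of the PLE exit.

```c
#include <stdbool.h>

/* Hypothetical per-vcpu bookkeeping; not an existing KVM structure. */
struct vcpu_ple_state {
    unsigned long last_yield_to_jiffy; /* jiffy of the last directed yield */
    bool has_yielded_to;               /* false until the first directed yield */
};

enum ple_action { PLE_YIELD, PLE_YIELD_TO };

/*
 * Decide what a PLE exit should do at time `now` (in jiffies).
 * The first exit a vcpu takes in a given jiffy gets a directed yield;
 * every later exit in that same jiffy falls back to a plain yield.
 */
static enum ple_action ple_choose_action(struct vcpu_ple_state *s,
                                         unsigned long now)
{
    if (!s->has_yielded_to || s->last_yield_to_jiffy != now) {
        s->has_yielded_to = true;
        s->last_yield_to_jiffy = now;
        return PLE_YIELD_TO;
    }
    return PLE_YIELD;
}
```

Comparing for inequality rather than "less than" keeps the check correct
across a jiffies wraparound, at the cost of allowing one extra directed
yield if the counter ever repeats a value.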

I am hopeful that rate-limiting yield_to() will also make the 1x issue
just go away (even with a 4096 ple_window).  The vast majority of exits
will result in yield(), which should be harmless.

Keep in mind this did require ensuring sibling vcpus do not share host
runqueues.  I do think that is possible given some optional scheduler
tweaks.

> 
> (PS: Thanks for pointing towards, perf in latest kernel. It works fine.)
> 
> Results:
> dbench run for 120 sec 30 sec warmup 8 iterations using tmpfs
> base = 3.6.0-rc5 with ple handler optimization patch.
> 
> x => base + ple_window = 4k
> + => base + ple_window = 16k
> * => base + ple_gap = 0
> 
> dbench 1x overcommit case
> =========================
>      N           Min           Max        Median           Avg        Stddev
> x   8        5322.5       5519.05       5482.71     5461.0962     63.522276
> +   8       5255.45       5530.55       5496.94     5455.2137     93.070363
> *   8       5350.85       5477.81      5408.065     5418.4338     44.762697
> 
> 
> dbench 2x overcommit case
> ==========================
> 
>      N           Min           Max        Median           Avg        Stddev
> x   8       3054.32       3194.47       3137.33      3132.625     54.491615
> +   8        3040.8       3148.87      3088.615     3088.1887     32.862336
> *   8       3031.51       3171.99        3083.6     3097.4612     50.526977
> 

-Andrew
