Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with.  I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on.  This
> >>> eliminates all IO.  I use a 300 second run time and number of threads is
> >>> equal to number of vcpus.  All of the VMs of course need to have a
> >>> synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved.  Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>
> >>>>       21.54%      78016         dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>>>        3.51%      12723         dbench  libc-2.12.so        [.] __strchr_sse42
> >>>>        2.81%      10176         dbench  dbench              [.] child_run
> >>>>        2.54%       9203         dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>>>        2.33%       8423         dbench  dbench              [.] next_token
> >>>>        2.02%       7335         dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>>>        1.89%       6850         dbench  libc-2.12.so        [.] __strstr_sse42
> >>>>        1.53%       5537         dbench  libc-2.12.so        [.] __memset_sse2
> >>>>        1.47%       5337         dbench  [kernel.kallsyms]   [k] link_path_walk
> >>>>        1.40%       5084         dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>>>        1.38%       5009         dbench  libc-2.12.so        [.] memmove
> >>>>        1.24%       4496         dbench  libc-2.12.so        [.] vfprintf
> >>>>        1.15%       4169         dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench with tmpfs. I do not see any improvements in
> >> dbench for 16k ple window.
> >>
> >> So it seems apart from ebizzy no workload benefited by that. and I
> >> agree that, it may not be good to optimize for ebizzy.
> >> I shall drop changing to 16k default window and continue with other
> >> original patch series. Need to experiment with latest kernel.
> >
> > Thanks for running this again.  I do believe there are some workloads,
> > when run at 1x overcommit, would benefit from a larger ple_window [with
> > he current ple handling code], but I do not also want to potentially
> > degrade >1x with a larger window.  I do, however, think there may be a
> > another option.  I have not fully worked this out, but I think I am on
> > to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to().  My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go....   Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again.  The more we can do to
> > get all vcpus running at the same time, the far less we deal with the
> > preemption problem.  The other benefit is that yield() is far, far lower
> > overhead than yield_to()
> >
> > This does assume that vcpus from same VM do not share same runqueues.
> > Yielding to a sibling vcpu with yield() is not productive for larger VMs
> > in the same way that yield_to() is not.  My recent results include
> > restricting vcpu placement so that sibling vcpus do not get to run on
> > the same runqueue.  I do believe we could implement a initial placement
> > and load balance policy to strive for this restriction (making it purely
> > optional, but I bet could also help user apps which use spin locks).
> >
> > For 1x VMs which still vm_exit due to PLE, I believe we could probably
> > just leave the ple_window alone, as long as we mostly use yield()
> > instead of yield_to().  The problem with the unneeded exits in this case
> > has been the overhead in routines leading up to yield_to() and the
> > yield_to() itself.  If we use yield() most of the time, this overhead
> > will go away.
> >
> > Here is a comparison of yield_to() and yield():
> >
> > dbench with 20-way VMs, 8 of them on 80-way host:
> >
> > no PLE			  426 +/- 11.03%
> > no PLE w/ gangsched	32001 +/- .37%
> > PLE with yield()	29207 +/- .28%
> > PLE with yield_to()	 8175 +/- 1.37%
> >
> > Yield() is far and way better than yield_to() here and almost approaches
> > gang sched result.  Here is a link for the perf sched map bitmap:
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
> >
> > The thrashing is way down and sibling vcpus tend to run together,
> > approximating the behavior of the gang scheduling without needing to
> > actually implement gang scheduling.
> >
> > I did test a smaller VM:
> >
> > dbench with 10-way VMs, 16 of them on 80-way host:
> >
> > no PLE			 6248 +/- 7.69%	
> > no PLE w/ gangsched	28379 +/- .07%
> > PLE with yield()	29196 +/- 1.62%
> > PLE with yield_to()	32217 +/- 1.76%
> 
> Hi Andrew, Results are encouraging.
> 
> >
> > There is some degrade from yield() to yield_to() here, but nearly as
> > large as the uplift we see on the larger VMs.  Regardless, I have an
> > idea to fix that: Instead of using yield() all the time, we could use
> > yield_to(), but limit the rate per vcpu to something like 1 per jiffie.
> > All other exits use yield().  That rate of yield_to() should be more
> > than enough for the smaller VMs, and the result should be hopefully just
> > the same as the current code.  I have not coded this up yet, but it's my
> > next step.
> 
> I personally feel rate limiting yield_to may be a good idea.
> 
> >
> > I am also hopeful the limitation of yield_to() will also make the 1x
> > issue just go away as well (even with 4096 ple_window).  The vast
> > majority of exits will result in yield() which should be harmless.
> >
> > Keep in mind this did require ensuring sibling vcpus do not share host
> > runqueues -I do think that can be possible given some optional scheduler
> > tweaks.
> 
> I think this is a concern (placing). Having rate limit alone may
> suffice.May be tuning that taking into overcommitted/non-overcommitted
> scenario also into account would be better.
> 
> Okay below is my V2 implementation I am experimenting
> 
> 1) check source -and- target runq to decide on exiting the ple handler
> 2)
> 
> vcpu_on_spin()
> {
> 
>   .....
>   if yield_to_same_vm did not succeed and we are overcommitted
>      yield()
> 
> }
> 
> I think combining your thoughts and (2) complicates scenario a bit.
> anyways let me see how my experiment goes. I will also check how yield
> performs without any pinning.

FWIW, below is the latest with throttling yield_to().  Results were
slightly higher than the above with just yield().  Although I can see an
improvement when not forcing non-shared runqueues among same-VM vcpus
(via binding), it's not as effective.  I am more concerned this problem
requires a multi-part solution, and reducing lock-holder preemption is
the other part (by not allowing sequential execution of same-VM vcpus by
virtue of sharing runqueues).

signed-off-by: Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..595ef3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,6 +153,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;
 
 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..1f0ec36 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/jiffies.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -228,6 +229,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
+	vcpu->last_yield_to = 0;
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -1590,27 +1592,39 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int i;
 
 	/*
+	 * A yield_to() can be quite expensive, so we try to limit its use.
+	 * to one per jiffie.  Subsequent exits just yield the current vcpu
+	 * in hopes of having it run again when the lock holding vcpu
+	 * gets to run again.  This is most effective when vcpus from 
+	 * the same VM do not share a runqueue
+	 */
+	if (me->last_yield_to == jiffies) {
+		yield();
+	} else {
+	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux