On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with. I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on. This
> >>> eliminates all IO. I use a 300-second run time, and the number of
> >>> threads is equal to the number of vcpus. All of the VMs of course need
> >>> to have a synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved. Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>>   21.54%   78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
> >>>>    3.51%   12723  dbench  libc-2.12.so       [.] __strchr_sse42
> >>>>    2.81%   10176  dbench  dbench             [.] child_run
> >>>>    2.54%    9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
> >>>>    2.33%    8423  dbench  dbench             [.] next_token
> >>>>    2.02%    7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
> >>>>    1.89%    6850  dbench  libc-2.12.so       [.] __strstr_sse42
> >>>>    1.53%    5537  dbench  libc-2.12.so       [.] __memset_sse2
> >>>>    1.47%    5337  dbench  [kernel.kallsyms]  [k] link_path_walk
> >>>>    1.40%    5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
> >>>>    1.38%    5009  dbench  libc-2.12.so       [.] memmove
> >>>>    1.24%    4496  dbench  libc-2.12.so       [.] vfprintf
> >>>>    1.15%    4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench on tmpfs. I do not see any improvement in
> >> dbench with the 16k ple_window.
> >>
> >> So it seems that, apart from ebizzy, no workload benefited from it, and
> >> I agree that it may not be good to optimize for ebizzy. I shall drop
> >> changing the default window to 16k and continue with the rest of the
> >> original patch series. I need to experiment with the latest kernel.
> >
> > Thanks for running this again. I do believe there are some workloads,
> > when run at 1x overcommit, that would benefit from a larger ple_window
> > [with the current PLE handling code], but I also do not want to
> > potentially degrade >1x with a larger window. I do, however, think there
> > may be another option. I have not fully worked this out, but I think I
> > am on to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to(). My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go.... Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again. The more we can do to
> > get all vcpus running at the same time, the less we have to deal with
> > the preemption problem. The other benefit is that yield() is far, far
> > lower overhead than yield_to().
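To make that concrete, here is a minimal sketch of what the yield()-only
variant of kvm_vcpu_on_spin() boils down to; this is the idea only, not
the actual tested patch, and it drops all the directed-yield bookkeeping:

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/*
	 * Sketch only: no search for a boost candidate and no
	 * kvm_vcpu_yield_to(). A plain yield() just gives up the cpu;
	 * when sibling vcpus do not share a runqueue, the scheduler
	 * tends to pick another VM's task, deferring this vcpu until
	 * the preempted lock holder has had a chance to run again.
	 */
	yield();
}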
> > This does assume that vcpus from the same VM do not share runqueues.
> > Yielding to a sibling vcpu with yield() is not productive for larger
> > VMs in the same way that yield_to() is not. My recent results include
> > restricting vcpu placement so that sibling vcpus do not get to run on
> > the same runqueue. I do believe we could implement an initial placement
> > and load-balance policy to strive for this restriction (making it
> > purely optional, but I bet it could also help user apps which use spin
> > locks).
> >
> > For 1x VMs which still vm_exit due to PLE, I believe we could probably
> > just leave the ple_window alone, as long as we mostly use yield()
> > instead of yield_to(). The problem with the unneeded exits in this case
> > has been the overhead in the routines leading up to yield_to() and in
> > yield_to() itself. If we use yield() most of the time, this overhead
> > will go away.
> >
> > Here is a comparison of yield_to() and yield():
> >
> > dbench with 20-way VMs, 8 of them on an 80-way host:
> >
> > no PLE                  426 +/- 11.03%
> > no PLE w/ gangsched   32001 +/-   .37%
> > PLE with yield()      29207 +/-   .28%
> > PLE with yield_to()    8175 +/-  1.37%
> >
> > yield() is far and away better than yield_to() here and almost
> > approaches the gang-sched result. Here is a link to the perf sched map
> > bitmap:
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
> >
> > The thrashing is way down, and sibling vcpus tend to run together,
> > approximating the behavior of gang scheduling without needing to
> > actually implement gang scheduling.
> >
> > I did test a smaller VM:
> >
> > dbench with 10-way VMs, 16 of them on an 80-way host:
> >
> > no PLE                 6248 +/-  7.69%
> > no PLE w/ gangsched   28379 +/-   .07%
> > PLE with yield()      29196 +/-  1.62%
> > PLE with yield_to()   32217 +/-  1.76%
>
> Hi Andrew,
> Results are encouraging.
>
> > There is some degradation with yield() compared to yield_to() here, but
> > it is not nearly as large as the uplift we see on the larger VMs.
> > Regardless, I have an idea to fix that: instead of using yield() all
> > the time, we could use yield_to(), but limit its rate per vcpu to
> > something like one per jiffy. All other exits use yield(). That rate of
> > yield_to() should be more than enough for the smaller VMs, and the
> > result should hopefully be just the same as the current code. I have
> > not coded this up yet, but it's my next step.
>
> I personally feel rate-limiting yield_to() may be a good idea.
>
> > I am also hopeful that limiting yield_to() will make the 1x issue just
> > go away as well (even with a 4096 ple_window). The vast majority of
> > exits will result in yield(), which should be harmless.
> >
> > Keep in mind this did require ensuring sibling vcpus do not share host
> > runqueues - I do think that can be possible given some optional
> > scheduler tweaks.
>
> I think this is a concern (placement). Having the rate limit alone may
> suffice. Maybe tuning it to also take the overcommitted/non-overcommitted
> scenario into account would be better.
>
> Okay, below is the V2 implementation I am experimenting with:
>
> 1) check the source -and- target runq to decide on exiting the ple
>    handler
> 2)
>
> vcpu_on_spin()
> {
>
> .....
> if yield_to_same_vm did not succeed and we are overcommitted
>     yield()
>
> }
>
> I think combining your thoughts and (2) complicates the scenario a bit.
> Anyway, let me see how my experiment goes. I will also check how yield()
> performs without any pinning.
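Expressed as C, the pseudocode in (2) might look like the following
sketch; kvm_vcpu_yield_to_same_vm() and kvm_overcommitted() are
hypothetical helpers named only for illustration, not existing kernel
functions:

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/* first try a directed yield to a preempted sibling vcpu */
	if (kvm_vcpu_yield_to_same_vm(me))	/* hypothetical helper */
		return;

	/*
	 * If the directed yield failed and the host is overcommitted,
	 * give up the cpu so some other VM's task (possibly the one
	 * preempting our lock holder) can run.
	 */
	if (kvm_overcommitted())		/* hypothetical helper */
		yield();
}

The attraction of (2) is that the expensive directed yield is attempted
only once per exit before falling back to the cheap yield().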
FWIW, below is the latest with throttling of yield_to(). Results were
slightly higher than the above with just yield(). Although I can see an
improvement when not forcing non-shared runqueues among same-VM vcpus
(via binding), it's not as effective. I am increasingly concerned that
this problem requires a multi-part solution, and reducing lock-holder
preemption is the other part (by not allowing sequential execution of
same-VM vcpus by virtue of sharing runqueues).

Signed-off-by: Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..595ef3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,6 +153,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;
 
 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..1f0ec36 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/jiffies.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -228,6 +229,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
+	vcpu->last_yield_to = 0;
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -1590,27 +1592,39 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int i;
 
 	/*
+	 * A yield_to() can be quite expensive, so we try to limit its use
+	 * to one per jiffy. Subsequent exits just yield the current vcpu
+	 * in hopes of having it run again when the lock-holding vcpu
+	 * gets to run again. This is most effective when vcpus from
+	 * the same VM do not share a runqueue.
+	 */
+	if (me->last_yield_to == jiffies) {
+		yield();
+	} else {
+	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run. Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}
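As an aside on the binding mentioned above (keeping same-VM vcpus on
different runqueues): one way to approximate it from userspace is
per-thread cpu affinity. A rough sketch, assuming the vcpu thread ids
have already been collected (for example from /proc/<qemu-pid>/task):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/*
 * Pin one vcpu thread to one host cpu so that sibling vcpus of a VM
 * never share a runqueue. Returns 0 on success, -1 on error.
 */
static int pin_vcpu_thread(pid_t tid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* given a tid, sched_setaffinity() affects just that thread */
	return sched_setaffinity(tid, sizeof(set), &set);
}

An initial-placement and load-balance policy in the scheduler, as
discussed above, would of course be preferable to static pinning.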