On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
> >>
> >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>
> >>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
> >>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
> >>>>> interrupt, I can't believe we're spending 84% of the time running the
> >>>>> popf instruction.
> >>>>
> >>>> Smells like a software fallback that doesn't do NMI, hrtimer based
> >>>> sampling typically hits popf where we re-enable interrupts.
> >>>
> >>> Good nose, that's probably it. Raghavendra, can you ensure that the PMU
> >>> is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
> >>> host will expose it (and a good idea anyway to get best performance).
> >>>
> >>
> >> Hi Avi, you are right. SandyBridge machine result was not proper.
> >> I cleaned up the services, enabled PMU, re-ran all the test again.
> >>
> >> Here is the summary:
> >> We do get good benefit by increasing ple window. Though we don't
> >> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> >> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> >>
> >> Let me know if you think we can increase the default ple_window
> >> itself to 16k.
> >>
> >> I am experimenting with V2 version of undercommit improvement(this) patch
> >> series, But I think if you wish to go for increase of
> >> default ple_window, then we would have to measure the benefit of patches
> >> when ple_window = 16k.
> >>
> >> I can respin the whole series including this default ple_window change.
> >>
> >> I also have the perf kvm top result for both ebizzy and kernbench.
> >> I think they are in expected lines now.
> >>
> >> Improvements
> >> ================
> >>
> >> 16 core PLE machine with 16 vcpu guest
> >>
> >> base = 3.6.0-rc5 + ple handler optimization patches
> >> base_pleopt_16k = base + ple_window = 16k
> >> base_pleopt_32k = base + ple_window = 32k
> >> base_pleopt_nople = base + ple_gap = 0
> >> kernbench, hackbench, sysbench (time in sec lower is better)
> >> ebizzy (rec/sec higher is better)
> >>
> >> % improvements w.r.t base (ple_window = 4k)
> >> ---------------+---------------+-----------------+-------------------+
> >>                |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> >> ---------------+---------------+-----------------+-------------------+
> >> kernbench_1x   |      0.42371  |       1.15164   |        0.09320    |
> >> kernbench_2x   |     -1.40981  |     -17.48282   |     -570.77053    |
> >> ---------------+---------------+-----------------+-------------------+
> >> sysbench_1x    |     -0.92367  |       0.24241   |       -0.27027    |
> >> sysbench_2x    |     -2.22706  |      -0.30896   |       -1.27573    |
> >> sysbench_3x    |     -0.75509  |       0.09444   |       -2.97756    |
> >> ---------------+---------------+-----------------+-------------------+
> >> ebizzy_1x      |     54.99976  |      67.29460   |       74.14076    |
> >> ebizzy_2x      |     -8.83386  |     -27.38403   |      -96.22066    |
> >> ---------------+---------------+-----------------+-------------------+
> >>
> >> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> >> ========================================================================
> >
> > Is the perf data for 1x overcommit?
>
> Yes, 16vcpu guest on 16 core
>
> >
> >> pleopt ple_gap=0
> >> --------------------
> >> ebizzy : 18131 records/s
> >> 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
> >>  5.65% [guest.kernel] [g] smp_call_function_many
> >>  3.12% [guest.kernel] [g] clear_page
> >>  3.02% [guest.kernel] [g] down_read_trylock
> >>  1.85% [guest.kernel] [g] async_page_fault
> >>  1.81% [guest.kernel] [g] up_read
> >>  1.76% [guest.kernel] [g] native_apic_mem_write
> >>  1.70% [guest.kernel] [g] find_vma
> >
> > Does 'perf kvm top' not give host samples at the same time? Would be
> > nice to see the host overhead as a function of varying ple window. I
> > would expect that to be the major difference between 4/16/32k window
> > sizes.
>
> No, I did something like this
> perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes that is a
> good idea.
>
> (I am getting some segfaults with perf top, I think it is already fixed
> but yet to see the patch that fixes)
>
> >
> >
> >
> > A big concern I have (if this is 1x overcommit) for ebizzy is that it
> > has just terrible scalability to begin with. I do not think we should
> > try to optimize such a bad workload.
> >
>
> I think my way of running dbench has some flaw, so I went to ebizzy.
> Could you let me know how you generally run dbench?

I mount a tmpfs and then specify that mount for dbench to run on. This
eliminates all IO. I use a 300 second run time and number of threads is
equal to number of vcpus. All of the VMs of course need to have a
synchronized start.

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved. Without any lock-holder
preemption, the time in spin_lock should be very low:

> 21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
>  3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
>  2.81%  10176  dbench  dbench             [.] child_run
>  2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
>  2.33%   8423  dbench  dbench             [.] next_token
>  2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
>  1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
>  1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
>  1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
>  1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
>  1.38%   5009  dbench  libc-2.12.so       [.] memmove
>  1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
>  1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit

-Andrew
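
For reference, the dbench setup described above (tmpfs mount, 300 second
run, one thread per vcpu) could be scripted roughly as follows. This is
only a sketch under assumptions: the mount point /mnt/dbench-tmpfs and the
helper name dbench_commands are made up for illustration, the vcpu count
is taken from os.cpu_count() inside the guest, and the synchronized start
across VMs is not handled here. The helper only builds the command lines,
it does not run them.

```python
import os
import shlex


def dbench_commands(mount_point="/mnt/dbench-tmpfs", runtime_s=300, threads=None):
    """Build (but do not run) the commands for a tmpfs-backed dbench run.

    mount_point is a hypothetical path; threads defaults to the number of
    CPUs visible to this process, which inside a guest is the vcpu count.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        # Create the mount point and back it with tmpfs so dbench does no disk IO.
        ["mkdir", "-p", mount_point],
        ["mount", "-t", "tmpfs", "tmpfs", mount_point],
        # -D: working directory, -t: run time in seconds, last argument: number of clients.
        ["dbench", "-D", mount_point, "-t", str(runtime_s), str(threads)],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed, then run as root in each guest.
    for cmd in dbench_commands():
        print(shlex.join(cmd))
```

The commands are printed rather than executed so that the same list can be
pushed to every guest and started at the same moment by whatever mechanism
is used to synchronize the VMs.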