On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> > user   nice   system   irq    softirq   guest   idle    iowait
> > 5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
> > total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> > user   nice   system   irq    softirq   guest   idle    iowait
> > 5.81   0.00   9.48     0.08   1.04      21.32   57.86   4.41
> > total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%
>
> That was an easy fruit to pick.  Too bad it was a regression that we
> introduced.
>
> > new oprofile:
> >
> >> samples   %        app name                 symbol name
> >> 876648    54.1555  kvm-intel.ko             vmx_vcpu_run
> >> 37595      2.3225  qemu-system-x86_64       cpu_physical_memory_rw
> >> 35623      2.2006  qemu-system-x86_64       phys_page_find_alloc
> >> 24874      1.5366  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_write_msr_safe
> >> 17710      1.0940  libc-2.5.so              memcpy
> >> 14664      0.9059  kvm.ko                   kvm_arch_vcpu_ioctl_run
> >> 14577      0.9005  qemu-system-x86_64       qemu_get_ram_ptr
> >> 12528      0.7739  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_read_msr_safe
> >> 10979      0.6782  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  copy_user_generic_string
> >> 9979       0.6165  qemu-system-x86_64       virtqueue_get_head
> >> 9371       0.5789  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  schedule
> >> 8333       0.5148  qemu-system-x86_64       virtqueue_avail_bytes
> >> 7899       0.4880  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fget_light
> >> 7289       0.4503  qemu-system-x86_64       main_loop_wait
> >> 7217       0.4458  qemu-system-x86_64       lduw_phys
> >>
>
> This is almost entirely host virtio.  I can reduce native_write_msr_safe
> by a bit, but not much.
>
> >> 6821       0.4214  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_exit
> >> 6749       0.4169  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  do_select
> >> 5919       0.3657  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_entry
> >> 5466       0.3377  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  kfree
> >> 4887       0.3019  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fput
> >> 4689       0.2897  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __switch_to
> >> 4636       0.2864  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  mwait_idle
>
> Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way?  (OK, I am not running SMT now,
but eventually we will be.)  A long time ago, we tested P4s with HT, and a
polling idle in one thread always negatively impacted performance in the
sibling thread.  FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads).  I have
been running into a problem with the virtio storage driver for Windows on
anything much newer than kvm-87.  I compiled the driver from the new git
tree; it installed OK, but I still got the same error.  Finally, I removed
the serial number feature from virtio-blk in qemu, and I can now get the
driver to work in Windows.  So, not really any good news on performance
with the latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user   nice   system   irq    softirq   guest   idle    iowait
5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user   nice   system   irq    softirq   guest   idle    iowait
6.57   0.00   10.86    0.08   1.02      21.35   55.90   4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user   nice   system   irq    softirq   guest   idle    iowait
6.60   0.00   10.91    0.09   1.03      21.35   55.71   4.31
total busy: 39.98

Diff of profiles, p1=qemu-kvm-87, p2=qemu-master:

> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> unit mask of 0x00 (No unit mask) count 10000000
> total samples (ts1) for profile1 is 1616921
> total samples (ts2) for profile2 is 1752347 (includes multiplier of 0.995420)
> functions which have abs(pct2-pct1) < 0.06 are not displayed
>
>                                   pct2:     pct1:
>                                   100*      100*      pct2
>  s1        s2        s2/s1       s2/ts1    s1/ts1    -pct1   symbol                bin
> --------- --------- -------     -------   -------    ------  ------                ---
>    879611    907883   1.03/1     56.149    54.400     1.749  vmx_vcpu_run          kvm
>       614     11553  18.82/1      0.715     0.038     0.677  gfn_to_memslot_unali  kvm.ko
>     34511     44922   1.30/1      2.778     2.134     0.644  phys_page_find_alloc  qemu
>      2866      9334   3.26/1      0.577     0.177     0.400  paging64_walk_addr    kvm.ko
>     11139     17200   1.54/1      1.064     0.689     0.375  copy_user_generic_st  vmlinux
>      3100      7108   2.29/1      0.440     0.192     0.248  x86_decode_insn       kvm.ko
>      8169     11873   1.45/1      0.734     0.505     0.229  virtqueue_avail_byte  qemu
>      1103      4540   4.12/1      0.281     0.068     0.213  kvm_read_guest        kvm.ko
>     17427     20401   1.17/1      1.262     1.078     0.184  memcpy                libc
>         0      2905               0.180     0.000     0.180  gfn_to_pfn            kvm.ko
>      1831      4328   2.36/1      0.268     0.113     0.154  x86_emulate_insn      kvm.ko
>        65      2431  37.41/1      0.150     0.004     0.146  emulator_read_emulat  kvm.ko
>     14922     17196   1.15/1      1.064     0.923     0.141  qemu_get_ram_ptr      qemu
>       545      2724   5.00/1      0.168     0.034     0.135  emulate_instruction   kvm.ko
>       599      2464   4.11/1      0.152     0.037     0.115  kvm_read_guest_page   kvm.ko
>       503      2355   4.68/1      0.146     0.031     0.115  gfn_to_hva            kvm.ko
>      1076      2918   2.71/1      0.181     0.067     0.114  memcpy_c              vmlinux
>       594      2241   3.77/1      0.139     0.037     0.102  next_segment          kvm.ko
>      1680      3248   1.93/1      0.201     0.104     0.097  pipe_poll             vmlinux
>         0      1463               0.090     0.000     0.090  subpage_readl         qemu
>         0      1363               0.084     0.000     0.084  msix_enabled          qemu
>       527      1883   3.57/1      0.116     0.033     0.084  paging64_gpte_to_gfn  kvm.ko
>       962      2223   2.31/1      0.138     0.059     0.078  do_insn_fetch         kvm.ko
>       348      1605   4.61/1      0.099     0.022     0.078  is_rsvd_bits_set      kvm.ko
>       520      1763   3.39/1      0.109     0.032     0.077  unalias_gfn           kvm.ko
>         1      1163 1163.65/1     0.072     0.000     0.072  tdp_page_fault        kvm.ko
>      3827      4912   1.28/1      0.304     0.237     0.067  __down_read           vmlinux
>         0      1014               0.063     0.000     0.063  mapping_level         kvm.ko
>       973         0               0.000     0.060    -0.060  pm_ioport_readl       qemu
>      1635       528   1/3.09      0.033     0.101    -0.068  ioport_read           qemu
>      2179      1017   1/2.14      0.063     0.135    -0.072  kvm_emulate_pio       kvm.ko
>     25141     23722   1/1.06      1.467     1.555    -0.088  native_write_msr_saf  vmlinux
>      1560         0               0.000     0.096    -0.096  eventfd_poll          vmlinux
>                                  -------   -------    ------
>                                  105.100    97.450     7.650

18x more samples for gfn_to_memslot_unali*, 37x for emulator_read_emula*,
and more CPU time in guest mode.

One other thing I decided to try was some CPU binding.  I know this is not
practical for production, but I wanted to see if there's any benefit at
all.  One reason was that a coworker here tried binding the qemu thread
for the vcpu and the qemu I/O thread to the same CPU.  On a networking
test, guest->local-host, throughput was up about 2x.  Obviously there was
a nice effect of being on the same cache.  I wondered, even without
full-bore throughput tests, could we see any benefit here?
So, I bound each pair of VMs to a dedicated core.  What I saw was about a
6% improvement in performance.  For a system which has pretty incredible
memory performance and is not that busy, I was surprised that I got 6%.
I am not advocating binding, but what I do wonder is: on 1-way VMs, if we
kept all of a VM's qemu threads together on the same CPU, while still
allowing the scheduler to move them (all of them at once) to different
CPUs over time, would we see the same benefit?

One other thing: so far I have not been using preadv/pwritev.  I assume I
need a more recent glibc (I am on 2.5 now) for qemu to take advantage of
this?

Thanks!

-Andrew
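P.S. For concreteness, here is roughly the kind of per-thread binding I
have in mind.  It is only an illustrative sketch, not the exact method
used for the runs above: the helper name and the tid/core arguments are
made up, but the calls are standard.  From the shell this is just
"taskset -p -c <core> <tid>"; programmatically it is sched_setaffinity()
on each thread id found under /proc/<pid>/task/.

/* bindthread.c -- hypothetical helper, not part of qemu: pin one task
 * (e.g. a qemu vcpu thread or the I/O thread, identified by its tid)
 * to a single core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    cpu_set_t set;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <core>\n", argv[0]);
        return 1;
    }

    CPU_ZERO(&set);
    CPU_SET(atoi(argv[2]), &set);

    /* Passing a tid rather than a process id affects just that thread. */
    if (sched_setaffinity(atoi(argv[1]), sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

The interesting variant would be doing this for a VM's vcpu thread and
I/O thread together, as in my coworker's networking test, rather than
whole qemu processes.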
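P.P.S. On the preadv/pwritev question: my understanding (an assumption
worth double-checking) is that glibc only gained the preadv()/pwritev()
wrappers around 2.10, on top of kernel 2.6.30, so glibc 2.5 would indeed
be too old.  A quick standalone test, along the lines of what a
configure-style probe would compile, is:

/* preadv_check.c -- if this compiles, links, and exits 0, the toolchain
 * and glibc provide a working preadv(). */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    char buf[16];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int fd = open("/dev/zero", O_RDONLY);

    if (fd < 0)
        return 1;
    /* preadv() reads into the iovec at the given offset without
     * moving the file position. */
    return preadv(fd, &iov, 1, 0) == (ssize_t)sizeof(buf) ? 0 : 1;
}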