On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> > user   nice   system   irq    softirq   guest   idle    iowait
> > 5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
> > total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> > user   nice   system   irq    softirq   guest   idle    iowait
> > 5.81   0.00   9.48     0.08   1.04      21.32   57.86   4.41
> > total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%
>
> That was an easy fruit to pick.  Too bad it was a regression that we
> introduced.
>
> > new oprofile:
> >
> >> samples   %        app name                 symbol name
> >> 876648    54.1555  kvm-intel.ko             vmx_vcpu_run
> >> 37595      2.3225  qemu-system-x86_64       cpu_physical_memory_rw
> >> 35623      2.2006  qemu-system-x86_64       phys_page_find_alloc
> >> 24874      1.5366  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_write_msr_safe
> >> 17710      1.0940  libc-2.5.so              memcpy
> >> 14664      0.9059  kvm.ko                   kvm_arch_vcpu_ioctl_run
> >> 14577      0.9005  qemu-system-x86_64       qemu_get_ram_ptr
> >> 12528      0.7739  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_read_msr_safe
> >> 10979      0.6782  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  copy_user_generic_string
> >> 9979       0.6165  qemu-system-x86_64       virtqueue_get_head
> >> 9371       0.5789  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  schedule
> >> 8333       0.5148  qemu-system-x86_64       virtqueue_avail_bytes
> >> 7899       0.4880  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fget_light
> >> 7289       0.4503  qemu-system-x86_64       main_loop_wait
> >> 7217       0.4458  qemu-system-x86_64       lduw_phys
> >>
>
> This is almost entirely host virtio.  I can reduce native_write_msr_safe
> by a bit, but not much.
>
> >> 6821       0.4214  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_exit
> >> 6749       0.4169  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  do_select
> >> 5919       0.3657  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_entry
> >> 5466       0.3377  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  kfree
> >> 4887       0.3019  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fput
> >> 4689       0.2897  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __switch_to
> >> 4636       0.2864  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  mwait_idle
>
> Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way?  (OK, I am not running SMT now,
but eventually we will be.)  A long time ago, we tested P4s with HT, and a
polling idle in one thread always negatively impacted performance in the
sibling thread.  FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads).  I have
been running into a problem with the virtio storage driver for Windows on
anything much newer than kvm-87.  I compiled the driver from the new git
tree; it installed OK, but I still got the same error.  Finally, I removed
the serial number feature from virtio-blk in qemu, and I can now get the
driver to work in Windows.  So, not really any good news on performance
with the latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user   nice   system   irq    softirq   guest   idle    iowait
5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user   nice   system   irq    softirq   guest   idle    iowait
6.57   0.00   10.86    0.08   1.02      21.35   55.90   4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user   nice   system   irq    softirq   guest   idle    iowait
6.60   0.00   10.91    0.09   1.03      21.35   55.71   4.31
total busy: 39.98

Diff of profiles, p1=qemu-kvm-87, p2=qemu-master:

> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> unit mask of 0x00 (No unit mask) count 10000000
> total samples (ts1) for profile1 is 1616921
> total samples (ts2) for profile2 is 1752347 (includes multiplier of 0.995420)
> functions which have abs(pct2-pct1) < 0.06 are not displayed
>
>                                   pct2:     pct1:
>                                   100*      100*      pct2
>  s1        s2        s2/s1       s2/ts1    s1/ts1    -pct1   symbol                bin
> --------- --------- -------     -------   -------    ------  ------                ---
>    879611    907883   1.03/1     56.149    54.400     1.749  vmx_vcpu_run          kvm
>       614     11553  18.82/1      0.715     0.038     0.677  gfn_to_memslot_unali  kvm.ko
>     34511     44922   1.30/1      2.778     2.134     0.644  phys_page_find_alloc  qemu
>      2866      9334   3.26/1      0.577     0.177     0.400  paging64_walk_addr    kvm.ko
>     11139     17200   1.54/1      1.064     0.689     0.375  copy_user_generic_st  vmlinux
>      3100      7108   2.29/1      0.440     0.192     0.248  x86_decode_insn       kvm.ko
>      8169     11873   1.45/1      0.734     0.505     0.229  virtqueue_avail_byte  qemu
>      1103      4540   4.12/1      0.281     0.068     0.213  kvm_read_guest        kvm.ko
>     17427     20401   1.17/1      1.262     1.078     0.184  memcpy                libc
>         0      2905               0.180     0.000     0.180  gfn_to_pfn            kvm.ko
>      1831      4328   2.36/1      0.268     0.113     0.154  x86_emulate_insn      kvm.ko
>        65      2431  37.41/1      0.150     0.004     0.146  emulator_read_emulat  kvm.ko
>     14922     17196   1.15/1      1.064     0.923     0.141  qemu_get_ram_ptr      qemu
>       545      2724   5.00/1      0.168     0.034     0.135  emulate_instruction   kvm.ko
>       599      2464   4.11/1      0.152     0.037     0.115  kvm_read_guest_page   kvm.ko
>       503      2355   4.68/1      0.146     0.031     0.115  gfn_to_hva            kvm.ko
>      1076      2918   2.71/1      0.181     0.067     0.114  memcpy_c              vmlinux
>       594      2241   3.77/1      0.139     0.037     0.102  next_segment          kvm.ko
>      1680      3248   1.93/1      0.201     0.104     0.097  pipe_poll             vmlinux
>         0      1463               0.090     0.000     0.090  subpage_readl         qemu
>         0      1363               0.084     0.000     0.084  msix_enabled          qemu
>       527      1883   3.57/1      0.116     0.033     0.084  paging64_gpte_to_gfn  kvm.ko
>       962      2223   2.31/1      0.138     0.059     0.078  do_insn_fetch         kvm.ko
>       348      1605   4.61/1      0.099     0.022     0.078  is_rsvd_bits_set      kvm.ko
>       520      1763   3.39/1      0.109     0.032     0.077  unalias_gfn           kvm.ko
>         1      1163 1163.65/1     0.072     0.000     0.072  tdp_page_fault        kvm.ko
>      3827      4912   1.28/1      0.304     0.237     0.067  __down_read           vmlinux
>         0      1014               0.063     0.000     0.063  mapping_level         kvm.ko
>       973         0               0.000     0.060    -0.060  pm_ioport_readl       qemu
>      1635       528   1/3.09      0.033     0.101    -0.068  ioport_read           qemu
>      2179      1017   1/2.14      0.063     0.135    -0.072  kvm_emulate_pio       kvm.ko
>     25141     23722   1/1.06      1.467     1.555    -0.088  native_write_msr_saf  vmlinux
>      1560         0               0.000     0.096    -0.096  eventfd_poll          vmlinux
>                                  -------   -------    ------
>                                  105.100    97.450     7.650

18x more samples for gfn_to_memslot_unali*, 37x for emulator_read_emula*,
and more CPU time in guest mode.

One other thing I decided to try was some CPU binding.  I know this is not
practical for production, but I wanted to see if there's any benefit at
all.  One reason was that a coworker here tried binding the qemu thread
for the vcpu and the qemu I/O thread to the same CPU.  On a networking
test, guest->local-host, throughput was up about 2x.  Obviously there was
a nice effect of being on the same cache.  I wondered, even without
full-bore throughput tests, could we see any benefit here?
So, I bound each pair of VMs to a dedicated core.  What I saw was about a
6% improvement in performance.  For a system which has pretty incredible
memory performance and is not that busy, I was surprised that I got 6%.
I am not advocating binding, but what I do wonder is: on 1-way VMs, if we
kept all of a VM's qemu threads together on the same CPU, while still
allowing the scheduler to move them (all of them at once) to different
CPUs over time, would we see the same benefit?

One other thing: so far I have not been using preadv/pwritev.  I assume I
need a more recent glibc (I am on 2.5 now) for qemu to take advantage of
this?

Thanks!

-Andrew
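P.S. For concreteness, here is roughly the kind of per-thread binding I
have in mind.  It is only an illustrative sketch, not the exact method
used for the runs above: the helper name and the tid/core arguments are
made up, but the calls are standard.  From the shell this is just
"taskset -p -c <core> <tid>"; programmatically it is sched_setaffinity()
on each thread id found under /proc/<pid>/task/.

/* bindthread.c -- hypothetical helper, not part of qemu: pin one task
 * (e.g. a qemu vcpu thread or the I/O thread, identified by its tid)
 * to a single core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    cpu_set_t set;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <core>\n", argv[0]);
        return 1;
    }

    CPU_ZERO(&set);
    CPU_SET(atoi(argv[2]), &set);

    /* Passing a tid rather than a process id affects just that thread. */
    if (sched_setaffinity(atoi(argv[1]), sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

The interesting variant would be doing this for a VM's vcpu thread and
I/O thread together, as in my coworker's networking test, rather than
whole qemu processes.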
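P.P.S. On the preadv/pwritev question: my understanding (an assumption
worth double-checking) is that glibc only gained the preadv()/pwritev()
wrappers around 2.10, on top of kernel 2.6.30, so glibc 2.5 would indeed
be too old.  A quick standalone test, along the lines of what a
configure-style probe would compile, is:

/* preadv_check.c -- if this compiles, links, and exits 0, the toolchain
 * and glibc provide a working preadv(). */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    char buf[16];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int fd = open("/dev/zero", O_RDONLY);

    if (fd < 0)
        return 1;
    /* preadv() reads into the iovec at the given offset without
     * moving the file position. */
    return preadv(fd, &iov, 1, 0) == (ssize_t)sizeof(buf) ? 0 : 1;
}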