On Mon, Feb 3, 2014 at 9:31 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> Il 03/02/2014 18:06, Xin Tong ha scritto:
>
>> /.../qemu-system-x86_64 TID 2537 [TID 2537] (877 ticks/71.24%)
>
> This is the CPU thread (calls into the KVM modules).

This is the thread that calls kvm_cpu_exec(CPUArchState *env) and then the ioctl into the kernel.

>>     /.../vmlinux (395 ticks/45.04%)
>>     /kvm (198 ticks/22.58%)
>>     /kvm_intel (153 ticks/17.45%)
>>     /.../qemu-system-x86_64 (71 ticks/8.10%)
>>     /.../libc-2.15.so (47 ticks/5.36%)
>>     /.../libpthread-2.15.so (12 ticks/1.37%)
>>     /libahci (1 ticks/0.11%)
>>   /.../qemu-system-x86_64 TID 2658 [TID 2658] (200 ticks/16.25%)
>
> This is probably the VNC thread.
>
>>     /.../vmlinux (190 ticks/95.00%)
>>     /.../libpthread-2.15.so (6 ticks/3.00%)
>>     /.../libc-2.15.so (2 ticks/1.00%)
>>     /.../qemu-system-x86_64 (1 ticks/0.50%)
>>     [vdso] (tgid:2534 range:0x7fff10588000-0x7fff10589fff) (1 ticks/0.50%)
>>   /.../qemu-system-x86_64 TID 2534 [TID 2534] (154 ticks/12.51%)
>
> This is the main thread (lowest TID of all).

Is this the thread that executes main_loop_wait() and handles I/O emulation?

>>   /ksmd [PID 53] (83 ticks/5.36%)
>>   /.../top [PID 2617] (43 ticks/2.78%)
>>   /.../unity-2d-shell [PID 1807] (41 ticks/2.65%)
>
> Now, going to your questions:
>
>> 2. why is qemu-system-x86_64 only taking 8.10% of the time? I imagine
>> most of the time should be spent in qemu-system-x86_64, as 403.gcc does
>> not do much IO.
>
> This is the same I already answered: the time spent running the guest is in
> the context of process qemu-system-x86_64, but it is not running code from
> the qemu executable.  In fact, 8.10% spent in qemu-system-x86_64 is a lot.
> I would expect much less.
>
>> 3. why is so much time spent in vmlinux? The symbols for vmlinux are
>> listed below.
>> 4. why is so much time spent in kvm-intel and kvm? The symbols for
>> both are listed below.
>
> Again, I think I already answered this.  The time spent in vmx_vcpu_run is
> the actual time spent in the guest (17.45% * 60.13% = 10%, roughly).

I see, so this ~10% is the actually useful GCC work from the perspective of the guest, and the 8.10% is the time spent in qemu-system-x86_64's own code (the QEMU executable).

> Everything else is overhead introduced by either virtualization or
> profiling.
>
> What SPEC scores are you getting from bare-metal, KVM without oprofile and
> KVM with oprofile?  Is the profile substantially different if you use perf
> instead of oprofile?
>
> Can you run "scripts/kvm/kvm_stat" (from the QEMU tree) while your guest is
> running (under oprofile) and paste the output from that tool?

The data below is from a new 403.gcc run on a Haswell Linux machine with the ref input set. The profile was taken over the entire GCC run.

Host Linux kernel (compiled myself):
  Linux aristotle 3.13.1+ #3 SMP Mon Feb 3 11:16:00 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

Command used to run KVM:
  kvm ~/disks/ubuntu-natty.qcow2 -nographic

Commands used for oprofile:
  sudo operf -s --separate-thread --event=CPU_CLK_UNHALTED:5000000 --vmlinux=/home/xtong/xtong-kernel/vmlinux
  sudo opreport --image-path /home/xtong/xtong-kernel -l -g -d --xml -o current.opm

Profiled KVM run (lower is better):           1005 seconds
Non-profiled KVM run (lower is better):        998 seconds
Non-profiled run on the host (lower is better): 686 seconds

(The virtualization overhead is still quite large in spite of the hardware extensions!)
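As a sanity check on how I read the per-thread numbers: my mental model is that the CPU thread you identified above spends essentially all its time in a loop around the KVM_RUN ioctl, roughly like the sketch below. This is only an illustration against the raw /dev/kvm API, not QEMU's actual kvm_cpu_exec(); VM/vCPU creation, guest memory setup and register state are omitted, and the exit handling shown is hypothetical.

/* Minimal sketch of a vCPU run loop against /dev/kvm (not QEMU code).
 * Assumes kvm_fd is an open fd for /dev/kvm and vcpu_fd was obtained
 * via KVM_CREATE_VM + KVM_CREATE_VCPU; all of that setup is omitted. */
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

static void vcpu_loop(int kvm_fd, int vcpu_fd)
{
    /* The kernel shares a kvm_run structure with user space; it
     * describes why the guest exited back to the host. */
    int map_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu_fd, 0);

    for (;;) {
        /* Enter the guest; returns on the next VM exit that the
         * kernel cannot handle by itself. */
        ioctl(vcpu_fd, KVM_RUN, 0);

        switch (run->exit_reason) {
        case KVM_EXIT_IO:       /* port I/O: emulate the device in user space */
        case KVM_EXIT_MMIO:     /* memory-mapped I/O: same */
            break;
        case KVM_EXIT_HLT:      /* guest executed HLT */
            return;
        default:
            fprintf(stderr, "unhandled exit reason %d\n", run->exit_reason);
            return;
        }
    }
}

If that picture is wrong, please correct me, since it affects how I interpret the ioctl/SyS_ioctl and vmlinux samples below.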
Running "sudo scripts/kvm/kvm_stat" gave:

kvm statistics

 kvm_exit                  105    16
 kvm_entry                 105    16
 kvm_apic                   70    10
 kvm_emulate_insn           35     5
 kvm_eoi                    35     5
 kvm_mmio                   35     5
 kvm_inj_virq               35     5
 kvm_apic_accept_irq        35     5

Overall:

Core0 (53973 ticks/100.00%)
  /.../qemu-system-x86_64 TID 8861 [TID 8861] (38676 ticks/71.66%)
      /.../vmlinux (17426 ticks/45.06%)
      /kvm (7768 ticks/20.08%)
      /kvm_intel (6882 ticks/17.79%)
      /.../qemu-system-x86_64 (3535 ticks/9.14%)
      /.../libc-2.15.so (2384 ticks/6.16%)
      /.../libpthread-2.15.so (537 ticks/1.39%)
      /.../librt-2.15.so (82 ticks/0.21%)
      /.../libglib-2.0.so.0.3200.4 (41 ticks/0.11%)
      [vdso] (tgid:8858 range:0x7fffe31fe000-0x7fffe31fffff) (16 ticks/0.04%)
      /libahci (4 ticks/0.01%)
      /r8169 (1 ticks/0.00%)
  /.../qemu-system-x86_64 TID 8858 [TID 8858] (6745 ticks/12.50%)
  /.../qemu-system-x86_64 TID 8880 [TID 8880] (2945 ticks/5.46%)
  /.../qemu-system-x86_64 TID 8882 [TID 8882] (2835 ticks/5.25%)
  /.../qemu-system-x86_64 TID 8860 [TID 8860] (2772 ticks/5.14%)

vmlinux (45.06%)

CPU_CLK_UNHALTED      %      Symbol/Functions
4583                26.30    native_write_msr_safe
2067                11.86    native_read_msr_safe
1148                 6.59    __srcu_read_lock
 852                 4.89    __srcu_read_unlock
 758                 4.35    native_load_tr_desc
 527                 3.02    native_load_gdt
 492                 2.82    fget_light
 313                 1.80    __set_current_blocked
 270                 1.55    _raw_spin_lock_irq
 198                 1.14    recalc_sigpending
 194                 1.11    system_call
 174                 1.00    memcpy
 164                 0.94    rcu_note_context_switch
 164                 0.94    fput
 160                 0.92    user_return_notifier_unregister
 158                 0.91    SyS_ioctl

kvm (20.08%)

CPU_CLK_UNHALTED      %      Symbol/Functions
1949                25.09    kvm_arch_vcpu_ioctl_run
 421                 5.42    kvm_set_shared_msr
 394                 5.07    paging64_walk_addr_generic
 271                 3.49    kvm_on_user_return
 256                 3.30    x86_emulate_instruction
 249                 3.21    x86_emulate_insn
 221                 2.85    em_in
 212                 2.73    x86_decode_insn
 156                 2.01    init_emulate_ctxt
 145                 1.87    emulator_pio_in_out.isra.104

kvm-intel (17.79%)

CPU_CLK_UNHALTED      %      Symbol/Functions
4006                58.21    vmx_vcpu_run
 540                 7.85    vmcs_writel
 472                 6.86    vmx_save_host_state
 275                 4.00    __vmx_load_host_state.part.43
 151                 2.19    vmx_handle_exit
 149                 2.17    vmcs_clear_bits
 130                 1.89    clear_atomic_switch_msr
 109                 1.58    vmx_set_interrupt_shadow
 109                 1.58    vmx_handle_external_intr
 102                 1.48    vmx_get_rflags

So 58.21% * 17.79% = 10.3% of the time is spent executing code from 403.gcc (i.e. doing useful work). Is that correct? If so, why do I not see a ~10x performance degradation? Also, a lot of time is spent in native_write_msr_safe and native_read_msr_safe in vmlinux; what are these for?
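To spell out the arithmetic behind that question (assuming the module and symbol percentages can simply be multiplied, which may itself be naive):

  guest-code share of samples:   0.5821 * 0.1779   ~= 0.10  (the ~10% above)
  naive expected slowdown:       1 / 0.10          ~= 10x
  observed slowdown:             998 s / 686 s     ~= 1.45x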
Thanks,
Xin

> Thanks,
>
> Paolo
>
>>
>>   /.../vmlinux (395 ticks/45.04%):
>>   CPU_CLK_UNHALTED      %      Symbol/Functions
>>   82                  20.76    native_write_msr_safe
>>   46                  11.65    native_read_msr_safe
>>   25                   6.33    __srcu_read_lock
>>   16                   4.05    native_load_tr_desc
>>   12                   3.04    __srcu_read_unlock
>>    9                   2.28    native_load_gdt
>>    8                   2.03    fget_light
>>    8                   2.03    _raw_spin_lock_irq
>>    7                   1.77    memcpy
>>    6                   1.52    _raw_spin_lock
>>    6                   1.52    __set_current_blocked
>>    5                   1.27    user_return_notifier_unregister
>>    5                   1.27    update_cfs_shares
>>    5                   1.27    page_fault
>>    5                   1.27    bsearch
>>
>>   /kvm (198 ticks/22.58%)
>>   CPU_CLK_UNHALTED      %      Symbol/Functions
>>   42                  21.21    kvm_arch_vcpu_ioctl_run
>>    9                   4.55    kvm_set_shared_msr
>>    7                   3.54    x86_emulate_insn
>>    7                   3.54    paging64_walk_addr_generic
>>    7                   3.54    kvm_on_user_return
>>    7                   3.54    do_insn_fetch
>>    6                   3.03    gfn_to_memslot
>>    6                   3.03    decode_operand
>>    5                   2.53    kvm_io_bus_sort_cmp
>>    5                   2.53    kvm_arch_vcpu_load
>>    5                   2.53    kvm_apic_accept_pic_intr
>>    4                   2.02    kvm_fetch_guest_virt
>>    4                   2.02    kvm_arch_vcpu_runnable
>>    4                   2.02    emulator_read_gpr
>>    4                   2.02    apic_has_pending_timer
>>    3                   1.52    x86_decode_insn
>>
>>   /kvm_intel (153 ticks/17.45%)
>>   CPU_CLK_UNHALTED      %      Symbol/Functions
>>   92                  60.13    vmx_vcpu_run
>>    9                   5.88    vmx_save_host_state
>>    7                   4.58    vmx_handle_external_intr
>>    7                   4.58    vmcs_writel
>>    6                   3.92    vmx_handle_exit
>>    6                   3.92    __vmx_load_host_state.part.43
>>    4                   2.61    clear_atomic_switch_msr
>>    3                   1.96    vmx_vm_has_apicv
>>    3                   1.96    vmx_get_rflags
>>    2                   1.31    vmx_read_guest_seg_selector
>>    2                   1.31    vmcs_clear_bits
>>    2                   1.31    skip_emulated_instruction
>>    1                   0.65    vmx_set_rflags
>>    1                   0.65    vmx_vcpu_load
>>
>> Thank you,
>> Xin
>>
>> On Mon, Feb 3, 2014 at 2:11 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>>>
>>> Il 02/02/2014 16:47, Xin Tong ha scritto:
>>>
>>>> On Sun, Feb 2, 2014 at 5:37 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>>>>>
>>>>> Il 02/02/2014 03:08, Xin Tong ha scritto:
>>>>>>
>>>>>> I am getting very weird profile results by running operf on Linux on
>>>>>> the host and profiling a KVM virtual machine running the DaCapo
>>>>>> eclipse benchmark. I expected a lot of the time to be spent in
>>>>>> qemu-system-x86_64, as the instructions from the eclipse benchmark
>>>>>> would be treated as part of the qemu-system-x86_64 process, but the
>>>>>> results say otherwise. Any suggestions?
>>>>>
>>>>> Most of the time should be spent running the guest.  This is in the
>>>>> context of process qemu-system-x86_64, but it is not running code from
>>>>> the qemu executable.
>>>>
>>>> That is what I was trying to say; you said it better.
>>>>>
>>>>> What likely happens is that when the profiling counter expires, it
>>>>> causes the VM to exit before the profiling interrupt is delivered.  The
>>>>> sample should then be associated with the kvm_intel module.
>>>>
>>>> Could the kvm module not read the counters and inject a counter
>>>> overflow interrupt into the guest?  What are some ways to make the
>>>> profile more accurate?  This is collected on an Intel Haswell host
>>>> machine.
>>>
>>> Yes, it can.  But then you have to run perf/operf in the guest, not on
>>> the host.
>>>
>>> For this to work, you need to specify "-cpu host" on the QEMU command
>>> line.
>>>
>>> Paolo