Hi All,

We are experiencing severe performance problems on a regular basis which require us to destroy and restart the guest OS. What happens is that the load average rises well above 50 and the guest OS becomes quite unresponsive, even though no serious workload is running on the system. Even issuing a clean reboot command takes so long that we have to abort it by simply cutting off the guest instead of waiting around 40 minutes for the clean reboot to complete.

These problems arise on a regular basis, but there is no apparent pattern behind them. They may occur under larger workloads during the daytime, but also at night when there is virtually no load on the system. There are no associated kernel messages, and the host seems to work fine, although we have the impression that rebooting the host after such an incident prolongs the time until the next occurrence of the problem.

The system serves around 50 GB of incoming and 90 GB of outgoing traffic per day (it is a kind of fileserver), with 90% of the traffic occurring during the daytime.

Logging kvm_stat on the host, we obtained the following output during normal operation:

efer_reload                    0       0
exits                 5101302255    1974
fpu_reload             306359390      62
halt_exits             737202541     764
halt_wakeup            327442392      65
host_state_reload     2065997773     912
hypercalls                     0       0
insn_emulation        1702740746     914
insn_emulation_fail            0       0
invlpg                         0       0
io_exits              1352400686     148
irq_exits              736230648      38
irq_injections         881709782     767
irq_window              17402610       0
largepages                326880       0
mmio_exits               2951391       0
mmu_cache_miss           2986088       0
mmu_flooded                    0       0
mmu_pde_zapped                 0       0
mmu_pte_updated                0       0
mmu_pte_write             108123       0
mmu_recycled                   0       0
mmu_shadow_zapped        3178728       0
mmu_unsync                     0       0
nmi_injections                 0       0
nmi_window                     0       0
pf_fixed                84440791       0
pf_guest                       0       0
remote_tlb_flush        37610010       8
request_irq                    0       0
signal_exits                   0       0
tlb_flush                      0       0

About 90 minutes later, while the guest is in a rather unresponsive state, the output looks like this:

efer_reload                    0       0
exits                 5125445200   21349
fpu_reload             307627942     119
halt_exits             741717495     792
halt_wakeup            328747102     108
host_state_reload     2075042930    1330
hypercalls                     0       0
insn_emulation        1711070317    1135
insn_emulation_fail            0       0
invlpg                         0       0
io_exits              1356868798     424
irq_exits              738940729     155
irq_injections         886685967    1012
irq_window              17463827       3
largepages                321488      18
mmio_exits               3062654      90
mmu_cache_miss           3552726    5581
mmu_flooded                    0       0
mmu_pde_zapped                 0       0
mmu_pte_updated                0       0
mmu_pte_write             108123       0
mmu_recycled                   0       0
mmu_shadow_zapped        3781317    5396
mmu_unsync                     0       0
nmi_injections                 0       0
nmi_window                     0       0
pf_fixed                86464743   18627
pf_guest                       0       0
remote_tlb_flush        37881302     137
request_irq                    0       0
signal_exits                   0       0
tlb_flush                      0       0

Our attempts to extract useful information from the logs inside the guest OS were not very successful. Compared to normal operation, we could not find anything unusual except for the huge load average.

We are running the following setup:

Host OS:   RHEL 6.3 (amd64), kernel 2.6.32-279.22, qemu-kvm 0.12.1.2-2.295
Guest OS:  Ubuntu Server 10.04 (amd64), kernel 2.6.32-45
Assigned CPU cores: 7 (we also tested single-CPU pinning, without success)
Assigned memory:    32 GB
All hard drives and network interfaces are paravirtualized using virtio.
The filesystem in use is mainly XFS.

Hardware: IBM BladeServer HS22 with 44 GB memory and 2 quad-core Xeon CPUs (E5506). Only a single guest is running on that machine.
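In case it is useful for reproducing our comparison, the two snapshots above can be diffed with a small script along the following lines. This is only a minimal sketch: the file names are placeholders for wherever the kvm_stat output was saved, and the parser assumes the name / total / rate layout shown above. For our data, mmu_cache_miss, mmu_shadow_zapped and pf_fixed are the counters that clearly stand out.

#!/usr/bin/env python
# Minimal sketch for diffing two saved kvm_stat dumps. The file names below
# are placeholders; the parser expects the "name  total  rate" layout above.

def read_snapshot(path):
    """Return {counter_name: cumulative_total} from a saved kvm_stat dump."""
    counters = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                counters[parts[0]] = int(parts[1])
    return counters

before = read_snapshot("kvm_stat.normal.txt")        # during normal operation
after  = read_snapshot("kvm_stat.unresponsive.txt")  # ~90 minutes later

# Print the counters with the largest absolute growth first; in our case
# mmu_cache_miss, mmu_shadow_zapped and pf_fixed are the ones that jump.
deltas = dict((name, after.get(name, 0) - before.get(name, 0)) for name in before)
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    if delta:
        print("%-20s %12d" % (name, delta))

Feeding periodic dumps (for example from a cron job) through the same parser would also make it easier to correlate the counter spikes with the times when the load average explodes.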
We have experimented with a lot of parameters (including the clocksource, APIC settings, etc.), but none of them seems to have any effect on the problem other than prolonging or shortening the interval until the next catastrophic failure, and even that we cannot tell for sure because of the randomness involved. Any hints on how to approach this problem are welcome, since we are out of ideas over here.

Best regards,
Martin