[User Question] Repeated severe performance problems on guest

Hi All,

We are experiencing severe performance problems on a regular basis
which force us to destroy and restart the guest OS. The load average
rises well above 50 and the guest OS becomes quite unresponsive, even
though no serious workload is running on the system. Even a clean
reboot command takes so long that we end up simply cutting off the
guest rather than waiting the roughly 40 minutes a clean reboot would
take.

These problems arise on a regular basis, but with no apparent pattern:
they may occur under larger workloads during the daytime, but also at
night when there is virtually no load on the system. There are no
associated kernel messages, and the host seems to work fine, although
we have the impression that rebooting the host after such an incident
prolongs the time until the next occurrence of the problem.

The system serves around 50 GB of incoming and 90 GB of outgoing
traffic per day (it is a kind of file server), with about 90% of the
traffic occurring during the daytime.
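
For a rough sense of scale (assuming a daytime window of about 12
hours, which is our estimate rather than a measured figure), the
average throughput this implies is small:

    # Back-of-the-envelope average throughput (assumes ~12 h daytime).
    daytime_mb = 0.9 * (50 + 90) * 1024   # 90% of ~140 GB/day, in MB
    daytime_seconds = 12 * 3600
    print("%.1f MB/s on average" % (daytime_mb / daytime_seconds))  # ~3.0

so raw traffic volume alone should be nowhere near the limits of the
virtio devices.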


Logging kvm_stat on the host (first column: cumulative total; second
column: events during the last sampling interval), we obtained the
following output during normal operation:

efer_reload                    0         0
exits                 5101302255      1974
fpu_reload             306359390        62
halt_exits             737202541       764
halt_wakeup            327442392        65
host_state_reload     2065997773       912
hypercalls                     0         0
insn_emulation        1702740746       914
insn_emulation_fail            0         0
invlpg                         0         0
io_exits              1352400686       148
irq_exits              736230648        38
irq_injections         881709782       767
irq_window              17402610         0
largepages                326880         0
mmio_exits               2951391         0
mmu_cache_miss           2986088         0
mmu_flooded                    0         0
mmu_pde_zapped                 0         0
mmu_pte_updated                0         0
mmu_pte_write             108123         0
mmu_recycled                   0         0
mmu_shadow_zapped        3178728         0
mmu_unsync                     0         0
nmi_injections                 0         0
nmi_window                     0         0
pf_fixed                84440791         0
pf_guest                       0         0
remote_tlb_flush        37610010         8
request_irq                    0         0
signal_exits                   0         0
tlb_flush                      0         0

and about 90 minutes later, when the guest is in its rather
unresponsive state, the output looks like this:

efer_reload                    0         0
exits                 5125445200     21349
fpu_reload             307627942       119
halt_exits             741717495       792
halt_wakeup            328747102       108
host_state_reload     2075042930      1330
hypercalls                     0         0
insn_emulation        1711070317      1135
insn_emulation_fail            0         0
invlpg                         0         0
io_exits              1356868798       424
irq_exits              738940729       155
irq_injections         886685967      1012
irq_window              17463827         3
largepages                321488        18
mmio_exits               3062654        90
mmu_cache_miss           3552726      5581
mmu_flooded                    0         0
mmu_pde_zapped                 0         0
mmu_pte_updated                0         0
mmu_pte_write             108123         0
mmu_recycled                   0         0
mmu_shadow_zapped        3781317      5396
mmu_unsync                     0         0
nmi_injections                 0         0
nmi_window                     0         0
pf_fixed                86464743     18627
pf_guest                       0         0
remote_tlb_flush        37881302       137
request_irq                    0         0
signal_exits                   0         0
tlb_flush                      0         0
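
Comparing the two dumps, the counters that stand out are
mmu_cache_miss (second column jumps from 0 to 5581), mmu_shadow_zapped
(0 to 5396) and pf_fixed (0 to 18627), while tlb_flush stays at zero.
To catch the onset of the degradation, something along the following
lines could sample those counters continuously (the counter files live
under debugfs on our host; the 10 s interval and the counter selection
are arbitrary choices on our part):

    import time

    # Sample the KVM debugfs counters that spike in the second dump.
    # Assumes debugfs is mounted at /sys/kernel/debug, as on our host.
    COUNTERS = ["mmu_cache_miss", "mmu_shadow_zapped", "pf_fixed", "exits"]
    BASE = "/sys/kernel/debug/kvm"

    def read_counters():
        vals = {}
        for name in COUNTERS:
            with open(BASE + "/" + name) as f:
                vals[name] = int(f.read())
        return vals

    prev = read_counters()
    while True:
        time.sleep(10)  # sampling interval
        cur = read_counters()
        deltas = " ".join("%s=+%d" % (n, cur[n] - prev[n])
                          for n in COUNTERS)
        print(time.strftime("%H:%M:%S"), deltas)
        prev = cur

To our non-expert eyes the mmu_* spikes look like shadow-MMU pressure,
which is why we also checked the EPT setting on the host (see the
sketch further below).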


Our attempts to extract useful information from the logs inside the
guest OS were not particularly successful: apart from the huge load
average, we could not find anything unusual compared to normal
operation.


We are running the following setup:

Host OS:
RHEL 6.3 (amd64) Kernel 2.6.32-279.22
qemu-kvm 0.12.1.2-2.295


Guest OS:
Ubuntu Server 10.04 (amd64) Kernel 2.6.32-45
Assigned CPU cores: 7 (we also tested single-CPU pinning, without
success)
Assigned memory: 32 GB
All hard drives / network interfaces paravirtualized using virtio
The filesystem in use is mainly XFS.

Hardware:
IBM BladeCenter HS22 with 44 GB of memory and two quad-core Xeon E5506
CPUs


Only a single guest is running on that machine.
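
Since the mmu_* counters above made us suspicious, one thing worth
verifying on the host is whether hardware-assisted paging (EPT) is
actually enabled; a minimal sketch, using the standard kvm_intel
module parameter and /proc/cpuinfo:

    # Check whether the host CPU supports EPT and kvm_intel has it on.
    def cpu_has_ept():
        with open("/proc/cpuinfo") as f:
            return any(line.startswith("flags") and " ept" in line
                       for line in f)

    def kvm_intel_ept_enabled():
        try:
            with open("/sys/module/kvm_intel/parameters/ept") as f:
                return f.read().strip() in ("Y", "1")
        except IOError:
            return None  # kvm_intel not loaded

    print("CPU supports EPT:", cpu_has_ept())
    print("kvm_intel ept enabled:", kvm_intel_ept_enabled())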


We have experimented with a lot of parameters (including the
clocksource, APIC settings, etc.), but none of them seems to have any
effect on the problem other than prolonging or shortening the interval
until the next catastrophic failure (and even that we cannot tell for
sure, given the randomness involved).

Any hints on how to approach this problem are welcome, since we are
out of ideas over here.

Best regards,

Martin





