Ubuntu 14.04.1 LTS & Qemu 2.0.0 Guest Trouble

Lane Eckley <lane.eckley@xxxxxxxxx> · Fri, 8 Aug 2014 17:35:55 -0400

Hi Guys,

I have recently deployed a new hypervisor with the intent of using it
to host both Linux & Windows virtual machines. However, after getting
everything set up I am running into an issue where it appears the
virtual machines are "freezing" or "stuttering" for a few seconds at
random intervals.

---

2x Intel Xeon E5-2620 V2
128GB of RAM
8x 480GB Intel 530 SSDs (RAID 10)
LSI 9271-8i
2x 1Gbit NICs (on-motherboard) - bonded
Supermicro Motherboard - X9DRI-F
Ubuntu 14.04.1
QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.2)
libvirtd (libvirt) 1.2.2
Dnsmasq version 2.68 (DHCP Server)

---

free -m:

             total       used       free     shared    buffers     cached
Mem:        128910      18413     110496          2        134       6349
-/+ buffers/cache:      11929     116980
Swap:        61034          0      61034

top -c:

top - 18:57:53 up  4:07,  1 user,  load average: 5.85, 5.30, 5.28
Tasks: 372 total,   2 running, 370 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.3 us,  3.5 sy,  0.0 ni, 92.7 id,  0.5 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:  13200402+total, 18855240 used, 11314878+free,   137560 buffers
KiB Swap: 62499836 total,        0 used, 62499836 free.  6502164 cached Mem

---

Known Affected Guest OSes: CentOS 6.5, Windows Server 2012 R2

---

The issue & troubleshooting I have completed:

After a random period of time I begin to experience bursts of high
latency and packet loss to the guest operating system. Connecting to
the VNC console of the affected virtual machine, I have confirmed that
while a burst is occurring the VNC output "freezes" until the burst
passes. Once the burst passes the machine acts like nothing happened,
and from what I can tell it isn't even aware that it froze or that
time passed during the event.

Example ping to the guest during a burst of latency and packet loss:

64 bytes from x.x.x.x: icmp_seq=6285 ttl=48 time=54.956 ms
64 bytes from x.x.x.x: icmp_seq=6286 ttl=48 time=54.765 ms
64 bytes from x.x.x.x: icmp_seq=6287 ttl=48 time=54.725 ms
64 bytes from x.x.x.x: icmp_seq=6288 ttl=48 time=5091.305 ms
64 bytes from x.x.x.x: icmp_seq=6290 ttl=48 time=3090.609 ms
64 bytes from x.x.x.x: icmp_seq=6289 ttl=48 time=4091.357 ms
64 bytes from x.x.x.x: icmp_seq=6291 ttl=48 time=2090.073 ms
64 bytes from x.x.x.x: icmp_seq=6292 ttl=48 time=1088.983 ms
64 bytes from x.x.x.x: icmp_seq=6293 ttl=48 time=88.455 ms
64 bytes from x.x.x.x: icmp_seq=6294 ttl=48 time=52.370 ms
64 bytes from x.x.x.x: icmp_seq=6295 ttl=48 time=52.087 ms
64 bytes from x.x.x.x: icmp_seq=6296 ttl=48 time=54.872 ms
64 bytes from x.x.x.x: icmp_seq=6297 ttl=48 time=52.708 ms

Example outbound ping from the guest during the same interval:

64 bytes from x.x.x.x: icmp_seq=6261 ttl=48 time=53.488 ms
64 bytes from x.x.x.x: icmp_seq=6262 ttl=48 time=50.878 ms
64 bytes from x.x.x.x: icmp_seq=6263 ttl=48 time=52.926 ms
64 bytes from x.x.x.x: icmp_seq=6264 ttl=48 time=51.401 ms
64 bytes from x.x.x.x: icmp_seq=6265 ttl=48 time=54.259 ms
64 bytes from x.x.x.x: icmp_seq=6266 ttl=48 time=52.404 ms
64 bytes from x.x.x.x: icmp_seq=6267 ttl=48 time=55.412 ms
64 bytes from x.x.x.x: icmp_seq=6268 ttl=48 time=69.590 ms
64 bytes from x.x.x.x: icmp_seq=6269 ttl=48 time=54.899 ms
64 bytes from x.x.x.x: icmp_seq=6270 ttl=48 time=53.875 ms
64 bytes from x.x.x.x: icmp_seq=6271 ttl=48 time=52.909 ms
64 bytes from x.x.x.x: icmp_seq=6272 ttl=48 time=53.257 ms
64 bytes from x.x.x.x: icmp_seq=6273 ttl=48 time=53.671 ms

As you can see from the above examples, the guest never sees packet
loss outbound during the event, but the inbound ping is erratic to say
the least.
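
To make the inbound sample easier to read, here is a rough sketch that
works out when each reply came back relative to the start of the ping
run. It assumes ping was sending at its default one-second interval
(an assumption, the interval isn't stated above), so packet N is
treated as sent at t = N seconds. The names SAMPLE, LINE and
reply_arrivals are just illustrative.

```python
import re

# Inbound ping lines copied from the sample above (burst portion only).
SAMPLE = """\
64 bytes from x.x.x.x: icmp_seq=6287 ttl=48 time=54.725 ms
64 bytes from x.x.x.x: icmp_seq=6288 ttl=48 time=5091.305 ms
64 bytes from x.x.x.x: icmp_seq=6290 ttl=48 time=3090.609 ms
64 bytes from x.x.x.x: icmp_seq=6289 ttl=48 time=4091.357 ms
64 bytes from x.x.x.x: icmp_seq=6291 ttl=48 time=2090.073 ms
64 bytes from x.x.x.x: icmp_seq=6292 ttl=48 time=1088.983 ms
64 bytes from x.x.x.x: icmp_seq=6293 ttl=48 time=88.455 ms
64 bytes from x.x.x.x: icmp_seq=6294 ttl=48 time=52.370 ms
"""

LINE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def reply_arrivals(text, interval_s=1.0):
    """Return (seq, rtt_ms, arrival_s) for each reply line in text.

    arrival_s is seconds since packet 0 was sent, ASSUMING one packet
    is sent every interval_s seconds.
    """
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            seq = int(m.group(1))
            rtt_ms = float(m.group(2))
            rows.append((seq, rtt_ms, seq * interval_s + rtt_ms / 1000.0))
    return rows


for seq, rtt_ms, arrival in reply_arrivals(SAMPLE):
    print(f"seq={seq} rtt={rtt_ms:9.3f} ms  reply arrived at t={arrival:.3f} s")
```

Under that assumption the replies for icmp_seq 6288 through 6293 all
come back within a few milliseconds of each other (around t=6293.09
s), even though they were sent a second apart. That would be
consistent with the guest stalling for roughly five seconds and then
answering everything that had queued up at once, rather than with
packets being dropped on the wire.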

During the bursts of packet loss and high latency the ping to the
hypervisor's own IP is perfect; it doesn't even slightly hiccup. I
am able to keep an SSH connection open throughout the entire event,
and when viewing something like "top" I see a constant stream of
updates - in other words, the hypervisor never experiences an issue
from what I can see/tell.

During the testing I also set up a Windows Server 2012 R2 guest and
connected to its VNC console. I opened Task Manager so that when the
issue began I could see whether the graphs "lurch" forward or whether
they just stop & start again.

When the event occurred (it took several hours of waiting) I brought
up the VNC connection for the Windows guest VM and watched the Task
Manager graphs. Each time there was a burst of packet loss and high
latency I saw the same behaviour as above: the VNC output would
freeze and I couldn't send input to it either. The VNC connection
remains connected; it never times out.

After each event the Task Manager graphs would pick up right where
they left off like nothing ever happened. There isn't a "lurch" or
"jump" forward like you would expect if you had simply lost
connection to the guest; when the guest resumes output via VNC it's
as if the time never passed.

The only thing I noted was that when the output resumes there is a
sudden spike to 100% CPU inside the guest.

Once the bursts begin they typically continue, worsening and
lessening, until I apply one of the following temporary resolutions.
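
To gather harder evidence on whether a guest notices the lost time, a
small heartbeat logger could be run inside the guest. The sketch below
is only an idea, not something I have run on these guests: collect()
records the wall clock once per interval, and find_gaps() reports any
pair of samples that are further apart than expected. Both function
names and the 0.5 second tolerance are arbitrary choices of mine.

```python
import time


def find_gaps(timestamps, interval_s, tolerance_s=0.5):
    """Return (start, end, missing_s) for each pair of consecutive
    samples that are more than interval_s + tolerance_s apart."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval_s
        if extra > tolerance_s:
            gaps.append((prev, cur, extra))
    return gaps


def collect(samples, interval_s=1.0, clock=time.time, sleep=time.sleep):
    """Record `samples` clock readings, sleeping interval_s after each.

    clock and sleep are parameters only so the function can be
    exercised with fake values instead of real waiting.
    """
    stamps = []
    for _ in range(samples):
        stamps.append(clock())
        sleep(interval_s)
    return stamps
```

For example, inside the guest: stamps = collect(3600) followed by
print(find_gaps(stamps, 1.0)). Whether a gap is recorded depends on
how the guest clock behaves during the stall, so I would treat the
result only as one more data point to compare against the ping
timestamps seen from outside.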

---

Temporary Resolution:

1) Restart the guest via libvirt or virsh - Once the guest boots back
up the issue is resolved for a length of time (random length, could be
5 minutes, could be 8 hours) and then the issue returns with symptoms
identical to the above.

2) Restart the entire hypervisor - Same as restarting an individual
guest, the issue is resolved for a time, but eventually returns.

---

Notes

- The issue occurs on both Ubuntu 12.04.4 LTS w/Qemu 1.0.0 and on
Ubuntu 14.04.1 LTS w/Qemu 2.0.0
- The issue affects both Windows & Linux guest operating systems
- VirtIO is the default driver for all guests, however I have
confirmed it also affects IDE, Realtek & E1000 when set for the guest
VM.
- When the bursts of packet loss and high latency begin they do NOT
affect all the guest machines simultaneously. While the issue
eventually affects every guest, it doesn't affect them all at the same
time. One guest will be having issues while 5 others are smooth as
glass. It's almost as if each guest has its own timer for when the
event will begin.

---

Any insight into what I may be doing wrong or how to fix the above
would be greatly appreciated!

Thanks!