[ kvm-Bugs-1802082 ] Networking Dies Under Heavy Load

"SourceForge.net" <noreply@xxxxxxxxxxxxxxx> · Fri, 11 Jun 2010 13:58:35 +0000

Bugs item #1802082, was opened at 2007-09-25 09:44
Message generated for change (Comment added) made by esila
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1802082&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Eckie Silapaswang (esila)
Assigned to: Nobody/Anonymous (nobody)
Summary: Networking Dies Under Heavy Load

Initial Comment:
Running a stress test of kvm using an EnGarde Secure Linux 1.5 guest OS.  Under a heavy network email load, the guest OS networking gets knocked out - unable to ping, ssh, etc.  Can only get things started again by going into vncviewer and restarting the networking services from there.

CPUs: 8 x Intel(R) Xeon(R) CPU E5335  @ 2.00GHz
KVM 33-4
Host Kernel: 2.6.23-rc3
Kernel Arch: x86_64
Guest OS: EnGarde Secure Linux 32bit i686,  2.4.31-1.5.60

Command Line:
/usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1 &

Cannot boot guest with the -no-kvm switch.

Can provide remote access to the guest OS if needed for debugging purposes.  Any help appreciated.

Best,
Eckie

----------------------------------------------------------------------

>Comment By: Eckie Silapaswang (esila)
Date: 2010-06-11 08:58

Message:
jessorensen,

confirmed that with recent QEMU/KVM there are no networking issues - you
may close this out, thank you!

----------------------------------------------------------------------

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-11 04:19

Message:
Hi,

Could you please let us know if this is still a problem with recent
QEMU/KVM? If not, let's close this bug.

Thanks,
Jes

----------------------------------------------------------------------

Comment By: Arne Kepp (arneke)
Date: 2008-02-24 23:14

Message:
Logged In: YES 
user_id=822860
Originator: NO

Just adding what I wrote on the -devel list:

I saw this problem using the rtl8139 driver on 2.6.18-53.1.13.el5 (Red Hat
kernel, CentOS).

ethtool -S eth0  on the guest says, when locked up:
NIC statistics:
  early_rx: 0
  tx_buf_mapped: 0
  tx_timeouts: 4
  rx_lost_in_ring: 0

Adding noapic appears to resolve the issue.

The host system is running KVM-61 on the same kernel as the guest; it's a
quad core x86_64.

----------------------------------------------------------------------

Comment By: Darrin Eden (darrineden)
Date: 2007-12-22 03:17

Message:
Logged In: YES 
user_id=1964687
Originator: NO

Izik, 
  Switching to rtl8139 appears to have corrected the problem.
  Thank you.

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-12-21 14:55

Message:
Logged In: YES 
user_id=1851802
Originator: NO

I think I remember someone reporting that switching to rtl8139 solved it
for him.
Can you please try?
(qemu command: qemu-system-x86_64 -m 1536 -smp 4 -net nic,model=rtl8139,macaddr=52:54:00:12:35:24 -net tap -vnc localhost:0 -daemonize ubuntu710.qcow2)

----------------------------------------------------------------------

Comment By: Darrin Eden (darrineden)
Date: 2007-12-21 14:50

Message:
Logged In: YES 
user_id=1964687
Originator: NO

Hi,

I believe I'm experiencing a similar condition.

- cpu model: Intel Xeon E5345
- kvm version: 56
- host kernel: 2.6.23.9
- kernel arch: x86_64
- guest: ubuntu-7.10-server-amd64, 2.6.22-14
- qemu command: qemu-system-x86_64 -m 1536 -smp 4 -net nic,macaddr=52:54:00:12:35:24 -net tap -vnc localhost:0 -daemonize ubuntu710.qcow2

symptom: I have a couple of systems configured similarly and each exhibits
this condition to a varying degree. Guest networking simply stops, seemingly
dependent on load. Nothing of interest is recorded by the host or the guest
at that point. The 'work around' is stopping and starting the network
interface on the guest via VNC. I don't have any hard data, but my
perception is that 1) the more guests running the higher the failure
frequency and 2) guests seem to fail in groups. For instance three of eight
guests will cease to network simultaneously. The remainder stay networked.
I haven't been able to discover any pattern to the grouping although I have
a relatively small sample size at this point. Another perception I have is
that failures occur more frequently with lots of smaller connections
instead of large amounts of throughput. Again, no real data to back this
observation.

Thanks for a wonderfully designed system in any case! I'm absolutely
thrilled with every other aspect of kvm.

Sincerely,
- Darrin Eden

----------------------------------------------------------------------

Comment By: Fabian Deutsch (fabiandeutsch)
Date: 2007-12-18 14:10

Message:
Logged In: YES 
user_id=353204
Originator: NO

Hey.

I also ran into this bug with the following setup.
o Setup:
Host:
- Intel(R) Xeon(R) CPU           X3210  @ 2.13GHz
- Mem:          4051
- F7 (all updates)
- kvm-54-341-gefdeac0

Guest:
- F8
- 2 realtek nics
- Samba share sharing a mountpoint, mounted on an iSCSI session/disk.
- A client copies one file (30GB) onto the share, which is mapped to an
iscsi (so much traffic going in and out).

o Symptom:
- The network dies after about 2x30GB of heavy network load.
- The 1st and 3rd fields in the guest's /proc/net/softnet_stat keep increasing.
- The network works again after running "service network restart".
- The host sees network traffic on the tun interface; the guest doesn't.
- It seems as if the network dies during the burst period (30GB in 6hrs,
not much traffic during the other 18hrs).

- A different guest's network (also F8), which also transfers a lot of data
but in a continuous stream rather than in bursts, doesn't die.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-21 11:26

Message:
Logged In: YES 
user_id=1898498
Originator: YES

izike and technologov,

Thanks so much for the attempts and effort thus far, much appreciated!

To stress the instance we have a perl script which connects to the SMTP
socket, making the connection, and printing to the socket to deliver an
email.  This script allows us to hammer up to X emails / second (we've been
using 10).  The load is generated with a combination of this sending and
the guest OS running amavis / spamassassin so it must check every mail that
goes through.  The hammering was sustained over a period of 3 hours before
the connection was lost between the host and guest OS.
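
For anyone trying to reproduce this without the original Perl script, the
sending side described above can be approximated roughly as follows. This is
a Python sketch, not the script actually used: the function name `send_load`,
its parameters, and the addresses are hypothetical, and it opens one SMTP
connection per message at a fixed rate, which may differ from how the Perl
script paced its deliveries.

```python
import smtplib
import time


def send_load(host, rate_per_sec, total, sender, recipient,
              port=25, smtp_factory=smtplib.SMTP, sleep=time.sleep):
    """Deliver `total` messages at roughly `rate_per_sec`, one connection each.

    `smtp_factory` and `sleep` are injectable so the pacing logic can be
    exercised without a real mail server.
    """
    interval = 1.0 / rate_per_sec
    sent = 0
    for i in range(total):
        body = (
            f"From: {sender}\r\n"
            f"To: {recipient}\r\n"
            f"Subject: load test {i}\r\n"
            "\r\n"
            f"test message {i}\r\n"
        )
        conn = smtp_factory(host, port)   # new connection per message
        try:
            conn.sendmail(sender, [recipient], body)
            sent += 1
        finally:
            conn.quit()
        sleep(interval)                   # crude rate limiting
    return sent
```

Calling `send_load("guest-address", 10, 108000, ...)` would correspond to
about 10 mails per second sustained for 3 hours; the guest address and
mailbox names have to be filled in for the setup being tested.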

As stated before, I can open up our firewalls to allow you access to the
systems in question and let you see what is going on - maybe you'll spot
something in the configuration that we're not seeing.  Let me know if this
is a viable option and what information you'll need from me to follow
through with this.

Best regards and have a Happy T-Giving to all!

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-11-21 03:11

Message:
Logged In: YES 
user_id=1851802
Originator: NO

esila,
I have tried very hard to make it die on my machine and couldn't get it to
die.
Do you have any ideas what we can do?

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-19 04:49

Message:
Logged In: YES 
user_id=1839746
Originator: NO

esila: Could you explain a bit more about the stress tests you have done?

I have downloaded EnGarde Secure Linux 32bit i686 v3.

Which commands will stress-test it?

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-18 06:14

Message:
Logged In: YES 
user_id=1839746
Originator: NO

Tested on F7/x64, Intel CPU, KVM-52.

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-18 06:12

Message:
Logged In: YES 
user_id=1839746
Originator: NO

I have setup similar configuration, but with F7/x64 host and SUSE 10.3/32
guest, and I have transmitted several gigs of data. No, it just doesn't
crash.

I don't have EnGarde. I'm downloading it.

Bug unreproducible.

-Alexey

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-12 12:20

Message:
Logged In: YES 
user_id=1898498
Originator: YES

Hi izike,

Most certainly!  Thanks for the reply!

Host Side:
Host OS: EnGarde Secure Linux 3.0.17
CPUs: 8 x Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
Latest upstream KVM from git
Host Kernel: 2.6.23-rc3

Guest OS: EnGarde Secure Linux 32bit i686, 2.4.31-1.5.60

Command Line:
/usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1 &
Kernel Arch: x86_64

To produce the load I've configured a send script that connects to the
guest OS over port 25 and sends X amount of messages per second.  In this
case I've been hammering the system with about 10/sec.

If you need any more information, please let me know, as I can provide you
access to this particular host / guest OS if need be.

Thanks!

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-11-12 02:36

Message:
Logged In: YES 
user_id=1851802
Originator: NO

esila,
can you please give me the exact configuration you have:
1. on the host side
2. on the guest side
3. how can i make such heavy load as you describe?

i tried to kill the network on my machine and i was not able,
so if you can please provide as much information as you can so we can fix
it
thanks.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-08 08:58

Message:
Logged In: YES 
user_id=1898498
Originator: YES

Used latest upstream of KVM as of 11/6 and networking died after 2 and a
half hours of heavy load.  Will be repeating with the -no-kvm-irqchip flag
and dsahern's 'noapic' kernel switch later on and updating everyone.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-02 16:39

Message:
Logged In: YES 
user_id=1898498
Originator: YES

Used the latest upstream of KVM and tried dsahern's 'noapic' switch to the
kernel options.  Networking died after a 3 hour period of intensive load on
it.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-25 12:11

Message:
Logged In: YES 
user_id=1755596
Originator: NO

In my case, I found that a workaround is adding 'noapic' to the guest
kernel options. It ran fine for an hour with a moderately heavy, continuous
load on it. 

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-25 10:35

Message:
Logged In: YES 
user_id=1898498
Originator: YES

Just adding that I will be able to provide access and open up our
firewalls to the particular instance in question if this will aid in
resolving this bug.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-23 12:22

Message:
Logged In: YES 
user_id=1755596
Originator: NO

I am still experiencing the problem with kvm-48 and both nic models --
rtl8139 and ne2k_pci.

Host this time is a dual cpu, dual core PowerEdge 2950 running RHEL5.

Guest is running RHEL4U4. qemu command (for rtl8139 nic):

qemu-system-x86_64 -boot c -localtime -hda /opt/kvm/images/cucm.vmdk -m
1536 -smp 2 -serial file:/tmp/serial.log -net
nic,macaddr=00:1a:4b:34:74:52,model=rtl8139 -net
tap,ifname=tap0,script=/bin/true -vnc :2 -monitor stdio

I have tried with and without kqemu. In my case rtl8139 quits working
fairly quickly; ne2k_pci takes much longer. 

Again looking at the softnet_stats I see the time_squeeze (column 3) and
cpu_collision (column 9) counters incrementing from startup. Here's an
example after 24 minutes of uptime and roughly 5 minutes of network
traffic:

[root@vm-cucm ~]# cat /proc/net/softnet_stat 
000be1b5 00000b8b 0000075d 00000006 00000000 00000000 00000000 00000000 005049d2
00021220 00000000 000000f9 00000000 00000000 00000000 00000000 00000000 006820c3
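
For readers who want to track these counters over time, the hexadecimal
fields can be decoded with a few lines of Python. This is only a sketch:
the function name `parse_softnet_stat` is made up for illustration, the
column positions for time_squeeze (3) and cpu_collision (9) follow the
description above, and the names given to columns 1 and 2 (processed,
dropped) are my reading of the kernel's softnet statistics rather than
something stated in this report. It expects one row per CPU with all
fields on a single line, as the kernel prints them.

```python
def parse_softnet_stat(text):
    """Decode /proc/net/softnet_stat text into one dict per CPU row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        values = [int(f, 16) for f in fields]  # every field is hexadecimal
        rows.append({
            "processed": values[0],        # column 1 (assumed name)
            "dropped": values[1],          # column 2 (assumed name)
            "time_squeeze": values[2],     # column 3
            "cpu_collision": values[8] if len(values) > 8 else None,  # column 9
        })
    return rows


# Sample taken from the output quoted above, one row per vcpu.
SAMPLE = (
    "000be1b5 00000b8b 0000075d 00000006 00000000 00000000 00000000 00000000 005049d2\n"
    "00021220 00000000 000000f9 00000000 00000000 00000000 00000000 00000000 006820c3\n"
)

if __name__ == "__main__":
    for cpu, row in enumerate(parse_softnet_stat(SAMPLE)):
        print(cpu, row)
```

Running the parser twice a few seconds apart on the live file and
subtracting the time_squeeze values would show whether the counter is
still climbing during a lockup.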

IRQ cpu % is fairly high. For example with the ne2k nic it is in the 5-6%
range during "light" traffic (VOIP endpoint registration) and in the 12-19%
range during my top end load (VOIP traffic). This irq CPU load is *much*
higher than both xen and vmware.

From the host side I do not notice any change in CPU consumption by the
qemu process during the network lockup.

Not sure if this matters, but I have not been able to start any of my
images (rhel3u8, rhel4u4, or rhel5) with -smp set to 4; 1 or 2 vcpus seem
to be the only options.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-23 07:03

Message:
Logged In: YES 
user_id=1915499
Originator: NO

Yes, my test (FTP) works with KVM-48 and nic rtl8139, and this model is
faster than the ne2000.

ne2000:   5.5 Mbits/s
rtl8139:  9.6 Mbits/s

I will do some new tests to stress the network.

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-10-22 02:49

Message:
Logged In: YES 
user_id=539971
Originator: NO

There are reports on the list that model=rtl8139 doesn't have this
problem.  So this may be an issue with ne2k emulation.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-21 14:34

Message:
Logged In: YES 
user_id=1915499
Originator: NO

I have done the tests on AMD and Intel and with KVM-48, and it's the same
problem.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-19 10:49

Message:
Logged In: YES 
user_id=1755596
Originator: NO

If someone familiar with the kvm code can point me in a direction
(files/functions) I'd be happy to help track this down.
david

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-19 08:43

Message:
Logged In: YES 
user_id=1898498
Originator: YES

A quick update on recent test runs:

> - same, plus the -no-kvm-irqchip flag
Networking still went down under heavy load.

> - same, plus the -no-kvm flag (should work now)
This worked, but there was obvious drop in performance.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-18 11:38

Message:
Logged In: YES 
user_id=1755596
Originator: NO

I ran into this problem as well using kvm-46. Host OS is RHEL5. guest OS
is RHEL4. Problem is repeatable with ne2k_pci nic model as well as rtl8139.
I also tried limiting to 1 cpu.

qemu command line is:
qemu-system-x86_64 -boot c -localtime -hda /opt/kvm/images/my.vmdk -m 1536
-smp 2 -serial file:/tmp/serial.log -net
nic,macaddr=00:1a:4b:34:74:52,model=ne2k_pci -net
tap,ifname=tap0,script=/bin/true -vnc :2 -monitor stdio

Looking at /proc/net/softnet_stat of the guest OS the third column is
increasing when the network lockup happens which means packets are getting
dropped due to time squeeze. I believe that means the receive softirq is
taking too long which in turn means the communication with the device is
taking too long.

Taking the interface down and back up clears the problem until it reaches
some packet threshold again.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-18 03:00

Message:
Logged In: YES 
user_id=1915499
Originator: NO

I have done several tests, with the same host and guest; I use Slackware
11 and kernel 2.6.20. The host CPU is an AMD 3800 64 X2 with 1Gb of RAM.
The test runs in the guest: it fetches a 4Gb file from the host over FTP
in binary mode, looping on the get command.

The network has died under heavy load since kvm-36. In kvm-35 all is good.

kvm 35: OK
kvm 36: Only one get succeeds
kvm 37-39: Doesn't compile or module doesn't load
kvm 40-46: Network crash
With the -no-kvm switch: kvm-46 is OK

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-17 11:31

Message:
Logged In: YES 
user_id=1915499
Originator: NO

I have the same problem with kvm-46 when I use FTP with big files. With
-no-kvm flag all is good. There is the same problem with the module kqemu.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-16 16:46

Message:
Logged In: YES 
user_id=1898498
Originator: YES

> Does this occur:
> - when using the modules provided by kvm-46 (with kvm-46 userspace)

We were using a build from your git repository:

  http://git.kernel.org/?p=linux/kernel/git/avi/kvm.git;a=summary

  Not from the kvm-46 tarball.

> - same, plus the -no-kvm-irqchip flag
> - same, plus the -no-kvm flag (should work now)

  These would only apply to userspace.

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-10-16 09:41

Message:
Logged In: YES 
user_id=539971
Originator: NO

Does this occur:
- when using the modules provided by kvm-46 (with kvm-46 userspace)
- same, plus the -no-kvm-irqchip flag
- same, plus the -no-kvm flag (should work now)

This will show whether the problem is in the kernel or userspace.  If it's
a kernel issue, we will supply a backport to 2.6.23.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-15 12:06

Message:
Logged In: YES 
user_id=1898498
Originator: YES

Just wanted to follow up: upgraded to the latest kvm and kernel (2.6.23)
and the issue still exists.  The host OS is EnGarde Secure Linux
3.0.17 - can provide testing environment if needed - all other information
is included in the previous bug entry.

----------------------------------------------------------------------

Comment By: SourceForge Robot (sf-robot)
Date: 2007-10-09 21:20

Message:
Logged In: YES 
user_id=1312539
Originator: NO

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

----------------------------------------------------------------------

Comment By: Matt Piermarini (mpiermar)
Date: 2007-09-28 14:22

Message:
Logged In: YES 
user_id=544440
Originator: NO

Just an FYI, I also see this error. I'm running KVM-44 with kernel modules
from kernel-2.6.23-rc8.  It happens when my Guest (RHEL5) is reading files
from a NFS mount across the local LAN.  Run command:

-hda Disk1.img -hdb Disk2.qcow2 -boot c -net nic,vlan=0 -net
tap,vlan=0,ifname=tap1,script=/etc/qemu-ifup -m 1024 -localtime
-no-kvm-irqchip

It does not happen on demand, but I'll try to isolate switches/kernel
modules.  If I find anything, I'll post back.

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-09-25 09:47

Message:
Logged In: YES 
user_id=539971
Originator: NO

Please repeat with the latest kvm.  The bug may have been already fixed.

----------------------------------------------------------------------
