Bugs item #1802082, was opened at 2007-09-25 16:44
Message generated for change (Comment added) made by jessorensen
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1802082&group_id=180599

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Eckie Silapaswang (esila)
Assigned to: Nobody/Anonymous (nobody)
Summary: Networking Dies Under Heavy Load

Initial Comment:
Running a stress test of kvm using an EnGarde Secure Linux 1.5 guest OS. Under a heavy network email load, the guest OS networking gets knocked out - unable to ping, ssh, etc. Can only get things started again by going into vncviewer and restarting the networking services from there.

CPUs: 8 x Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
KVM 33-4
Host Kernel: 2.6.23-rc3
Kernel Arch: x86_64
Guest OS: EnGarde Secure Linux 32bit i686, 2.4.31-1.5.60
Command Line: /usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1 &

Cannot boot guest with the -no-kvm switch.

Can provide remote access to the guest OS if needed for debugging purposes.

Any help appreciated.

Best,
Eckie

----------------------------------------------------------------------

>Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-11 11:19

Message:
Hi,

Could you please let us know if this is still a problem with recent QEMU/KVM? If not, let's close this bug.

Thanks,
Jes

----------------------------------------------------------------------

Comment By: Arne Kepp (arneke)
Date: 2008-02-25 05:14

Message:
Logged In: YES
user_id=822860
Originator: NO

Just adding what I wrote on the -devel list:

I saw this problem using the rtl8139 driver on 2.6.18-53.1.13.el5 (Red Hat kernel, CentOS).
ethtool -S eth0 on the guest says, when locked up:

NIC statistics:
     early_rx: 0
     tx_buf_mapped: 0
     tx_timeouts: 4
     rx_lost_in_ring: 0

Adding noapic appears to resolve the issue. The host system is running KVM-61 on the same kernel as the guest; it's a quad core x86_64.

----------------------------------------------------------------------

Comment By: Darrin Eden (darrineden)
Date: 2007-12-22 09:17

Message:
Logged In: YES
user_id=1964687
Originator: NO

Izik,

Switching to rtl8139 appears to have corrected the problem. Thank you.

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-12-21 20:55

Message:
Logged In: YES
user_id=1851802
Originator: NO

I think I remember someone reporting that it was solved for him by using rtl8139. Can you please try?
(qemu command: qemu-system-x86_64 -m 1536 -smp 4 -net nic,model=rtl8139,macaddr=52:54:00:12:35:24 -net tap -vnc localhost:0 -daemonize ubuntu710.qcow2)

----------------------------------------------------------------------

Comment By: Darrin Eden (darrineden)
Date: 2007-12-21 20:50

Message:
Logged In: YES
user_id=1964687
Originator: NO

Hi,

I believe I'm experiencing a similar condition.

- cpu model: Intel Xeon E5345
- kvm version: 56
- host kernel: 2.6.23.9
- kernel arch: x86_64
- guest: ubuntu-7.10-server-amd64, 2.6.22-14
- qemu command: qemu-system-x86_64 -m 1536 -smp 4 -net nic,macaddr=52:54:00:12:35:24 -net tap -vnc localhost:0 -daemonize ubuntu710.qcow2

symptom:

I have a couple of systems configured similarly and each exhibits this condition to a varying degree. Guest networking simply stops, seemingly dependent on load. Nothing of interest is recorded by the host or the guest at that point. The 'work around' is stopping and starting the network interface on the guest via VNC.

I don't have any hard data, but my perception is that 1) the more guests running the higher the failure frequency and 2) guests seem to fail in groups.
For instance three of eight guests will cease to network simultaneously. The remainder stay networked. I haven't been able to discover any pattern to the grouping although I have a relatively small sample size at this point.

Another perception I have is that failures occur more frequently with lots of smaller connections instead of large amounts of throughput. Again, no real data to back this observation.

Thanks for a wonderfully designed system in any case! I'm absolutely thrilled with every other aspect of kvm.

Sincerely,

- Darrin Eden

----------------------------------------------------------------------

Comment By: Fabian Deutsch (fabiandeutsch)
Date: 2007-12-18 20:10

Message:
Logged In: YES
user_id=353204
Originator: NO

Hey.

I also ran into this bug in the following configuration.

o Setup:
Host:
- Intel(R) Xeon(R) CPU X3210 @ 2.13GHz
- Mem: 4051
- F7 (all updates)
- kvm-54-341-gefdeac0
Guest:
- F8
- 2 realtek nics
- Samba share sharing a mountpoint, mounted on an iSCSI session/disk.
- A client copies one file (30GB) onto the share, which is mapped to an iSCSI disk (so much traffic going in and out).

o Symptom:
- The network dies after about 2x30GB of heavy network load.
- The guest's 1st and 3rd fields in /proc/net/softnet_stat keep increasing.
- Network works again when doing a "service network restart"
- The host sees network traffic on the tun-interface, the guest doesn't.
- It seems as if the network dies during the burst period (30GB in 6hrs, not much traffic during the other 18hrs).
- A different guest's (also F8) network, which also transfers a lot of data but not in bursts, more in a continuous stream, doesn't die.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-21 17:26

Message:
Logged In: YES
user_id=1898498
Originator: YES

izike and technologov,

Thanks so much for the attempts and effort thus far, much appreciated!
To stress the instance we have a Perl script which connects to the SMTP socket and prints to the socket to deliver an email. This script allows us to hammer up to X emails / second (we've been using 10). The load is generated by a combination of this sending and the guest OS running amavis / spamassassin, so it must check every mail that goes through. The hammering was sustained over a period of 3 hours before the connection was lost between the host and guest OS.

As stated before, I can open up our firewalls to allow you access to the systems in question and let you see what is going on - maybe you'll spot something in the configuration that we're not seeing. Let me know if this is a viable option and what information you'll need from me to follow through with this.

Best regards and have a Happy T-Giving to all!

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-11-21 09:11

Message:
Logged In: YES
user_id=1851802
Originator: NO

esila, I have tried very hard to make it die on my machine and couldn't get it to die.
Do you have any ideas what we can do?

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-19 10:49

Message:
Logged In: YES
user_id=1839746
Originator: NO

esila: Could you explain a bit more about the stress tests you have done?

I have downloaded EnGarde Secure Linux 32bit i686 v3. Which commands will stress test it?

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-18 12:14

Message:
Logged In: YES
user_id=1839746
Originator: NO

Tested on F7/x64, Intel CPU, KVM-52.
----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2007-11-18 12:12

Message:
Logged In: YES
user_id=1839746
Originator: NO

I have set up a similar configuration, but with an F7/x64 host and a SUSE 10.3/32 guest, and I have transmitted several gigs of data.

No, it just doesn't crash.

I don't have EnGarde. I'm downloading it.

Bug unreproducible.

-Alexey

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-12 18:20

Message:
Logged In: YES
user_id=1898498
Originator: YES

Hi izike,

Most certainly! Thanks for the reply!

Host Side:
Host OS: EnGarde Secure Linux 3.0.17
CPUs: 8 x Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
Latest upstream KVM from git
Host Kernel: 2.6.23-rc3
Guest OS: EnGarde Secure Linux 32bit i686, 2.4.31-1.5.60
Command Line: /usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1 &
Kernel Arch: x86_64

To produce the load I've configured a send script that connects to the guest OS over port 25 and sends X messages per second. In this case I've been hammering the system with about 10/sec.

If you need any more information, please let me know, as I can provide you access to this particular host / guest OS if need be.

Thanks!

----------------------------------------------------------------------

Comment By: Izik Eidus (izike)
Date: 2007-11-12 08:36

Message:
Logged In: YES
user_id=1851802
Originator: NO

esila,
Can you please give me the exact configuration you have:
1. on the host side
2. on the guest side
3. how can I generate the heavy load you describe?

I tried to kill the network on my machine and was not able to, so please provide as much information as you can so we can fix it.
Thanks.
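The load described by the submitter (an SMTP sender hitting port 25 at about 10 messages per second for hours) could be approximated with something like the following. This is only a minimal sketch in Python, not the submitter's Perl script, which was never posted. The names `send_one` and `hammer`, and the host and addresses in the commented example, are hypothetical.

```python
import smtplib
import time


def send_one(host, port, sender, rcpt, body):
    """Open a fresh SMTP connection and deliver a single message."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.sendmail(sender, [rcpt], body)


def hammer(send, rate_per_sec, duration_sec,
           clock=time.monotonic, sleep=time.sleep):
    """Call send() at a fixed rate for duration_sec seconds.

    Returns (sent, failed). Failures are counted rather than raised,
    because the goal is to notice when the guest stops answering.
    """
    interval = 1.0 / rate_per_sec
    start = clock()
    next_due = start
    sent = failed = 0
    while clock() - start < duration_sec:
        try:
            send()
            sent += 1
        except OSError:  # socket errors and smtplib.SMTPException
            failed += 1
        # Schedule against absolute deadlines so slow sends do not
        # stretch the overall rate.
        next_due += interval
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
    return sent, failed


# Example (not run here): 10 messages per second for 3 hours.
# Host name and addresses are placeholders.
# hammer(lambda: send_one("guest.example", 25, "load@example.org",
#                         "user@guest.example",
#                         "Subject: load test\r\n\r\nbody\r\n"),
#        rate_per_sec=10, duration_sec=3 * 3600)
```

A rising `failed` count with a flat `sent` count would mark the moment the guest's networking stops, which may help correlate the lockup with guest-side counters.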
----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-08 14:58

Message:
Logged In: YES
user_id=1898498
Originator: YES

Used the latest upstream of KVM as of 11/6 and networking died after two and a half hours of heavy load. Will be repeating with -no-kvm-irqchip and dsahern's 'noapic' kernel switch later on and updating everyone.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-11-02 22:39

Message:
Logged In: YES
user_id=1898498
Originator: YES

Used the latest upstream of KVM and tried dsahern's 'noapic' switch in the kernel options. Networking died after a 3 hour period of intensive load.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-25 19:11

Message:
Logged In: YES
user_id=1755596
Originator: NO

In my case, I found that a workaround is adding 'noapic' to the guest kernel options. It ran fine for an hour with a moderately heavy, continuous load on it.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-25 17:35

Message:
Logged In: YES
user_id=1898498
Originator: YES

Just adding that I will be able to provide access and open up our firewalls to the particular instance in question if this will aid in resolving this bug.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-23 19:22

Message:
Logged In: YES
user_id=1755596
Originator: NO

I am still experiencing the problem with kvm-48 and both nic models -- rtl8139 and ne2k_pci. Host this time is a dual cpu, dual core PowerEdge 2950 running RHEL5. Guest is running RHEL4U4.
qemu command (for rtl8139 nic):

qemu-system-x86_64 -boot c -localtime -hda /opt/kvm/images/cucm.vmdk -m 1536 -smp 2 -serial file:/tmp/serial.log -net nic,macaddr=00:1a:4b:34:74:52,model=rtl8139 -net tap,ifname=tap0,script=/bin/true -vnc :2 -monitor stdio

I have tried with and without kqemu.

In my case rtl8139 quits working fairly quickly; ne2k_pci takes much longer. Again looking at the softnet_stats I see the time_squeeze (column 3) and cpu_collision (column 9) counters incrementing from startup. Here's an example after 24 minutes of uptime and roughly 5 minutes of network traffic:

[root@vm-cucm ~]# cat /proc/net/softnet_stat
000be1b5 00000b8b 0000075d 00000006 00000000 00000000 00000000 00000000 005049d2
00021220 00000000 000000f9 00000000 00000000 00000000 00000000 00000000 006820c3

IRQ cpu % is fairly high. For example with the ne2k nic it is in the 5-6% range during "light" traffic (VOIP endpoint registration) and in the 12-19% range during my top end load (VOIP traffic). This irq CPU load is *much* higher than both xen and vmware.

From the host side I do not notice any change in CPU consumption by the qemu process during the network lockup.

Not sure if this matters, but I have not been able to start any of my images (rhel3u8, rhel4u4, or rhel5) with -smp set to 4; 1 or 2 vcpus seem to be the only options.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-23 14:03

Message:
Logged In: YES
user_id=1915499
Originator: NO

Yes, my test (FTP) works with KVM-48 and nic rtl8139, and this model is faster than the ne2000.

ne2000: 5.5 Mbits/s
rtl8139: 9.6 Mbits/s

I will do some new tests to stress the network.

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-10-22 09:49

Message:
Logged In: YES
user_id=539971
Originator: NO

There are reports on the list that model=rtl8139 doesn't have this problem.
So this may be an issue with ne2k emulation.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-21 21:34

Message:
Logged In: YES
user_id=1915499
Originator: NO

I have done the tests on AMD and Intel and with KVM-48, and it's the same problem.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-19 17:49

Message:
Logged In: YES
user_id=1755596
Originator: NO

If someone familiar with the kvm code can point me in a direction (files/functions) I'd be happy to help track this down.

david

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-19 15:43

Message:
Logged In: YES
user_id=1898498
Originator: YES

A quick update on recent test runs:

> - same, plus the -no-kvm-irqchip flag

Networking still went down under heavy load.

> - same, plus the -no-kvm flag (should work now)

This worked, but there was an obvious drop in performance.

----------------------------------------------------------------------

Comment By: david ahern (dsahern)
Date: 2007-10-18 18:38

Message:
Logged In: YES
user_id=1755596
Originator: NO

I ran into this problem as well using kvm-46. Host OS is RHEL5. Guest OS is RHEL4. Problem is repeatable with the ne2k_pci nic model as well as rtl8139. I also tried limiting to 1 cpu.

qemu command line is:

qemu-system-x86_64 -boot c -localtime -hda /opt/kvm/images/my.vmdk -m 1536 -smp 2 -serial file:/tmp/serial.log -net nic,macaddr=00:1a:4b:34:74:52,model=ne2k_pci -net tap,ifname=tap0,script=/bin/true -vnc :2 -monitor stdio

Looking at /proc/net/softnet_stat of the guest OS, the third column is increasing when the network lockup happens, which means packets are getting dropped due to time squeeze. I believe that means the receive softirq is taking too long, which in turn means the communication with the device is taking too long.
Taking the interface down and back up clears the problem until it reaches some packet threshold again.

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-18 10:00

Message:
Logged In: YES
user_id=1915499
Originator: NO

I have done several tests, with the same host and guest. I use Slackware 11 and kernel 2.6.20. The CPU of the host is an AMD 3800 64 X2 with 1 GB of RAM.

The test is for the guest to fetch a 4 GB file from the host over FTP in binary mode, with the get command run in a loop.

The network has died under heavy load since kvm-36. In kvm-35 all is good.

kvm 35: OK
kvm 36: Only one get succeeds
kvm 37-39: Doesn't compile or module doesn't load
kvm 40-46: Network crash

With the -no-kvm switch: kvm-46 is OK

----------------------------------------------------------------------

Comment By: Doc Watson (doc_watson)
Date: 2007-10-17 18:31

Message:
Logged In: YES
user_id=1915499
Originator: NO

I have the same problem with kvm-46 when I use FTP with big files. With the -no-kvm flag all is good. There is the same problem with the kqemu module.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-16 23:46

Message:
Logged In: YES
user_id=1898498
Originator: YES

> Does this occur:
> - when using the modules provided by kvm-46 (with kvm-46 userspace)

We were using a build from your git repository:
http://git.kernel.org/?p=linux/kernel/git/avi/kvm.git;a=summary

Not from the kvm-46 tarball.

> - same, plus the -no-kvm-irqchip flag
> - same, plus the -no-kvm flag (should work now)

These would only apply to userspace.
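Several comments in this thread read /proc/net/softnet_stat by eye. The counters are hexadecimal, one row per CPU, so a small script makes snapshots easier to compare. The following is a minimal sketch that assumes the column layout reported in the thread for these 2.6-era guests (column 3 is time_squeeze, column 9 is cpu_collision), plus the usual meaning of columns 1 and 2 (processed and dropped). Other kernel versions may lay the file out differently. The function names are hypothetical, and the sample rows are the figures dsahern posted.

```python
# Zero-based column index -> counter name. Columns 3 and 9 are named as
# in the thread; treat the mapping as an assumption for other kernels.
FIELDS = {0: "processed", 1: "dropped", 2: "time_squeeze", 8: "cpu_collision"}


def parse_softnet_stat(text):
    """Return one dict per CPU row, with hex counters converted to int."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if not cols:
            continue
        values = [int(c, 16) for c in cols]
        rows.append({name: values[i]
                     for i, name in FIELDS.items() if i < len(values)})
    return rows


def squeeze_delta(before, after):
    """Per-CPU growth of time_squeeze between two parsed snapshots."""
    return [b["time_squeeze"] - a["time_squeeze"]
            for a, b in zip(before, after)]


# Sample taken from the 2007-10-23 comment (two vcpus, nine columns each).
SAMPLE = """\
000be1b5 00000b8b 0000075d 00000006 00000000 00000000 00000000 00000000 005049d2
00021220 00000000 000000f9 00000000 00000000 00000000 00000000 00000000 006820c3
"""

for cpu, row in enumerate(parse_softnet_stat(SAMPLE)):
    print(cpu, row)
```

On a live guest, two snapshots read from /proc/net/softnet_stat a few seconds apart and passed to `squeeze_delta` would show whether time_squeeze is still climbing during a lockup.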
----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-10-16 16:41

Message:
Logged In: YES
user_id=539971
Originator: NO

Does this occur:
- when using the modules provided by kvm-46 (with kvm-46 userspace)
- same, plus the -no-kvm-irqchip flag
- same, plus the -no-kvm flag (should work now)

This will show whether the problem is in the kernel or userspace. If it's a kernel issue, we will supply a backport to 2.6.23.

----------------------------------------------------------------------

Comment By: Eckie Silapaswang (esila)
Date: 2007-10-15 19:06

Message:
Logged In: YES
user_id=1898498
Originator: YES

Just wanted to follow up: upgraded to the latest kvm and kernel (2.6.23) and the issue still exists. The host OS is EnGarde Secure Linux 3.0.17 - can provide a testing environment if needed - all other information is included in the previous bug entry.

----------------------------------------------------------------------

Comment By: SourceForge Robot (sf-robot)
Date: 2007-10-10 04:20

Message:
Logged In: YES
user_id=1312539
Originator: NO

This Tracker item was closed automatically by the system. It was previously set to a Pending status, and the original submitter did not respond within 14 days (the time period specified by the administrator of this Tracker).

----------------------------------------------------------------------

Comment By: Matt Piermarini (mpiermar)
Date: 2007-09-28 21:22

Message:
Logged In: YES
user_id=544440
Originator: NO

Just an FYI, I also see this error. I'm running KVM-44 with kernel modules from kernel-2.6.23-rc8. It happens when my guest (RHEL5) is reading files from an NFS mount across the local LAN.

Run command:
-hda Disk1.img -hdb Disk2.qcow2 -boot c -net nic,vlan=0 -net tap,vlan=0,ifname=tap1,script=/etc/qemu-ifup -m 1024 -localtime -no-kvm-irqchip

It does not happen on demand, but I'll try to isolate switches/kernel modules. If I find anything, I'll post back.
----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2007-09-25 16:47

Message:
Logged In: YES
user_id=539971
Originator: NO

Please repeat with the latest kvm. The bug may have been already fixed.

----------------------------------------------------------------------