[ kvm-Bugs-2506814 ] TAP network lockup after some traffic

Bugs item #2506814 was opened at 2009-01-14 11:38
Message generated for change (Comment added) made by rdoitaly
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2506814&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Fabio Coatti (cova)
Assigned to: Nobody/Anonymous (nobody)
Summary: TAP network lockup after some traffic

Initial Comment:
Hi all,
we are experiencing severe network trouble using kvm+tap networking.
Basically, after some network load (we have yet to identify the exact amount of traffic, if such a threshold exists), the network stops working.
During the lockups, tcpdump shows ARP requests on the guest interface, then on the tap, brX, and physical interfaces on the host system.
The ARP answers can be seen, with tcpdump, only on the physical host interface and the bridge (brX), but not on the tap device. Basically, it seems that packets coming from the external network get lost in the tap device on the way to the guest (kvm). Looking at the tap device with ifconfig, the only odd thing is that the TX overrun count is > 0; every time the network stops working, the overrun count increases.
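A quick way to watch for this symptom is to poll the tap device's counters and note when the TX overrun count moves. This is only a sketch: the helper name is hypothetical, the interface name is an assumption, and the pattern assumes the older ifconfig layout that prints "overruns:" with a colon.

```shell
# Pull the TX overrun counter out of ifconfig-style output on stdin.
# (Hypothetical helper; adjust the pattern if your ifconfig formats differ.)
extract_tx_overruns() {
  grep 'TX packets' | sed -n 's/.*overruns:\([0-9]*\).*/\1/p'
}

# Example against a canned snippet (values are illustrative, not from the bug):
sample='tap0      Link encap:Ethernet  HWaddr 00:FF:12:34:56:78
          TX packets:123456 errors:0 dropped:0 overruns:42 carrier:0'
printf '%s\n' "$sample" | extract_tx_overruns   # prints: 42
```

In practice one would run "ifconfig tap0 | extract_tx_overruns" from a loop or cron job and compare successive values against the moments the guest's network dies.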
This has been observed with several kvm releases (for sure 76/77/78/79/80/81/82) and with different kernels (we tried versions among 2.6.25.x, 2.6.26.x, 2.6.27.x, and 2.6.28), on both the guest and the host side.
We tried several network drivers (virtio, e1000, rtl) and all show the same problem. Only 100 Mbit drivers seem to be unaffected so far (and only virtio has acceptable performance).
(By the way: on the host machine we have VLANs on top of the ethX devices.)
The number of CPUs on the guest makes no difference.
We tried both the kvm modules provided by the vanilla kernel and the modules shipped with the kvm package.
guest: 32 bit
host: 64 bit
host machine:

2 x quad-core AMD Opteron 2352, 16 GB RAM, Gentoo.

Of course I can provide more details, perform other tests, and try patches; any hints or advice would be most welcome.

Thanks.




----------------------------------------------------------------------

Comment By: Marcello Magnifico (rdoitaly)
Date: 2009-08-11 20:36

Message:
I have a variation of the same problem, as I'm NOT using KVM. My
configuration: a physical 32-bit machine with Debian 5 (Linux 2.6.26-2
i686), with eth0 up but carrying no IP, attached to br0 (which holds the
machine's physical IP), and tap0 also attached to br0.
tap0 is used by Qemu+Kqemu, running a nested Debian 5 machine with a
different IP in the same class as the physical one (and, yes, a manually
assigned, unique MAC address). Everything runs fine until I decide to
migrate some GB of data from another physical system via SSH. After roughly
the first 60-80 MB transferred, the nested system is isolated, one way.
Running tcpdump on br0 helped me understand that the nested system can
send out ARP requests, and replies seem to be received, but they never get
"inside".
There are no overruns visible, in either the physical or the nested
machine, but the symptom is otherwise the same. It never happens while
copying short files; you need something around 4 megabytes or more to
reproduce it (I have both short and long files in the lot, so the
difference was easy to notice).
I'm an absolute beginner at this kind of low-level debugging, so I hope
you'll forgive me for being naive in what follows. I tried the following
workarounds more or less at random, all without effect (which at least
tells us where the problem probably isn't, narrowing the field):

- lowering the MTU within the nested system to about 1400 instead of the
default 1500 (I have seen connections hang after a while because of MTU
issues)
- raising the txqueuelen of br0 to 500 (it defaulted to 0)
- forcing both eth0 and br0 into promiscuous mode (they weren't, initially)
- placing a static route on the physical machine stating that the nested
machine's IP is reachable through br0 (that, like the above, was well
before watching tap0's traffic via tcpdump; my first idea was ARP issues
or something similar)
- forcing a different network card emulation at qemu startup (I'm using
ne2k_pci at this time)
- skyrocketing txqueuelens (up to 3000) on eth0, br0, and tap0, after
discovering the default was 1000 on the eth0 of the nested, virtual machine
- lowering the txqueuelen on the nested machine's eth0 to 500
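For anyone retrying these experiments, the tweaks above translate roughly into the following commands. This is a sketch only: the device names come from this report, while the route target address is a placeholder.

```shell
# Inside the nested (guest) system: lower the MTU, and later the queue length.
ip link set eth0 mtu 1400
ip link set eth0 txqueuelen 500

# On the physical host: raise br0's queue length, force promiscuous mode,
# and pin a static route toward the nested machine (10.0.2.15 is a placeholder).
ip link set br0 txqueuelen 500
ip link set eth0 promisc on
ip link set br0 promisc on
ip route add 10.0.2.15/32 dev br0

# The "skyrocketing" variant: push txqueuelen up to 3000 everywhere.
ip link set eth0 txqueuelen 3000
ip link set br0  txqueuelen 3000
ip link set tap0 txqueuelen 3000
```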

Dmesg offers no clue at all. My best guess is some sort of "soft"/"silent"
buffer overrun while being flooded with data, as the data were always the
same (copied in the same sequence) but the moment the flood stops is never
the same. Anyway, this experience suggests it is probably not a kvm issue
but rather the "tap" driver and nothing else (how do we file a kernel bug
report? heeheehee). Running "ifdown eth0" and "ifup eth0" within the
nested machine brings everything back online by resetting "something",
but I wouldn't call that a real solution.

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-08-05 13:02

Message:
All my tests have been made with the kernel modules.

----------------------------------------------------------------------

Comment By: Fabio Coatti (cova)
Date: 2009-08-05 12:56

Message:
mellen, did you test with the kernel modules or the kvm-provided modules?

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-08-04 17:48

Message:
Re-tested with kvm-88, host kernel 2.6.30.3, and guest kernel 2.6.30.4.
I have not been able to reproduce the problem so far; the test has been
running for 8 hours and 15 minutes now.

----------------------------------------------------------------------

Comment By: Daniel Schwager (danny1)
Date: 2009-07-13 11:09

Message:
> I installed KVM-86 at the moment and the first test works fine - no
> disconnect at the moment (..)

I got the disconnect with KVM-86 also.

c_jones, did you check it with the current KVM-88 release?


----------------------------------------------------------------------

Comment By: Daniel Schwager (danny1)
Date: 2009-07-11 09:24

Message:
I can confirm exactly the same behavior with kvm-84; all my machines also
use a DHCP daemon to request their IP addresses.

> My guest nodes are configured to get their IP addresses with DHCP. At the
> DHCP server I see the DHCPDISCOVER requests from the guests, and I see the
> DHCP server replying with DHCPOFFERs, but the guest never sees the offers
> come in.

I installed KVM-86 at the moment and the first test works fine - no
disconnect at the moment (..)


----------------------------------------------------------------------

Comment By: Chris Jones (c_jones)
Date: 2009-07-11 05:14

Message:
I'm experiencing exactly the same problem.

I've been running kvm-84 for quite a while and it works great there.  But
I tried moving to kvm-87, and on kvm-87 I'm getting exactly the same
behavior you're reporting here -- except all the time (it never works).

My guest nodes are configured to get their IP addresses with DHCP.  At the
DHCP server I see the DHCPDISCOVER requests from the guests, and I see the
DHCP server replying with DHCPOFFERs, but the guest never sees the offers
come in. 

So, just like the other reports - packets outbound from the guest are
fine, but inbound they seem to get dropped. 

Like the other reports, the DHCP requests are UDP, so I guess this might
lend weight to the argument that this happens more often with UDP.

My environment (where it fails)
   kvm-87
   kernel 2.6.27.26
   Using the kvm-87 user component and kernel component.

On kvm-84 everything works fine (which seems to contrast with another
person's report).

Mine is easily reproducible - let me know if you want me to collect any
information.

----------------------------------------------------------------------

Comment By: Daniel Schwager (danny1)
Date: 2009-07-08 21:16

Message:
> Mellen, did you try KVM-85 ?
Not -85, but did you try the current one, KVM-87?


----------------------------------------------------------------------

Comment By: Daniel Schwager (danny1)
Date: 2009-07-08 21:13

Message:
Hi,

I have the same (?) problem. We are running about 20 WinXP guests accessed
via RDP on a 64-bit FC9 host with KVM-84, bridged networking, and tap
devices with e1000, smp 1.

Sometimes (mostly (!) 3-4 minutes after starting a WinXP VM) I lose the
RDP connection. During the disconnection, a running ping inside the VM
reports problems. Only 2-3 seconds later I can reconnect RDP and everything
works fine ... until the next disconnect.

Mellen, did you try KVM-85?

regards
Danny



----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-07-02 17:22

Message:
So far, I have been unable to reproduce this with a guest using only one
cpu (-smp 1). Previous tests have been made on guests with -smp 2 or 4.

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-06-29 18:43

Message:
I have been able to reproduce this twice using pktgen on a remote host
shooting UDP packets through a kvm guest. The guest stalled after about 10
minutes of heavy UDP traffic (40000 packets per second, 200 bytes each).

The KVM guest has a simple iptables NAT rule forwarding the UDP packets
from eth0 out vlan1000 (on the same physical interface).
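For anyone trying to reproduce this, a pktgen setup matching that description might look like the sketch below. Everything concrete in it is an assumption: the device, destination address and MAC, and the kpktgend thread depend on the actual test host, and the final iptables line is only a guess at what the guest's NAT rule could have been. pktgen's delay parameter is in nanoseconds, so 25000 gives roughly the 40000 packets per second mentioned.

```shell
# On the remote (traffic-generating) host: load pktgen and aim it at the guest.
modprobe pktgen
PG=/proc/net/pktgen

echo "rem_device_all"  > $PG/kpktgend_0   # reset the first pktgen thread
echo "add_device eth0" > $PG/kpktgend_0   # transmit on eth0

echo "pkt_size 200" > $PG/eth0            # 200-byte packets, as in the report
echo "count 0"      > $PG/eth0            # run until explicitly stopped
echo "delay 25000"  > $PG/eth0            # 25 us inter-packet gap => ~40000 pps
echo "dst 192.0.2.50"            > $PG/eth0   # placeholder destination IP
echo "dst_mac 00:16:3e:00:00:01" > $PG/eth0   # placeholder next-hop MAC

echo "start" > $PG/pgctrl                 # blocks while traffic runs

# On the kvm guest, the NAT rule could have been as simple as (a guess):
# iptables -t nat -A POSTROUTING -o vlan1000 -p udp -j MASQUERADE
```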


----------------------------------------------------------------------

Comment By: Fabio Coatti (cova)
Date: 2009-06-09 13:35

Message:
The only way I have found to reproduce this issue is to have the guest
generate network traffic (say, using wget or ftp against some external
site). After some amount of time or traffic the network stops working.
I'll try to reproduce it again on newer kvm versions, but I have not
identified a more reliable way to cause the hangs.

Thanks.




----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-06-08 14:31

Message:
I haven't found a reliable way to make a guest lose its network, but it
seems UDP traffic is more likely to trigger the problem.

A longer (30-60 minute) RDP (Remote Desktop Protocol) session of about
5 Mbps from one Microsoft box to another, crossing a guest acting as
router/firewall, seems to kill the network on that guest. That has happened
at least 3 times.

That's as close to reproducing the problem as I've gotten so far.

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2009-06-08 13:57

Message:
Is there a reliable way to reproduce this?

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-05-07 13:29

Message:
Just had another stall. Different host, different guest. Only one guest on
that host system.

What information would help debug this? kvm_stat output follows:

efer_reload                    0         0
exits                 11147836817      5827
fpu_reload              31483547         1
halt_exits            8041208101      1397
halt_wakeup           5069136218         0
host_state_reload     13598781750      1507
hypercalls            5890581948         4
insn_emulation        8114024171      4338
insn_emulation_fail            0         0
invlpg                         0         0
io_exits              5263494017         2
irq_exits              781045211        53
irq_window                     0         0
largepages                     0         0
mmio_exits                 61151         0
mmu_cache_miss         156941726         0
mmu_flooded            158265518         0
mmu_pde_zapped          96062900         0
mmu_pte_updated       2302220251         4
mmu_pte_write         1759022770         4
mmu_recycled                   0         0
mmu_shadow_zapped      184691156         0
pf_fixed              4270608276         0
pf_guest              1226004351         0
remote_tlb_flush       481003516         2
request_irq                    0         0
signal_exits                   1         0
tlb_flush             4539655572        29
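When pasting kvm_stat dumps like the one above, it can help to show only the counters that actually moved during the sample interval. A small, hypothetical filter, assuming the name/total/delta layout above:

```shell
# Print the names of counters whose per-interval delta (third column) is
# non-zero; everything else is noise when hunting for a stall.
active_counters() {
  awk 'NF == 3 && $3 + 0 > 0 { print $1 }'
}

# Illustrative run against two lines in the layout above:
printf 'efer_reload 0 0\nexits 11147836817 5827\n' | active_counters
# prints: exits
```

One would pipe a whole kvm_stat snapshot through it, e.g. compare the active set on a healthy guest against a stalled one.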


----------------------------------------------------------------------

Comment By: Fabio Coatti (cova)
Date: 2009-05-07 10:51

Message:
We are still bitten by this issue. I'm running out of ideas, but if someone
can give me hints on how to track down the problem, or at least how to
collect more information, I'll try it.

Thanks.

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-05-07 10:39

Message:
Just happened again. It seems that interrupts stop being generated for the
virtio devices:

 10:       3001      51199   IO-APIC-fasteoi   virtio1, virtio2
 11:    7694835    8468134   IO-APIC-fasteoi   virtio0

Running /bin/ls froze the guest for a minute or so until a Ctrl-C got
through. After /bin/ls:

 10:       3001      51220   IO-APIC-fasteoi   virtio1, virtio2
 11:    7694835    8468134   IO-APIC-fasteoi   virtio0
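One way to confirm that an IRQ line has genuinely stopped, as the counts above suggest, is to diff two /proc/interrupts snapshots taken a few seconds apart: any virtio line that is byte-for-byte identical in both never fired in between. A sketch (the helper name and snapshot filenames are placeholders):

```shell
# Report virtio IRQ lines that did not change between two snapshot files;
# an identical line means its counters never moved in the interval.
stalled_virtio_irqs() {
  a=$(mktemp)
  grep virtio "$1" | sort > "$a"
  grep virtio "$2" | sort | comm -12 "$a" -
  rm -f "$a"
}

# Hypothetical usage:
#   cat /proc/interrupts > snap1; sleep 5; cat /proc/interrupts > snap2
#   stalled_virtio_irqs snap1 snap2
```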

kvm_stat:
efer_reload                    0         0
exits                 19150288617      1082
fpu_reload             119767301        34
halt_exits             544721221       194
halt_wakeup            300187634       141
host_state_reload      837279034       259
hypercalls            4991133618         0
insn_emulation        4797940911       709
insn_emulation_fail            0         0
invlpg                         0         0
io_exits               281254462        65
irq_exits              173205458         0
irq_window                     0         0
largepages                     0         0
mmio_exits                720457         0
mmu_cache_miss         290275800         0
mmu_flooded            289036380         0
mmu_pde_zapped         170308104         0
mmu_pte_updated       6239297430         0
mmu_pte_write         20470585377         0
mmu_recycled                   0         0
mmu_shadow_zapped      335324686         0
pf_fixed              8324827214         0
pf_guest              2229555421         0
remote_tlb_flush       283772247         0
request_irq                    0         0
signal_exits                   6         0
tlb_flush             5334530556        66

... but there are 6 other guests on this host running just fine.

Can't connect gdb to the kvm's gdbserver. It just says "Remote 'g' packet
reply is too long: ...."

Issuing system_reset stalled for a minute, then rebooted the guest. After
the reboot, the guest is fine again.

----------------------------------------------------------------------

Comment By: Tais P. Hansen (mellen)
Date: 2009-05-06 17:07

Message:
I'm curious: what is the status of this issue?

I'm experiencing the same problem, apparently at random, on guests.
Restarting the network interface does not seem to fix the problem; only a
reboot (or system_reset in the kvm/qemu console) solves it.

The last time it happened was on a host with kvm-84 and a guest with kernel
2.6.27 using virtio-net. Leading up to the stall it had carried a traffic
load of about 5 Mbps, constantly up and down, for just over 2 hours, with
one 12 Mbps spike.

I did not check interface counters or network traces at the time.


----------------------------------------------------------------------

Comment By: Fabio Coatti (cova)
Date: 2009-01-14 14:24

Message:
It seems quite similar indeed, but ip link set eth up/down on the guest
side seems to have no effect. Besides that, looking at the thread you
pointed me to, I can see another difference: sniffing on the tap device in
my case shows only outgoing packets (i.e. packets leaving the kvm guest).
So it could be the very same issue, but some differences are present.
At least we are now seeing this on newer kernels and kvm revisions :)


----------------------------------------------------------------------

Comment By: Mark McLoughlin (markmc)
Date: 2009-01-14 12:19

Message:
Does ifup/ifdown in the guest fix the hang?

If so, it sounds like the issue discussed in this long thread:

  http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg06774.html

We still haven't got to the bottom of it.

----------------------------------------------------------------------

