I had something similar in the past. It turned out to be a broken
network card (3COM 3C905). I hope this info is of value to you.
Török Edvin wrote:
On 7/7/06, Rene Mayrhofer <rene.mayrhofer@xxxxxxxxxxxx> wrote:
Hi all,
Hi,
[why did this mail get delayed 3 days?
Received: from localhost ([127.0.0.1] helo=vishnu.netfilter.org)
by vishnu.netfilter.org with esmtp (Exim 4.60 #1 (Debian))
id 1FzulL-0003tk-Tz; Mon, 10 Jul 2006 14:21:19 +0200
Received: from jupiter.gibraltar.at ([80.120.3.32])
by vishnu.netfilter.org with esmtps (Exim 4.60 #1 (Debian))
id 1Fys2y-0001ni-3E
for <netfilter@xxxxxxxxxxxxxxxxxxx>; Fri, 07 Jul 2006 17:15:13 +0200
]
For a few days now, we have had a serious kernel problem on one of our
production firewalls. It seems to be a problem in the interaction between
openswan's KLIPS, netfilter, and routing code, at least as far as I see
it. This problem is reproducible on one system (we have not yet managed
to figure out how to reproduce it on another system) running kernel
2.4.32 with KLIPS 2.4.5 (and pluto 2.4.5):
Is this similar to:
http://oss.sgi.com/archives/netdev/2004-12/msg00484.html?
Does your system continue to work after this, or does it flood your logs?
do_IRdo_IRQ: stack overflow: 292
c0252788 00000124 00000120 c028e000 61776bb8 644a8a78 c4562900 c0120da3
00000000 000000e4 00000080 61776bb8 644a8a78 c4562900 0000003c 00000018
c0280018 ffffff12 e0364a81 00000010 00000206 00000008 c028e764 c4562880
<snip>
messages appear as fast as the console can print them. ksymoops decodes
them to
....
Trace; c011e861 <do_IRQ+e1/120>
Trace; c0120da3 <call_do_IRQ+5/12>
Trace; c011b280 <default_idle+0/50>
Trace; c011b2a3 <default_idle+23/50>
Trace; c011b332 <cpu_idle+42/70>
Trace; c011918c <L6+0/2>
Trace; c0120da3 <call_do_IRQ+5/12>
Why does this enter cpu_idle at all?
Trace; e0364a81 <[ipsec]aes_encrypt+8d1/f40>
Trace; e0365388 <[ipsec]aes_decrypt+298/f60>
Trace; e036576f <[ipsec]aes_decrypt+67f/f60>
Trace; e03604d0 <[ipsec]ipsec_rcv_esp_decrypt_setup+50/80>
Trace; e038f620 <[ipsec]SHA1Final+18830/25270>
Trace; e035edec <[ipsec]pfkey_msg_interp+8c/337>
Trace; e038f620 <[ipsec]SHA1Final+18830/25270>
Trace; e0350361 <[ipsec].text.lock.ipsec_proc+2c/3b>
Trace; e038f620 <[ipsec]SHA1Final+18830/25270>
Trace; e0347d75 <[ipx].text.lock.af_ipx+115/150>
Trace; e038d580 <[ipsec]SHA1Final+16790/25270>
Trace; e0351f91 <[ipsec]ipsec_tunnel_ioctl+51/240>
Trace; e034a040 <[ipx]ipx_unregister_sysctl+2270/2290>
Trace; c01e4bab <kfree_skbmem+b/70>
Trace; e034d5fb <[ipsec]rj_refines+ab/c0>
Trace; e034e3f1 <[ipsec]rj_walktree+d1/200>
Trace; c01f6748 <qdisc_restart+c8/160>
Trace; c01e8fdd <dev_queue_xmit+18d/430>
Trace; c0205800 <ip_finish_output2+0/150>
Trace; c01f2c0b <nf_hook_slow+15b/1c0>
Trace; c0205db7 <ip_output+137/1e0>
Trace; c0205800 <ip_finish_output2+0/150>
Trace; c0205650 <output_maybe_reroute+0/10>
Trace; c020565b <output_maybe_reroute+b/10>
Trace; c01f2c0b <nf_hook_slow+15b/1c0>
Trace; c0206a1d <ip_build_xmit_slow+3ad/5b0>
Trace; c0205650 <output_maybe_reroute+0/10>
Trace; c0206ee8 <ip_build_xmit+2c8/420>
Second time it goes through this code.
Trace; c02294e0 <icmp_glue_bits+0/d0>
Trace; c0229a1b <icmp_send+2db/390>
Trace; c02294e0 <icmp_glue_bits+0/d0>
Trace; e0347d75 <[ipx].text.lock.af_ipx+115/150>
Trace; e038d580 <[ipsec]SHA1Final+16790/25270>
Trace; e03524c7 <[ipsec]ipsec_tunnel_init+f7/140>
Trace; e034a040 <[ipx]ipx_unregister_sysctl+2270/2290>
Trace; c01e4bab <kfree_skbmem+b/70>
Trace; e034d5fb <[ipsec]rj_refines+ab/c0>
Trace; e034e3f1 <[ipsec]rj_walktree+d1/200>
Trace; c01f6748 <qdisc_restart+c8/160>
Trace; c01e8fdd <dev_queue_xmit+18d/430>
Trace; c0205800 <ip_finish_output2+0/150>
Trace; c01f2c0b <nf_hook_slow+15b/1c0>
Trace; c0205db7 <ip_output+137/1e0>
Trace; c0205800 <ip_finish_output2+0/150>
Trace; c0205650 <output_maybe_reroute+0/10>
Trace; c020565b <output_maybe_reroute+b/10>
Trace; c01f2c0b <nf_hook_slow+15b/1c0>
Trace; c0206a1d <ip_build_xmit_slow+3ad/5b0>
Trace; c0205650 <output_maybe_reroute+0/10>
Trace; c0206ee8 <ip_build_xmit+2c8/420>
^^
Trace; c02294e0 <icmp_glue_bits+0/d0>
Trace; c0229a1b <icmp_send+2db/390>
Trace; c02294e0 <icmp_glue_bits+0/d0>
Trace; e0347d75 <[ipx].text.lock.af_ipx+115/150>
Trace; e038d580 <[ipsec]SHA1Final+16790/25270>
Trace; e03524c7 <[ipsec]ipsec_tunnel_init+f7/140>
Trace; e034a040 <[ipx]ipx_unregister_sysctl+2270/2290>
...
etc. The only unusual bit on this machine seems to be that there are SNAT
rules for traffic going into ipsec interfaces, i.e. with -A POSTROUTING
-o ipsec0. It runs 10 IPSec tunnels on 2 interfaces, 5 tunnels on each
interface.
Any ideas what might be wrong? Right now this is a critical problem for
us, and I would be grateful for any pointer on what to try. I can spare
some time to try and track it down. Currently, we are trying to remove
the SNAT rules and work around that, but as we cannot trigger the problem
(we just wait for it to happen, usually a few times a day), we cannot
reliably check whether that fixes it.
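The decoded trace above appears to run through the same sequence of
frames more than once. That can be checked mechanically rather than by
eye. What follows is only a minimal sketch in Python, assuming the
ksymoops output has been saved as text in the "Trace; addr <sym+off/len>"
form shown above. The names parse_symbols and find_repeated_block are
made up for this illustration and are not part of ksymoops or any other
tool. Keep in mind that ksymoops decodes every plausible address found on
the stack, so some frames may be stale and the result is a hint, not
proof of recursion.

```python
import re

# Matches lines such as:
#   Trace; c01e4bab <kfree_skbmem+b/70>
#   Trace; e034d5fb <[ipsec]rj_refines+ab/c0>
# Group 1 is the optional module name, group 2 the symbol name.
TRACE_RE = re.compile(r"^Trace;\s+[0-9a-f]+\s+<(?:\[(\w+)\])?([^+>]+)\+")


def parse_symbols(text):
    """Return the symbol names from ksymoops 'Trace;' lines, in order.

    Lines that are not 'Trace;' lines (comments, '...', hex dumps) are
    skipped. Offsets and addresses are dropped so that two calls into
    the same function compare equal.
    """
    symbols = []
    for line in text.splitlines():
        match = TRACE_RE.match(line.strip())
        if match:
            module, name = match.group(1), match.group(2)
            symbols.append(f"[{module}]{name}" if module else name)
    return symbols


def find_repeated_block(symbols, min_len=3):
    """Find the shortest block of frames immediately followed by a copy.

    Returns (start_index, block_length) for the first such block of at
    least min_len frames, or None if no block repeats back to back.
    """
    n = len(symbols)
    for length in range(min_len, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            first = symbols[start:start + length]
            second = symbols[start + length:start + 2 * length]
            if first == second:
                return start, length
    return None
```

Feeding the full decoded trace to parse_symbols and then to
find_repeated_block would show whether the ip_build_xmit / icmp_send /
ipsec sequence really recurs back to back, and how many frames one round
takes.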
Would recompiling with a bigger stack size fix it?
[disclaimer: I am not a netfilter developer]
Cheers,
Edwin
--
Ing. A.C.J. van Amersfoort (Arno)
Department Of Electronics (ELD, k1007)
Huygens Laboratory
Leiden University
P.O. Box 9504
Niels Bohrweg 2
2333 CA Leiden
The Netherlands
----------------------------------------------------------------
Phone : +31-(0)71-527.1894 Fax: +31-(0)71-527.5819
E-mail: a.c.j.van.amersfoort@xxxxxxxxxxxxxxxxxxxxxxxxx
----------------------------------------------------------------
Arno's (Linux firewall) homepage: http://rocky.eld.leidenuniv.nl