2.2.14 SMP 3com905: transmit timed out: Odd lost irq and ip-stack lockup

Dear list,

I run a Compaq Proliant 1500 (dual Pentium 75.200) with hardware RAID
(Smart2) and two 3com905 Ethernet cards (b or c, I can't tell you right
now) as a firewall and web/mail virus scanner which (needless to say)
needs to be up 24 hours a day, 7 days a week.

Recently, during a pretty fast download, the machine locked up.
(Technically only the Ethernet did: you could still log in on the console
and even ping the Ethernet IP address.) The error log was:

Oct  9 17:29:02 fwintern kernel: eth0: transmit timed out, tx_status 00 status e681.
Oct  9 17:29:02 fwintern kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Oct  9 17:29:02 fwintern kernel:   Flags; bus-master 1, full 0; dirty 6543269 current 6543269.
Oct  9 17:29:02 fwintern kernel:   Transmit list 00000000 vs. cbef4250.
Oct  9 17:29:02 fwintern kernel:   0: @cbef4200  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   1: @cbef4210  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   2: @cbef4220  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   3: @cbef4230  length 800005ea status 800105ea
Oct  9 17:29:02 fwintern kernel:   4: @cbef4240  length 800005ea status 800105ea
Oct  9 17:29:02 fwintern kernel:   5: @cbef4250  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   6: @cbef4260  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   7: @cbef4270  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   8: @cbef4280  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   9: @cbef4290  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   10: @cbef42a0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   11: @cbef42b0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   12: @cbef42c0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   13: @cbef42d0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   14: @cbef42e0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel:   15: @cbef42f0  length 800005ea status 000105ea
Oct  9 17:29:02 fwintern kernel: eth0: Resetting the Tx ring pointer.
Oct  9 17:29:02 fwintern kernel: Packet log: input DENY eth1 PROTO=17 10.1.1.200:138 10.1.2.2:138 L=257 S=0x00 I=34892 F=0x0000 T=126 (#3)
Oct  9 17:29:12 fwintern kernel: eth0: transmit timed out, tx_status 00 status e601.
Oct  9 17:29:12 fwintern kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Oct  9 17:29:12 fwintern kernel:   Flags; bus-master 1, full 0; dirty 6543285 current 6543285.
Oct  9 17:29:12 fwintern kernel:   Transmit list 00000000 vs. cbef4250.
... repeating ...
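
For anyone who wants to cross-check the numbers in the dump, here is a
rough Python sketch that relates the dirty/current counters to the ring
entries. It is not taken from the driver. It assumes a 16-entry Tx ring
(the dump lists entries 0 to 15), 16-byte descriptors (the addresses step
by 0x10) and that the low 13 bits of the length word are the frame length.
The names RING_SIZE, DESC_BYTES, LEN_MASK and parse_dump are made up for
this illustration, and the meaning of the top status bit is deliberately
left open.

```python
import re

# Excerpt of the dump above, unwrapped and without the syslog prefix.
DUMP = """\
  Flags; bus-master 1, full 0; dirty 6543269 current 6543269.
  Transmit list 00000000 vs. cbef4250.
  0: @cbef4200  length 800005ea status 000105ea
  3: @cbef4230  length 800005ea status 800105ea
  4: @cbef4240  length 800005ea status 800105ea
  5: @cbef4250  length 800005ea status 000105ea
"""

RING_SIZE = 16      # assumption: the dump shows entries 0..15
DESC_BYTES = 0x10   # assumption: descriptor addresses step by 0x10
LEN_MASK = 0x1FFF   # assumption: low 13 bits hold the frame length

ENTRY_RE = re.compile(
    r"^\s*(\d+): @([0-9a-f]+)\s+length ([0-9a-f]+)\s+status ([0-9a-f]+)",
    re.M,
)

def parse_dump(text):
    """Return (dirty, current, list_ptr, {index: (addr, length, status)})."""
    dirty, current = map(
        int, re.search(r"dirty (\d+) current (\d+)", text).groups()
    )
    list_ptr = int(re.search(r"vs\. ([0-9a-f]+)", text).group(1), 16)
    entries = {
        int(i): (int(a, 16), int(l, 16), int(s, 16))
        for i, a, l, s in ENTRY_RE.findall(text)
    }
    return dirty, current, list_ptr, entries

dirty, current, list_ptr, entries = parse_dump(DUMP)

pending = current - dirty                       # counters are equal in the log
slot = current % RING_SIZE                      # ring slot the counters point at
slot_addr = entries[0][0] + slot * DESC_BYTES   # address of that slot
frame_len = entries[0][1] & LEN_MASK            # 0x5ea
top_bit_set = sorted(i for i, (_, _, st) in entries.items() if st & 0x80000000)

print(f"pending={pending} slot={slot} slot_addr={slot_addr:08x} "
      f"list_ptr={list_ptr:08x} frame_len={frame_len} top_bit_set={top_bit_set}")
```

With the values from the first timeout, the counters point at slot 5,
whose address is cbef4250, the same value printed after "vs." in the
"Transmit list" line. The second timeout shows counters exactly 16 higher
(6543285), which again is slot 5.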

The problem was reproducible (several times) with the same download (a
300MB file) after a reboot. 

There should be no IRQ collision on IRQ 9. From /proc/interrupts:

           CPU0       CPU1       
  0:    4444438    4611949    IO-APIC-edge  timer
  1:       2077       2163    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:     390236     391327   IO-APIC-level  eth1
  8:          2          0    IO-APIC-edge  rtc
  9:     435924     436646   IO-APIC-level  eth0
 10:         17         17   IO-APIC-level  ncr53c8xx
 11:      24525      24737   IO-APIC-level  ida0
 13:          0          0          XT-PIC  fpu
NMI:          0
ERR:          0

No other hardware is installed.
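
The "no collision" reading of the table can be checked mechanically. Below
is a small Python sketch (the helper names parse_interrupts and shared_irqs
are invented here) that parses the listing and reports any IRQ line with
more than one registered device. It assumes that handlers sharing a line
appear as a comma-separated list in the last column.

```python
# /proc/interrupts as quoted above.
SAMPLE = """\
           CPU0       CPU1
  0:    4444438    4611949    IO-APIC-edge  timer
  1:       2077       2163    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:     390236     391327   IO-APIC-level  eth1
  8:          2          0    IO-APIC-edge  rtc
  9:     435924     436646   IO-APIC-level  eth0
 10:         17         17   IO-APIC-level  ncr53c8xx
 11:      24525      24737   IO-APIC-level  ida0
 13:          0          0          XT-PIC  fpu
NMI:          0
ERR:          0
"""

def parse_interrupts(text):
    """Map IRQ number -> (per-CPU counts, controller type, device names)."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())          # header row: one column per CPU
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip the NMI/ERR summary rows
        fields = rest.split()
        counts = [int(x) for x in fields[:ncpu]]
        controller = fields[ncpu]
        devices = [d.strip() for d in " ".join(fields[ncpu + 1:]).split(",")]
        table[int(label)] = (counts, controller, devices)
    return table

def shared_irqs(table):
    """Return only the IRQ lines that list more than one device."""
    return {irq: devs for irq, (_, _, devs) in table.items() if len(devs) > 1}

table = parse_interrupts(SAMPLE)
print("IRQ 9:", table[9])
print("shared:", shared_irqs(table))
```

Note that this only sees devices whose drivers have registered a handler.
A card wired to the same line without a loaded driver would not show up
here.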

Another observation: Normally the free (non buffer) memory of that machine
is 3MB (of 192MB). At the point of the crash it was down to 1.7MB. I don't
know if this is the cause or just a symptom of the locked up ethernet
card.

Is this a known problem? If it's not a hardware/driver bug, I suspect an
SMP-related race condition. Maybe the BIOS set up the CPUs / IO-APICs
incorrectly? I should mention that now, two days later, I can no longer
reproduce the crash.

This might be due to two reasons: 1) Today the download from the web
server was measurably slower, about half the earlier rate (about 600kbit
on a 2Mb link). 2) The machine was power cycled. Unfortunately, the
personnel cannot remember whether the problem was reproducible only after
a warm reboot or also after a power cycle.

Finally: /proc/version:
Linux version 2.2.14 (root@fwintern) (gcc version 2.95.2 19991024
(release)) #4 SMP Thu Aug 3 20:32:32 CEST 2000

The recent discussion on linux-kernel seems to indicate that gcc 2.95
cannot compile a kernel at all. Could this be the cause here? (The system
had been running flawlessly since Aug 3.) Unfortunately, there is no other
gcc version available on that system, and the distributor (SuSE) seems to
deny any gcc 2.95 / kernel incompatibilities.

Please advise,
Michael.

--

Michael Weller: eowmob@exp-math.uni-essen.de, eowmob@ms.exp-math.uni-essen.de,
or even mat42b@spi.power.uni-essen.de. If you encounter an eowmob account on
any machine in the net, it's very likely it's me.

-
: send the line "unsubscribe linux-net" in
the body of a message to majordomo@vger.kernel.org

