2.6.9 Cobalt Tulip lockups.

"Habeeb J. Dihu" <macgyver@xxxxxxx> · Mon, 27 Dec 2004 17:12:50 -0600

Firstly...hope everyone enjoyed (or is still enjoying) their holidays.

Just joined the mailing list.  Apologies for the longish post, but I wanted
to provide as much information as possible.

I've got a couple of Cobalt boxes (a RaQ2 and a Qube2) that are successfully
running Debian (as well as Gentoo) on a kernel built from the 2.6.9 branch
of CVS plus Peter Horton's patches.  The machines are quite stable, and I
hadn't run into any issues until I started stress-testing networking.

Under high network loads when connected at 100TX FDX, I can always get the
Cobalts to lock up.  Under anything slower -- 100TX HDX, 10BaseT FDX,
10BaseT HDX -- everything works fine.  The lockup needs both of:

1.  Lots of data transfer to/from the Cobalt.
2.  Running at 100TX FDX.

The test scenario is:

Cobalt running 2.6.9CVS + PH's patches.  Cobalt is running an NFS server.
From another machine on the network (also running at 100TX FDX), mount the
NFS export and copy something huge (in my case a directory that has about
2GB worth of files in it).

The copy usually transfers anywhere from 300MB to 800MB of data before the
Cobalt just locks up -- no kernel panic, just a hard lockup that
necessitates cycling power manually.

At first I thought it might be NFS, so I tried something way less kernel
dependent like FTP and still had the same problems.  I also recompiled the
kernel with:

CONFIG_TULIP_MWI=n
CONFIG_TULIP_MMIO=n
CONFIG_TULIP_NAPI=n
CONFIG_TULIP_NAPI_HW_MITIGATION=n

and again with all four options set to y, to no avail.

As a last resort I turned on lots of debugging output (I set tulip_debug to
99) and finally I got something usable from the kernel:

eth0: MII status 782d, Link partner report 45e1.
eth0: 21143 negotiation status 000000c6, MII.
Badness in local_bh_enable at kernel/softirq.c:141
Call Trace: [<800b32c8>]  [<80084e28>]  [<80397ee8>]  [<80397f08>]
[<80398af4>]  [<8029a374>]  [<800b87ac>]  [<800ad20c>]  [<8029ccbc>]
[<802bc8b0>]  [<802bc918>]  [<802bcba8>]  [<802575bc>]  [<8027b370>]
[<800abe34>]  [<800abe34>]  [<800b3168>]  [<800abebc>]  [<800abf80>]
[<800ac6b8>]  [<8022e900>]  [<800ac34c>]  [<8022e900>]  [<800ac278>]
[<800ac174>]  [<80279dec>]  [<80279980>]  [<800b8954>]  [<802b9458>]
[<800b3168>]  [<80084808>]  [<800b3208>]  [<80084e18>]  [<80082908>]
[<802dc1bc>]  [<80083180>]  [<802d89d8>]  [<8030ffdc>]  [<80303260>]
[<802da298>]  [<80084e28>]  [<8029afb0>]  [<8029b330>]  [<80138718>]
[<802dad80>]  [<80134538>]  [<800a4198>]  [<802d07c0>]  [<803031ec>]
[<8030ffdc>]  [<80214364>]  [<80295864>]  [<800a7440>]  [<8030ffdc>]
[<80214364>]  [<80295894>]  [<80295864>]  [<8029b7b4>]  [<80303ab8>]
[<80303aa0>]  [<8020e0d4>]  [<800a7440>]  [<802120f4>]  [<800a40c8>]
[<80310070>]  [<80398698>]  [<80398840>]  [<8029ca50>]  [<801d6f8c>]
[<8029cd50>]  [<8039918c>]  [<801cab18>]  [<8039b554>]  [<8039c8a4>]
[<801d8068>]  [<80397740>]  [<800bdbe8>]  [<801c7398>]  [<801c7274>]
[<801c70cc>]  [<80086070>]  [<80086060>] 

I'd already deduced that it was probably a problem related to interrupts
(seems we have a lot of those issues on our lovely blue boxes).  Looking at
the relevant line in kernel/softirq.c yields:

void local_bh_enable(void)
{
        __local_bh_enable();
        WARN_ON(irqs_disabled());
        if (unlikely(!in_interrupt() &&
                     local_softirq_pending()))
                invoke_softirq();
        preempt_check_resched();
}
EXPORT_SYMBOL(local_bh_enable);

So it's clear that something is calling local_bh_enable() while interrupts
are disabled, which is exactly what the WARN_ON(irqs_disabled()) check is
there to catch.  I can recreate this problem at will, so it's definitely
reproducible.  I've really taken this as far as I can in terms of debugging
the problem on my own.  I'd appreciate any/all assistance/direction in how
to track down the culprit here and fix the problem.

Thanks,

Habeeb.