hp100.c is the cause, was: SMP/Network related oops (2.2.16) (fwd)

Paul Wouters <paul@xtdnet.nl> · Tue, 5 Sep 2000 12:01:10 +0200 (MET DST)

To follow up on my own message: today I managed to "prove" that the hp100.c
driver is the cause of the errors below. When the machine started to become
slow again, I quickly managed to type "ifconfig eth0 down ; rmmod hp100", and
instead of following the normal slow-to-death path, the machine instantly
became energetic again.

I guess the hp100.c driver isn't SMP safe. Then again, it uses jiffies loops,
so what can you expect :)

Paul, who should really just junk the HP 100VG-AnyLAN and get a real 100Mb
network.

---------- Forwarded message ----------
Date: Thu, 31 Aug 2000 16:23:08 +0200 (MET DST)
From: Paul Wouters <paul@xtdnet.nl>
To: linux-smp@vger.kernel.org, linux-net@vger.kernel.org
Subject: SMP/Network related oops (2.2.16)
Message-ID: <Pine.LNX.4.21.0008311618300.5706-100000@duplo.xtdnet.nl>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Content-ID: <Pine.LNX.4.21.0008311618302.5706@duplo.xtdnet.nl>

Hi,

Lately I've been updating our SMP machine, and alongside that built a second
SMP machine. The first one, apart from a "stuck on TLB" glitch two months ago,
never crashed.
Recently, some changes were made. Both machines now run Helix GNOME 1.2 with
updates and a distributed net client, along with of course the Red Hat 6.2
updates.

Both machines have become highly unstable when running X on them. But that
could just be a manifestation of the extra load the machines receive when
X runs. I doubt the Helix binaries are the cause here.

All crashes so far showed no log entries whatsoever. The machine would suddenly
become extremely slow, and in a matter of 3-5 seconds, the mouse would freeze
along with the entire machine. Today, I managed to get a log entry, though
ksymoops can't seem to read it (and I can't read/match the symbols for some
odd reason).

Aug 31 16:28:49 dupla kernel:
Aug 31 16:28:49 dupla kernel: wait_on_bh, CPU 0:
Aug 31 16:28:49 dupla kernel: irq:  1 [0 1]
Aug 31 16:28:49 dupla kernel: bh:   1 [0 1]
Aug 31 16:29:20 dupla kernel: <[c010be9d]> <[c0169cc2]> <[c0169d3d]> <[c017990d]> <[c0151d6f]> <[c013496b]> <[c0134ac7]> stuck on TLB IPI wait (CPU#0)
Aug 31 16:29:20 dupla kernel: stuck on TLB IPI wait (CPU#0)
Aug 31 16:29:20 dupla kernel: stuck on TLB IPI wait (CPU#0)

After three of these, a fourth one happened on CPU#1, then it continued on
CPU#0 again. This time I managed to switch back to console mode just before
the system froze completely, and used SysRq-r to remount ro and SysRq-b to
reboot the machine.

Ksymoops said:

Warning (Oops_read): Code line not seen, dumping what data is available

Trace; c010be9d <synchronize_bh+3d/50>
Trace; c0169cc2 <tcp_listen_poll+12/50>
Trace; c0169d3d <tcp_poll+3d/100>
Trace; c017990d <inet_poll+21/2c>
Trace; c0151d6f <sock_poll+1f/24>
Trace; c013496b <do_poll+7b/dc>
Trace; c0134ac7 <sys_poll+fb/17c>

819 warnings and 1 error issued.  Results may not be reliable.
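
For readers who want to cross-check the trace by hand, here is a minimal
sketch of the address-to-symbol lookup that ksymoops performs. It is not
ksymoops itself. The symbol start addresses in the table are not taken from a
real System.map; they are back-computed from the ksymoops output above (trace
address minus the reported offset), and the names SYMBOLS, extract_addresses
and resolve are made up for this illustration.

```python
import bisect
import re

# Assumption: start addresses derived from the ksymoops trace above
# (address minus offset). A real lookup would read them from System.map.
SYMBOLS = sorted([
    (0xc010be60, "synchronize_bh"),
    (0xc01348f0, "do_poll"),
    (0xc01349cc, "sys_poll"),
    (0xc0151d50, "sock_poll"),
    (0xc0169cb0, "tcp_listen_poll"),
    (0xc0169d00, "tcp_poll"),
    (0xc01798ec, "inet_poll"),
])
STARTS = [start for start, _ in SYMBOLS]

# The call-trace line as it appeared in the kernel log.
LOG = ("<[c010be9d]> <[c0169cc2]> <[c0169d3d]> <[c017990d]> "
       "<[c0151d6f]> <[c013496b]> <[c0134ac7]> stuck on TLB IPI wait (CPU#0)")


def extract_addresses(log_line):
    """Return every <[xxxxxxxx]> address in the line as an integer."""
    return [int(h, 16) for h in re.findall(r"<\[([0-9a-fA-F]{8})\]>", log_line)]


def resolve(addr):
    """Map an address to 'symbol+offset' using the nearest symbol at or below it."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None  # address lies below every known symbol
    start, name = SYMBOLS[i]
    return f"{name}+{addr - start:x}"


if __name__ == "__main__":
    for addr in extract_addresses(LOG):
        print(f"{addr:08x} <{resolve(addr)}>")
```

With a complete symbol table the same lookup would also tell you when an
address falls outside any known function, which this seven-entry table cannot
do.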

The network card is an HP 100VG-AnyLAN (driver hp100.o).

If needed, I can provide access (including root) on the spare dual CPU 
machine.

This machine is an Asus P2L97-DS, with two P-II Deschutes at 333 MHz. CPU#0 is
stepping 0, CPU#1 is stepping 2.

As I said, we have two dual CPU systems. The other one has the same symptoms,
but is an Asus P2B-DS with two identical P-III Katmais at 450 MHz, stepping 7.
But I've never managed to get a log entry on that one. And since it's a
production machine, I'm no longer running X on it [1].

Paul Wouters
Xtended Internet

[1] I felt really awful running X on the NIS master to begin with :)
--
Broerdijk 27			Postbus 170		Tel: 31-24-360 39 19	
6523 GM Nijmegen		6500 AD Nijmegen	Fax: 31-24-360 19 99
The Netherlands			The Netherlands		info@xtdnet.nl
