To follow up on my own message:

Today I managed to "prove" it's the hp100.c driver that's the cause of the
errors below. When the machine started to become slow again, I quickly
managed to type: "ifconfig eth0 down ; rmmod hp100" and instead of the
normal slow-to-death path, the machine instantly became energetic again.

Guess the hp100.c driver isn't SMP safe. Then again, it uses jiffies loops,
so what can you expect :)

Paul, who should really just junk HP100VG Anylan and get a real 100Mb network.

---------- Forwarded message ----------
Date: Thu, 31 Aug 2000 16:23:08 +0200 (MET DST)
From: Paul Wouters <paul@xtdnet.nl>
To: linux-smp@vger.kernel.org, linux-net@vger.kernel.org
Subject: SMP/Network related oops (2.2.16)
Message-ID: <Pine.LNX.4.21.0008311618300.5706-100000@duplo.xtdnet.nl>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Content-ID: <Pine.LNX.4.21.0008311618302.5706@duplo.xtdnet.nl>

Hi,

Lately I've been updating our SMP machine, and alongside it built a second
SMP machine. The first one, apart from a "stuck on TLB" glitch two months
ago, never crashed.

Lately, some changes have been made. One is that they now both run
Helix-gnome 1.2 with updates and a distributed.net client, along with of
course the Red Hat 6.2 updates. Both machines have become highly unstable
when running X on them. But that could just be a manifestation of the extra
load the machines receive when it runs. I hardly believe the Helix binaries
are the cause here.

All crashes so far showed no log entries whatsoever. The machine would
suddenly become extremely slow, and in a matter of 3-5 seconds, the mouse
would freeze along with the entire machine.

Today, I managed to get a log entry, though ksymoops can't seem to read it
(and I can't read/match the symbols for some odd reason).

Aug 31 16:28:49 dupla kernel:
Aug 31 16:28:49 dupla kernel: wait_on_bh, CPU 0:
Aug 31 16:28:49 dupla kernel: irq: 1 [0 1]
Aug 31 16:28:49 dupla kernel: bh: 1 [0 1]
Aug 31 16:29:20 dupla kernel: <[c010be9d]> <[c0169cc2]> <[c0169d3d]> <[c017990d]> <[c0151d6f]> <[c013496b]> <[c0134ac7]> stuck on TLB IPI wait (CPU#0)
Aug 31 16:29:20 dupla kernel: stuck on TLB IPI wait (CPU#0)
Aug 31 16:29:20 dupla kernel: stuck on TLB IPI wait (CPU#0)

After three of these, a fourth one happened on CPU#1, then it continued on
CPU#0 again. This time I had managed to switch back to console mode just
before the system froze completely, and managed to use SysRq-r to remount
ro and SysRq-b to reboot the machine.

Ksymoops said:

Warning (Oops_read): Code line not seen, dumping what data is available

Trace; c010be9d <synchronize_bh+3d/50>
Trace; c0169cc2 <tcp_listen_poll+12/50>
Trace; c0169d3d <tcp_poll+3d/100>
Trace; c017990d <inet_poll+21/2c>
Trace; c0151d6f <sock_poll+1f/24>
Trace; c013496b <do_poll+7b/dc>
Trace; c0134ac7 <sys_poll+fb/17c>

819 warnings and 1 error issued. Results may not be reliable.

The network card is an HP 100VG Anylan (driver hp100.o).

If needed, I can provide access (including root) on the spare dual CPU
machine. This machine is an Asus P2L97-DS, with two P-II Deschutes at
333MHz. CPU#0 is stepping 0, CPU#1 is stepping 2.

As I said, we have two dual CPU systems. The other one has the same
symptoms, but is an Asus P2B-DS with two identical P-III Katmais at 450MHz,
stepping 7. But I've never managed to get a log entry on that one. And
since it's a production machine, I'm no longer running X on it [1].

Paul Wouters
Xtended Internet

[1] I felt really awful running X on the NIS master to begin with :)

-- 
Broerdijk 27       Postbus 170        Tel: 31-24-360 39 19
6523 GM Nijmegen   6500 AD Nijmegen   Fax: 31-24-360 19 99
The Netherlands    The Netherlands    info@xtdnet.nl