Hi Brecht,
On Wed, 28 Mar 2007, Brecht Vermeulen wrote:
we are running multiple systems with the same motherboard and NICs and get
the same problems under heavy load, e.g. with rsync and network block
device. I have already debugged the hell out of it with all the options of
the NICs (offloading on/off, ASF, jumbo frames and normal frames, ...) and
with 32-bit and 64-bit kernels, but could always get network lockups,
sometimes only after 4 hours of heavy load. I also sometimes got MCE
errors, but machines without those errors locked up as well.
Sorry to hear you're in the same boat.
We're still having the problem, but haven't had a chance recently to take
another look. I'm hoping to before the end of this week.
Were you able to notice any difference between having ASF enabled vs. ASF
disabled? We noticed that the driver could reset the adapter with ASF
disabled (I don't know how consistently this could happen), but seemed to
NOT be able to reset with ASF enabled.
We're also having trouble with MCEs on other systems (bad memory), after
which our compute nodes (also H8SSL-i's) start spraying invalid crap onto
the network (after a crash, attach another system w/ a crossover cable and
watch from that machine: byte counters increase, packet counters do
not).
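
For anyone who wants to repeat the counter check above, here is a minimal
sketch in Python. It assumes the watching machine is Linux and that the
receive counters under /sys/class/net/<iface>/statistics are the ones of
interest; the function names and the interface name are only illustrative,
not anything from our actual setup.

```python
from pathlib import Path


def read_counters(iface, base="/sys/class/net"):
    """Read the receive byte and packet counters for one interface from sysfs."""
    stats = Path(base) / iface / "statistics"
    return {
        name: int((stats / name).read_text().strip())
        for name in ("rx_bytes", "rx_packets")
    }


def bytes_without_packets(before, after):
    """True if bytes were received but the packet counter did not move."""
    return (
        after["rx_bytes"] > before["rx_bytes"]
        and after["rx_packets"] == before["rx_packets"]
    )
```

Take one read_counters("eth0") sample, wait a few seconds, take a second
one, and pass both to bytes_without_packets(); "eth0" is an assumed
interface name.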
So, I guess there is something wrong with that motherboard (not sure if
it's only the NICs, only the motherboard, or the combination of both).
I'll bring you into a conversation I'm having with someone from SuperMicro
in another email thread.
For one of our production servers, we've put a 32-bit Intel NIC in a PCI
slot and it is stable now (although 1Gb/s is out of sight :-( ).
We're trying to avoid having to do this.
I'll send the other email shortly.
Thanks!
Paul
Paul Armor wrote:
Hi,
On Tue, 13 Mar 2007, Neil Horman wrote:
I'll summarize what our problems and configs are.
Problems - lockups on ethernet controllers under heavy NFS loads
(sometimes driver can/will reset, sometimes not)
systems completely lock up
Hardware - Supermicro H8SSL-i with onboard Broadcom 5704's (both clients
and servers)
Server config - 2.6.19 kernel (thus tg3 ver 3.69)
nfs-utils-1.0.7-13 FC4
NIC running at 4500 MTU
What on earth is that? I assume you are configured for jumbo frames
through your whole network, but why not bump your mtu all the way up
to 9000 then?
Yes, we're configured to allow up to 9000 MTU, but we're using 4500 as
that was the intersection of performance with regards to switch topology
(don't ask), cpu overhead with the tg3 driver (in 2.6.11, at least), and
throughput (using a variety of canned benchmarky things).
Does the problem persist if you only use a 1500 byte MTU?
Don't know, we're theoretically in production mode (when the machines
are all up at the same time).
Failure caused by users building software in automounted FS's.
Can you get a sysrq-t when the system locks up?
Will try the next time it craps out, and I can still get console access.
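
A rough sketch of how one might check ahead of time that a sysrq-t dump
will be permitted from the console, assuming the standard procfs paths
(/proc/sys/kernel/sysrq and /proc/sysrq-trigger) and the documented
bitmask in which 0x8 covers debugging dumps of processes. The function
names are made up for illustration, and the bitmask handling should be
checked against the running kernel's sysrq documentation.

```python
from pathlib import Path

# Bit assumed (from the kernel sysrq documentation) to enable debugging
# dumps of processes, which includes the 't' task-state dump.
SYSRQ_ENABLE_DUMP = 0x8


def task_dump_allowed(sysrq_value):
    """0 disables SysRq, 1 enables everything, larger values are a bitmask."""
    return sysrq_value == 1 or bool(sysrq_value & SYSRQ_ENABLE_DUMP)


def read_sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Return the current kernel.sysrq value as an integer."""
    return int(Path(path).read_text().strip())


def trigger_task_dump(path="/proc/sysrq-trigger"):
    """Request a task-state dump (same as sysrq-t); needs root."""
    Path(path).write_text("t")
```

If task_dump_allowed(read_sysrq_setting()) is false, the keyboard combination
will likely be ignored at lockup time. trigger_task_dump() is only useful
while the machine still responds; the output goes to the kernel log.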
Thanks,
Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462 +
+ U. of W. - Milwaukee +
+ PO Box 413 414-229-2677 +
+ Milwaukee, WI 53201 fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++