Mr. James W. Laferriere wrote:
One more thought below...
On Mon, 23 Jul 2007, Bill Davidsen wrote:
Mr. James W. Laferriere wrote:
Hello Andrew ,
On Tue, 17 Jul 2007, Andrew Burgess wrote:
The MCEs have been ongoing for some time . I have replaced every item
in the system except the chassis & scsi backplane & power supply (750 Watts) .
Everything . MB, cpu, memory, scsi controllers, ...
These MCEs only happen when I am trying to build or bonnie++ test the
md3 . It consists of (now 7+1 spare) 146GB drives in the SuperMicro
SYS-6035B-8B's backplane attached to a LSI22320 .
Probably every old timer has a story about chasing a hardware problem
where changing the power supply finally fixed it. I keep spares now.
If an MCE (which means bad cpu) doesn't go away after changing the cpu
it would either have to be temperature, power or a bug in the MCE
code.
What else could it be?
Thank you for the idea of 'changing out the PS' . So I did it a
bit differently . I removed the system PS from the raid backplane &
dropped in a known good ps of proper wattage & re-tested . But left
the system's ps attached to only the MB & fans .
It doesn't appear to be power load related . I tried rebuilding
my 7 disk raid6 array & I got the same thing , MCE .
Now the raid backplane is still in the air stream in front of
the cpu's and memory slots . So it could be a marginal cpu or
memory stick .
But here's the clincher , when I don't use the two drives in
front of the PS & cpu & memory slots , the array completes its
resync . So I'm back to testing memory (again) . If that passes
then I'll try the new cpu(s) route .
It does sound like a cooling problem, which does not have to imply
the overheated parts are bad, although that may be true.
Fyi , memtest86+ @ 19 passes (~ 52hours) on 8GB of memory , no
errors .
Could be the total number of i/o in flight, etc.
Hmmm , I didn't think of this one .
Those are a PITA to find if that's it. It doesn't sound likely to be the
power supply, but as an unlikely but cheap test, have you reseated the p/s to
backplane connectors? Oh, and checked that the system board is grounded
to the case?
Have you tried dropping two other drives?
Well , no . I dropped those two in front of the CPU as a test in
working my way up the scsi backplane (BP) trying to find a point that
worked & the last two drives in the BP just happened to be in front of
the cpu/memory air path . The minute I put those in the MD build tree ,
within the usual time frame I get a MCE . What I haven't tried is
what you are probably suggesting : make sure it is the drives in the air
path by putting them in the MD build and leaving another two out .
I'll try that as well .
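The control test described above (keep the two air-path drives in the
array and leave a different pair out) can be listed mechanically. This is
only a sketch: the slot labels, the choice of which two slots sit in the
cpu/memory air path, and the helper name control_trials are all
hypothetical and not taken from the actual system.

```python
from itertools import combinations


def control_trials(slots, suspect):
    """Yield (left_out, in_array) pairs that keep every suspect slot in the array.

    slots   -- all drive slot labels on the backplane (hypothetical labels)
    suspect -- the slots believed to sit in the cpu/memory air path
    """
    others = [s for s in slots if s not in suspect]
    # Leave out every possible pair of non-suspect drives in turn.
    for left_out in combinations(others, 2):
        in_array = [s for s in slots if s not in left_out]
        yield left_out, in_array


if __name__ == "__main__":
    slots = [f"slot{i}" for i in range(8)]   # assumed: 8 bays, labels invented
    suspect = {"slot6", "slot7"}             # assumed: last two bays are in the air path
    for left_out, in_array in control_trials(slots, suspect):
        print("leave out", left_out, "-> build with", in_array)
```

If the MCE shows up in these trials but not when the two air-path drives
are the ones left out, that points at position (airflow) rather than at
the drive count or total i/o in flight.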
Can you put in a bit more fan?
Nope , It's maxed out . sounds like a 747 on take off as it is .
It's a supermicro SYS-6035B-8B if you have the time to go look at
the specs & pics .
What I was thinking is that some of my cases actually have room to
install fans in front of the drives, allowing push as well as pull.
Haven't had to do it in several years, but looking at my tall tower
cases, I believe I could.
Read the system board and CPU temps with the "sensors" package?
Not yet , I am building the needed items into the kernel now .
Will report back (hopefully) sometime this weekend .
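Once the sensor drivers are in the kernel, the temperatures can also be
logged during a resync without the "sensors" tool. The sketch below
assumes the drivers expose the usual hwmon sysfs layout
(/sys/class/hwmon/hwmon*/temp*_input, values in millidegrees Celsius); on
some kernels the files sit one level deeper under device/, so the glob
pattern may need adjusting. The function name read_hwmon_temps is made up
for this example.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/tempN_input": degrees_C} for every readable hwmon temperature."""
    readings = {}
    for temp_file in sorted(Path(root).glob("hwmon*/temp*_input")):
        chip_dir = temp_file.parent
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver provides no "name" file.
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
        readings[f"{chip}/{temp_file.name}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for label, celsius in read_hwmon_temps().items():
        print(f"{label}: {celsius:.1f} C")
```

Running it in a loop while the array rebuilds, once with and once without
the two air-path drives, would show whether the cpu/memory temperatures
climb before the MCE.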
Keep us posted, you have picked the low-hanging fruit, when you find out
what causes this I'm sure it will be something interesting.
--
bill davidsen <davidsen@xxxxxxx>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979