Jack Howarth wrote:
Well, it appears that while memtest86+ may be able to disable the ECC
enabled by the BIOS during its default testing mode, memtest86+ can't
enable ECC if it is disabled in the BIOS. FYI.
I noticed that one of my machines with ECC RAM, which happens to have
the Intel 865p ECC chipset, immediately began displaying these errors
when upgraded to FC5. Following the information at
http://buttersideup.com/edacwiki/WhyAmIgettingMemoryErrors I tried:
1. ran memtest for 60 hours one weekend (but did not know to enable
the ECC test): OK
2. removed one stick of RAM: UE (uncorrectable errors, but fewer)
3. other stick: UE (but fewer)
4. both sticks in different (allowed) slots: UE (1 to 3 a second)
5. ECC disabled in BIOS: OK
6. ECC re-enabled in BIOS, spread spectrum off: UE
7. made a few guesses at memory timing: UE, no change
8. ran memtest for a few minutes with ECC test on: UEs shown
As I understand it, the EDAC module is simply reading the ECC results
from the chipset that talks to the RAM. It is difficult to decide
whether the cause is:
a. RAM that is faulty in its data values
b. RAM that is faulty in its parity values
c. a problem with the board design
d. a problem with the memory chipset design
etc.
I guess a good bet would be to install a different flavour of ECC RAM,
or to try more pessimistic memory timings.
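When comparing RAM sticks, slots or timings, it can help to read the
EDAC counters directly rather than watching the log. The following is
only a minimal sketch: it assumes the EDAC driver exposes per-controller
ce_count and ue_count files under /sys/devices/system/edac/mc, which may
differ on your kernel. The function name read_edac_counts is my own
invention, not part of any EDAC tool.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}}.

    'root' is an assumed sysfs location; pass another path if your
    kernel exposes the EDAC counters somewhere else.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or the path differs on this system.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, c in read_edac_counts().items():
        print(mc, "corrected:", c["ce"], "uncorrected:", c["ue"])
```

Running it before and after each change (one stick, other slots,
different timings) gives a number to compare instead of a rough
errors-per-second estimate.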
I think the error correction is done at the chipset level (the ECC RAM
just provides the extra storage bit per byte that is needed to
implement the code). Perhaps what is happening is that the chipset
detects and fixes single-bit errors, which would mean no data loss or
corruption. The difference is that now we know that the RAM itself is
faulty (UE errors in FC5), but at the moment few enough bits are bad at
once that ECC can fix the errors - pretty cool. This would be why
memtest can't detect a problem: the chipset has detected the error and
fixed it before presenting the data to memtest... [sorry, I just nutted
this out myself :~)]
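To illustrate the difference between an error the ECC logic can fix and
one it can only report, here is a toy single-error-correct,
double-error-detect (SECDED) code. It is a sketch under loud
assumptions: it protects 4 data bits with 4 check bits using an extended
Hamming code, whereas real ECC memory protects much wider words, and I
do not know which code this particular chipset uses. The names encode
and decode are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the
    classic Hamming positions (1, 2, 4 are check bits).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 when valid
    parity_bad = sum(code) % 2 == 1
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: treat as one bad bit and repair it.
        code[syndrome] ^= 1  # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        # Syndrome set but overall parity fine: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_bad = list(word)
    one_bad[5] ^= 1
    two_bad = list(one_bad)
    two_bad[2] ^= 1
    print(decode(word), decode(one_bad), decode(two_bad))
```

With one flipped bit the reader of the data still sees the right value,
which matches the idea that a memory tester would not notice the fault
while correction is active. With two flipped bits the decoder can only
flag the word as bad, which is the uncorrectable case.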
DaveT.
--
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list