On Wed, 16 Oct 2013 20:47:05 +0000
"Luck, Tony" <tony.luck@xxxxxxxxx> wrote:

> > Also, I suspect that, if an error happens to affect more than one DIMM
> > (e.g. part of the location is not available for a given error),
> > that the DIMM label will also not be properly shown.
>
> There are a couple of cases here:
>
> 1) There are a number of DIMMs behind some flaky h/w that introduces errors
> that are apparently blamed onto each of those DIMMs.
>
> All we can do here is statistical correlations ... each error is reported independently,
> it is up to some entity to notice the higher level topology connection. There is enough
> information in the UEFI error record to do that (assuming that BIOS filled out the
> necessary fields).
>
> 2) There is a single reported error that spans more than one DIMM.
>
> This can happen with a UC error in a pair of lock-step DIMMs. Since the error is UC
> we know that two (or more) bits are bad. But we have no way to tell whether the
> bad bits came from the same DIMM, or one bit from each (because we don't know
> which bits are bad - if we knew that, we could fix them :-) The eMCA case should
> log two subsections in this case - one for each of the lockstep DIMMs involved. A user
> seeing this should probably just replace both DIMMs to be safe. If they wanted to
> diagnose further they should swap DIMMs around so this pair are no longer lockstepped
> and see if they start seeing correctable errors from each of the split pair - or if the UC
> errors move with one or the other of the DIMMs

There's also a third case: mirrored memories.

As a matter of coherency with hw-based reports, for cases (2) and (3),
the error tracing should display both memories that are affected
by a UC error (or by a CE on a mirrored address space).
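To make the reporting idea concrete, here is a rough sketch in C of a trace helper that lists every DIMM attached to a single error event instead of only the first one. The struct, the function name, the label strings and the " or " separator are all made up for illustration; this is not the EDAC or eMCA code, just one way the output could be built.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ERR_DIMMS 4	/* arbitrary limit for this sketch */

/* Hypothetical event: one reported error, possibly spanning several DIMMs. */
struct mem_err_event {
	int n_dimms;				/* DIMMs named by the record */
	const char *label[MAX_ERR_DIMMS];	/* one label per affected DIMM */
};

/*
 * Write all affected DIMM labels into buf, separated by " or ".
 * Returns 0 on success, -1 if the event is malformed or buf is too small.
 */
static int format_dimm_labels(const struct mem_err_event *ev,
			      char *buf, size_t len)
{
	size_t used = 0;
	int i;

	if (len == 0 || ev->n_dimms < 1 || ev->n_dimms > MAX_ERR_DIMMS)
		return -1;

	buf[0] = '\0';
	for (i = 0; i < ev->n_dimms; i++) {
		int n = snprintf(buf + used, len - used, "%s%s",
				 i ? " or " : "", ev->label[i]);

		/* snprintf reports the length it wanted; detect truncation */
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

With a lockstep pair (case 2) or a mirrored pair (case 3), the event would carry two labels and the trace line would name both, which is the behaviour the hw-based reports already have.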
Regards,
Mauro