On Wed, 2020-07-15 at 11:11 -0600, Chris Murphy wrote:
> Hi,
>
> While bad RAM is uncommon, it comes up with some regularity to cause
> folks a lot of grief. I'm wondering if there's a way to make it easier
> to get bad news :-\ In particular there are cases where RAM defects
> just don't show up with a few hours of memtest86+, it can take days of
> contiguous testing, which is so inconvenient the test itself seems
> worse.

An interesting feature many people don't know about is EDAC for ECC
RAM. When a memory error occurs, the kernel will log a message like:

  EDAC MC0: CE page 0x6ba7a, offset 0x800, grain 128, syndrome 0xf8, row 0, channel 0, label "": i3000 CE

and keep a running count (since boot) under
/sys/devices/system/edac/mc. You can track down errors to a specific
memory stick (if you have a secret decoder ring for your motherboard).

At a previous employer, we wrote a custom nagios plugin to monitor that
and alert us for errors on our servers.

For more info, see edac-util and edac-ctl from the edac-utils package
and:

https://buttersideup.com/mediawiki/index.php/Main_Page
https://www.kernel.org/doc/html/latest/driver-api/edac.html

Of course you need ECC RAM, but if you care about memory errors, you
should be using it anyway.

> Here's what I've got so far:
>
> 1. Fedora includes /boot/memtest86+-5.01 on every installation. But
> this is a legacy/BIOS program. The idea of recommending folks enable
> CSM/legacy BIOS just to test their RAM is questionable because it
> means disabling UEFI Secure Boot to do it. Lie in wait malware is
> perhaps rare but plausible. UEFI native memtest86+ is not free so it
> can't be included. I kinda wonder if including this should be
> deprecated?
>
> 2. The kernel has a built-in memory tester. Therefore it can run on
> anything. But how good is it? Is it worth enabling? Should it be
> enabled for all kernels or just debug kernels? The code is pretty
> simple, so will it catch only the worst cases of bad RAM?
> # CONFIG_MEMTEST is not set
> https://elixir.bootlin.com/linux/v5.8-rc4/source/mm/memtest.c
>
> 3. "memory interface test" used at Google, Apache 2.0 license
> https://github.com/stressapptest/stressapptest
>
> 4. "multiple concurrent kernel compiles" and "GCC seems to have memory
> usage patterns that reliably trigger memory errors that aren't caught
> by memtest"
> https://lore.kernel.org/linux-btrfs/799cf552-4612-56c5-b44d-59458119e2b0@xxxxxxxxx/
>
> Example of btrfs catching a bit flip:
> https://lore.kernel.org/linux-btrfs/f42fc0d6-5dc9-dd15-9d61-53efb04fad33@xxxxxxx/
> And also, this is not a good example of a memory tester. Some of the
> time the corruption happens before the csum is computed so, it's not
> going to catch everything.
>
> Any other ideas how to make this better?
>
> Thanks,
> --
> Chris Murphy

-- 
Ken Gaillot <kgaillot@xxxxxxxxxx>
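
The EDAC counters described above can be polled with a small script. The
following is only a minimal sketch of a Nagios-style check, not the
plugin mentioned in the message. It assumes each memory controller
directory (mc0, mc1, ...) under /sys/devices/system/edac/mc exposes
ce_count and ue_count files, as described in the kernel EDAC
documentation. The function names and the policy (WARNING on any
corrected error, CRITICAL on any uncorrected error) are illustrative
choices, not something taken from the thread.

```python
#!/usr/bin/env python3
"""Nagios-style check of EDAC error counters (illustrative sketch)."""
from pathlib import Path

# Location named in the message; counters are cumulative since boot.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_counts(root):
    """Return {controller: (ce_count, ue_count)} for each mcN under root."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def evaluate(counts):
    """Map counters to a (status, message) pair.

    Assumed policy: any uncorrected error is CRITICAL, any corrected
    error is WARNING, and no readable controller is UNKNOWN.
    """
    if not counts:
        return UNKNOWN, "UNKNOWN: no EDAC memory controllers found"
    detail = ", ".join(
        f"{name}: CE={ce} UE={ue}" for name, (ce, ue) in sorted(counts.items())
    )
    if any(ue for _, ue in counts.values()):
        return CRITICAL, f"CRITICAL: uncorrected memory errors ({detail})"
    if any(ce for ce, _ in counts.values()):
        return WARNING, f"WARNING: corrected memory errors ({detail})"
    return OK, f"OK: no memory errors ({detail})"


def main(root=EDAC_MC_ROOT):
    """Print the one-line status message and return the exit code."""
    status, message = evaluate(read_counts(root))
    print(message)
    return status
```

To run it as a plugin, end the script with `raise SystemExit(main())` so
the monitoring system sees the status as the process exit code. On a
machine without ECC RAM or without a loaded EDAC driver, the check
reports UNKNOWN rather than OK, so a missing driver is not mistaken for
healthy memory.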