Re: memory testing

On Wed, 2020-07-15 at 11:11 -0600, Chris Murphy wrote:
> Hi,
> 
> While bad RAM is uncommon, it comes up with some regularity to cause
> folks a lot of grief. I'm wondering if there's a way to make it easier
> to get bad news :-\ In particular there are cases where RAM defects
> just don't show up with a few hours of memtest86+, it can take days of
> contiguous testing, which is so inconvenient the test itself seems
> worse.

An interesting feature many people don't know about is EDAC for ECC
RAM. When a memory error occurs, the kernel will log a message like:

EDAC MC0: CE page 0x6ba7a, offset 0x800, grain 128, syndrome 0xf8, row 0, channel 0, label "": i3000 CE

and keep a running count (since boot) under
/sys/devices/system/edac/mc. You can track errors down to a specific
memory stick (if you have a secret decoder ring for your motherboard,
i.e. you know which row and channel correspond to which DIMM slot).
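
As a rough illustration of what is in that message, here is a minimal
Python sketch that pulls the fields out of a correctable-error line like
the one quoted above. The regular expression only covers that one layout
(other kernels and drivers log different ones), the function name
parse_edac_ce is made up for this example, and the 4 KiB page size used
to compute a physical address is an assumption.

```python
import re

# Matches only the older correctable-error (CE) layout quoted above.
# Other kernels and drivers log different layouts, so this is illustrative.
EDAC_CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE "
    r"page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), "
    r"grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-fA-F]+), "
    r"row (?P<row>\d+), "
    r"channel (?P<channel>\d+), "
    r'label "(?P<label>[^"]*)"'
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_edac_ce(line):
    """Return the fields of one CE message as a dict, or None if no match."""
    line = " ".join(line.split())  # undo line wrapping from mail or logs
    m = EDAC_CE_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "label": m.group("label"),
        # Physical address of the error, under the page-size assumption.
        "address": page * PAGE_SIZE + offset,
    }


sample = ('EDAC MC0: CE page 0x6ba7a, offset 0x800, grain 128, '
          'syndrome 0xf8, row\n0, channel 0, label "": i3000 CE')
info = parse_edac_ce(sample)
print(info["mc"], info["row"], info["channel"], hex(info["address"]))
```

The row and channel values are what you would look up against the
motherboard's slot layout. The label field is empty here because no DIMM
labels were registered for that board.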

At a previous employer, we wrote a custom Nagios plugin to monitor those
counts and alert us to errors on our servers.
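
That plugin isn't available to share, so what follows is not it, just a
minimal Python sketch of how such a check could look. It sums the
ce_count and ue_count files of each mcN directory under
/sys/devices/system/edac/mc and maps the result to the usual Nagios
status codes. The policy (any uncorrectable error is CRITICAL, any
correctable error is WARNING) and the names check_edac and read_count
are assumptions made for the example.

```python
import glob
import os

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_count(path):
    """Read one integer counter file; treat missing or bad files as 0."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def check_edac(base="/sys/devices/system/edac/mc"):
    """Return (status_code, message) from the EDAC counters under base."""
    controllers = sorted(glob.glob(os.path.join(base, "mc[0-9]*")))
    if not controllers:
        # No EDAC driver loaded, or no ECC memory controller found.
        return UNKNOWN, "EDAC UNKNOWN: no memory controllers under " + base
    ce = sum(read_count(os.path.join(mc, "ce_count")) for mc in controllers)
    ue = sum(read_count(os.path.join(mc, "ue_count")) for mc in controllers)
    summary = "%d uncorrectable, %d correctable since boot" % (ue, ce)
    # Assumed policy: UE is critical, CE is a warning.
    if ue > 0:
        return CRITICAL, "EDAC CRITICAL: " + summary
    if ce > 0:
        return WARNING, "EDAC WARNING: " + summary
    return OK, "EDAC OK: " + summary


def main():
    code, message = check_edac()
    print(message)
    return code

# As a plugin script, finish with: sys.exit(main())
```

Since the counters reset at boot, a real check would probably want to
remember the last value it saw and alert on increases rather than on any
nonzero count.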

For more info, see edac-util and edac-ctl from the edac-utils package
and:

https://buttersideup.com/mediawiki/index.php/Main_Page

https://www.kernel.org/doc/html/latest/driver-api/edac.html

Of course you need ECC RAM, but if you care about memory errors, you
should be using it anyway.

> Here's what I've got so far:
> 
> 1. Fedora includes /boot/memtest86+-5.01 on every installation. But
> this is a legacy/BIOS program. The idea of recommending folks enable
> CSM/legacy BIOS just to test their RAM is questionable because it
> means disabling UEFI Secure Boot to do it. Lie in wait malware is
> perhaps rare but plausible.  UEFI native memtest86+ is not free so it
> can't be included. I kinda wonder if including this should be
> deprecated?
> 
> 2. The kernel has a built-in memory tester. Therefore it can run on
> anything. But how good is it? Is it worth enabling? Should it be
> enabled for all kernels or just debug kernels? The code is pretty
> simple, so will it catch only the worst cases of bad RAM?
> # CONFIG_MEMTEST is not set
> https://elixir.bootlin.com/linux/v5.8-rc4/source/mm/memtest.c
> 
> 3. "memory interface test" used at Google, Apache 2.0 license
> https://github.com/stressapptest/stressapptest
> 
> 4. "multiple concurrent kernel compiles" and "GCC seems to have memory
> usage patterns that reliably trigger memory errors that aren't caught
> by memtest"
> https://lore.kernel.org/linux-btrfs/799cf552-4612-56c5-b44d-59458119e2b0@xxxxxxxxx/
> 
> Example of btrfs catching a bit flip:
> https://lore.kernel.org/linux-btrfs/f42fc0d6-5dc9-dd15-9d61-53efb04fad33@xxxxxxx/
> And also, this is not a good example of a memory tester. Some of the
> time the corruption happens before the csum is computed so, it's not
> going to catch everything.
> 
> Any other ideas how to make this better?
> 
> Thanks,
> -- 
> Chris Murphy
-- 
Ken Gaillot <kgaillot@xxxxxxxxxx>
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx



