On 7/15/20 1:11 PM, Chris Murphy wrote:
> Hi,
>
> While bad RAM is uncommon, it comes up with some regularity to cause folks a lot of grief. I'm wondering if there's a way to make it easier to get bad news :-\
>
> In particular there are cases where RAM defects just don't show up with a few hours of memtest86+, it can take days of contiguous testing, which is so inconvenient the test itself seems worse.
>
> Here's what I've got so far:
>
> 1. Fedora includes /boot/memtest86+-5.01 on every installation. But this is a legacy/BIOS program. The idea of recommending folks enable CSM/legacy BIOS just to test their RAM is questionable because it means disabling UEFI Secure Boot to do it. Lie in wait malware is perhaps rare but plausible. UEFI native memtest86+ is not free so it can't be included. I kinda wonder if including this should be deprecated?
>
> 2. The kernel has a built-in memory tester. Therefore it can run on anything. But how good is it? Is it worth enabling? Should it be enabled for all kernels or just debug kernels? The code is pretty simple, so will it catch only the worst cases of bad RAM?
>
> # CONFIG_MEMTEST is not set
> https://elixir.bootlin.com/linux/v5.8-rc4/source/mm/memtest.c

I wouldn't bother with CONFIG_MEMTEST. It's designed to work around the problem by calling memblock_reserve so the bad memory doesn't get used. It's nice if you want to keep a machine running but that doesn't sound like what we're going for.
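
To make the write/verify/reserve idea concrete, here is a minimal user-space sketch in Python. It is only an illustration under stated assumptions, not the kernel code: the real tester in mm/memtest.c works on physical memory during early boot and uses its own list of patterns. The `StuckBitBuffer` class, the `PATTERNS` list, and the function names are all hypothetical, invented here to simulate a faulty region; the returned ranges stand in for what would be handed to memblock_reserve.

```python
# User-space illustration only. The byte patterns below are illustrative and
# are NOT the pattern list used by the kernel's mm/memtest.c.
PATTERNS = [0x00, 0xFF, 0x55, 0xAA]


class StuckBitBuffer:
    """Hypothetical stand-in for faulty RAM: bit 0 is stuck at 1 at chosen offsets."""

    def __init__(self, size, stuck_offsets):
        self._data = bytearray(size)
        self._stuck = set(stuck_offsets)

    def __len__(self):
        return len(self._data)

    def __setitem__(self, index, value):
        if index in self._stuck:
            value |= 0x01  # the write "succeeds" but bit 0 always reads back as 1
        self._data[index] = value

    def __getitem__(self, index):
        return self._data[index]


def find_bad_ranges(buf, pattern):
    """Fill buf with pattern, read it back, and return mismatching [start, end) ranges."""
    for i in range(len(buf)):
        buf[i] = pattern
    bad = []
    start = None
    for i in range(len(buf)):
        if buf[i] != pattern:
            if start is None:
                start = i  # open a new bad range
        elif start is not None:
            bad.append((start, i))  # close the current bad range
            start = None
    if start is not None:
        bad.append((start, len(buf)))
    return bad


def scan(buf):
    """Run every pattern and collect the union of bad ranges (the 'reserve' list)."""
    bad = set()
    for pattern in PATTERNS:
        bad.update(find_bad_ranges(buf, pattern))
    return sorted(bad)
```

With `StuckBitBuffer(16, [3, 4, 9])`, the 0x00 and 0xAA patterns expose the stuck bit while 0xFF and 0x55 do not, since bit 0 is already set in those patterns. That is why such testers cycle through several patterns, and also why a simple pattern test can miss faults that only show up under particular access sequences or load.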
> 3. "memory interface test" used at Google, Apache 2.0 license
> https://github.com/stressapptest/stressapptest
>
> 4. "multiple concurrent kernel compiles" and "GCC seems to have memory usage patterns that reliably trigger memory errors that aren't caught by memtest"
> https://lore.kernel.org/linux-btrfs/799cf552-4612-56c5-b44d-59458119e2b0@xxxxxxxxx/
>
> Example of btrfs catching a bit flip:
> https://lore.kernel.org/linux-btrfs/f42fc0d6-5dc9-dd15-9d61-53efb04fad33@xxxxxxx/
>
> And also, this is not a good example of a memory tester. Some of the time the corruption happens before the csum is computed, so it's not going to catch everything.
>
> Any other ideas how to make this better?

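
The checksum caveat above can be shown with a small Python sketch. It is a simplified model under stated assumptions, not btrfs code: `zlib.crc32` stands in for the filesystem's checksum, and `flip_bit`, `write_block`, and `verify` are hypothetical helpers made up for this illustration.

```python
import zlib


def flip_bit(data, byte_index, bit):
    """Return a copy of data with a single bit flipped, simulating a RAM error."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def write_block(data, corrupt_before_csum=False):
    """Model a write: return (stored_data, stored_csum).

    If corrupt_before_csum is True, the bit flip happens while the data is
    still in RAM, before the checksum is computed over it.
    """
    if corrupt_before_csum:
        data = flip_bit(data, 0, 0)
    return data, zlib.crc32(data)  # CRC-32 used here as a stand-in checksum


def verify(stored_data, stored_csum):
    """Model a read: recompute the checksum and compare with the stored one."""
    return zlib.crc32(stored_data) == stored_csum


original = b"important data"

# Case 1: the flip happens after the checksum was computed. Verification fails,
# so the corruption is detected.
data, csum = write_block(original)
flipped_later = flip_bit(data, 0, 0)
caught = not verify(flipped_later, csum)

# Case 2: the flip happens before the checksum was computed. The checksum
# matches the already-corrupted data, so verification passes silently.
data2, csum2 = write_block(original, corrupt_before_csum=True)
silent = verify(data2, csum2) and data2 != original
```

In the first case `caught` is true; in the second `silent` is true, which is the situation where checksumming gives no warning at all.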
Detecting hardware faults is a very hard problem, unfortunately. I brought up this question at a conference a few years ago in the context of distinguishing real bugs from hardware issues, and nobody had any great suggestions. Much of it ends up coming down to analyzing the crash you are seeing and trying to figure out what doesn't make sense. Doing multiple kernel compiles is probably about as effective as anything I know about.

Thanks,
Laura
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx