On 2/11/22 11:44 AM, David Rosenstrauch via arch-general wrote:
> On 2/11/22 9:21 AM, Genes Lists via arch-general wrote:
>> Also it may be worthwhile running memcheck to be sure your memory is
>> not faulty.
>
> Yeah that thought occurred to me as well. I ran a quick memtest86+ when
> I first built the computer and it didn't show any issues. But perhaps I
> should take some downtime and have it run one through the full 3 rounds.

Following up with an update to close the loop on this thread, for anyone
who's interested.
So, the bad news: the machine crashed again a couple of times the other
day, once right in the middle of a large and important pacman update, no
less! (150+ packages, including the kernel, glibc, gcc, mariadb, and a
bunch of other key libraries. The machine wound up unbootable, and it took
several hours to recover. It was pretty ugly.) :-(
The good news (aside from being able to recover from the failed update)
is that I think I may have finally pinned down what's been causing these
issues.
Long story short, when I first built the box, the machine wouldn't POST
or boot, and the mobo's "DRAM" health LED indicator lit up. After much
digging I was able to determine that I had fastened the CPU cooler down
too tightly. Apparently that can bend the mobo and prevent some
components from working correctly. In my case it was one of the memory
slots, which is located very close to the CPU/cooler. (I confirmed this
when I was able to boot with only one of the 2 memory sticks installed.)
Once I pinpointed the issue, I loosened the cooler a bit, and from then
on I was able to boot normally with both sticks installed, every time.
Or so I thought! I'm now pretty sure that this has still been an issue,
and that the connection to one of the memory sticks has been
periodically cutting out mid-operation and causing these random crashes.
Evidence: a) while debugging one of the recent crashes, I again hit the
same issue where the machine wouldn't POST and showed the same DRAM LED,
and b) I again removed the memory stick from the slot in question,
rebooted, and the machine has been running without issue ever since
(albeit on half the RAM). I'm pretty sure the problem isn't a bad memory
stick, as I've run memtests several times with no issues. But I'll also
try swapping the sticks to confirm.
I'll give it a few more days of uptime to confirm that this was indeed
the issue, but I'm increasingly confident that it was. When I get some
time, I'll reinstall the cooler, thermal paste, and 2nd RAM stick and
try to get everything back to full RAM capacity without crashing.
Many thanks to everyone for all the helpful suggestions!
DR