Re: debugging RAM issues

Les Mikesell <lesmikesell@xxxxxxxxx> · Wed, 14 Mar 2012 14:55:00 -0500

On Wed, Mar 14, 2012 at 2:35 PM, John R Pierce <pierce@xxxxxxxxxxxx> wrote:
> On 03/14/12 12:16 PM, Les Mikesell wrote:
>> If you were running software RAID1 on that box, don't trust anything
>> on the drives now.   Maybe even if you weren't, but it is especially
>> weird when alternate reads randomly revive bad data that you thought
>> had been fixed already.
>
> and the worst part is, even if you found mismatching blocks on the
> mirrors, there's no way to know which one is the 'good' one, as there's
> no block checksumming or anything like that with conventional RAID.
>
> this is a major reason I *insist* on ECC for any sort of server other
> than a lightweight home system.   ECC memory will detect bit failures so
> you KNOW something is funky.
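
To make that point concrete, here is a minimal sketch in Python of what a
mirror comparison can and cannot tell you.  The function name and the
4096-byte block size are made up for illustration; this is not how md does
it internally, just the same idea applied to two in-memory images.

```python
from typing import List


def mismatched_blocks(copy_a: bytes, copy_b: bytes,
                      block_size: int = 4096) -> List[int]:
    """Return the indices of blocks that differ between two mirror images.

    Hypothetical helper for illustration only.  Note what is missing:
    nothing here can say WHICH copy holds the good data, because there
    is no per-block checksum to compare either side against.
    """
    if len(copy_a) != len(copy_b):
        raise ValueError("mirror images must be the same size")
    bad = []
    for offset in range(0, len(copy_a), block_size):
        block_a = copy_a[offset:offset + block_size]
        block_b = copy_b[offset:offset + block_size]
        if block_a != block_b:
            bad.append(offset // block_size)
    return bad
```

All you get back is a list of suspect block numbers.  On Linux md the
equivalent is writing "check" to /sys/block/mdX/md/sync_action and then
reading mismatch_cnt in the same directory, which likewise gives you a
count and no verdict on which side is right.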

I _thought_ the server where I had this problem was supposed to have
single-bit error correction, and I also thought that if ECC couldn't
correct an error the machine was supposed to crash instead of
continuing.  But maybe it had the wrong kind of RAM installed, or
something else had disabled ECC.
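
For anyone who wants to check rather than guess: on Linux, when an EDAC
driver is loaded for the memory controller, the kernel exposes corrected
and uncorrected error counters under /sys/devices/system/edac/mc.  Below
is a rough sketch in Python that reads them.  The function name is my own
invention, and an empty result only means no EDAC driver is reporting, not
that the box definitely lacks ECC.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    Hypothetical helper.  An empty dict means no memory controller is
    registered with EDAC (driver not loaded, ECC disabled, or non-ECC
    RAM); it cannot tell those cases apart.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers reported")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A nonzero corrected count means ECC is working and has been fixing errors;
no controllers at all is the case I suspect I was in.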

-- 
   Les Mikesell
     lesmikesell@xxxxxxxxx
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos