Re: how to debug hardware lockups?




On Tue, Nov 18, 2008 at 9:47 AM, Les Mikesell <lesmikesell@xxxxxxxxx> wrote:
>> Did you leave memtest86+ running for 2 days? I thought 1 or 2 cycles
>> would be good enough?
>>
>> I'm hoping to pick-up the server in the next 2 hours then I can see
>> what happens when I run memtest86+ or other tests
>
> Yes, apparently RAM errors can be subtle and only appear when certain
> adjacent bit patterns are stored - or when the moon is in a certain phase or
> something.
>
> --
>  Les Mikesell
>   lesmikesell@xxxxxxxxx

When we burn in machines to find errors, we also go with a run of a
day or two.  The one fun thing we found was that the failures were
often temperature related: a machine would crash in the rack, but once
it was moved to a test bench it would not exhibit the issue.  This is
especially true when the production load taxes both the CPU and the
memory, while the existing test tools can only really tax one or the
other at a time.  So blocking a bit of the air flow in the lab to heat
up the case, or being able to test in the same data center
environment, helped a lot.
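
For what it is worth, here is a rough sketch of loading the CPU and
the memory at the same time from a running system.  It assumes Python 3
is installed on the box; the function names (cpu_burn, memory_pass,
memory_burn, burn_in) and the byte patterns are made up for
illustration.  It is not a replacement for memtest86+: it runs under
the OS, only touches memory the kernel hands it, and reads may be
served from cache.  The point is only to generate heat from both
sides at once.

```python
import multiprocessing as mp
import time

# Illustrative byte patterns: all zeros, all ones, alternating bits.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def cpu_burn(seconds):
    """Keep one core busy with trial-division prime checks; return primes found."""
    deadline = time.monotonic() + seconds
    n, found = 3, 0
    while time.monotonic() < deadline:
        if all(n % d for d in range(3, int(n ** 0.5) + 1, 2)):
            found += 1
        n += 2
    return found


def memory_pass(size_bytes, pattern):
    """Fill a buffer with one byte pattern and count bytes that read back wrong."""
    buf = bytearray([pattern]) * size_bytes
    return size_bytes - buf.count(pattern)


def memory_burn(seconds, size_bytes):
    """Repeat memory_pass over all patterns for the given time; return mismatches."""
    deadline = time.monotonic() + seconds
    mismatches = 0
    while time.monotonic() < deadline:
        for pattern in PATTERNS:
            mismatches += memory_pass(size_bytes, pattern)
    return mismatches


def burn_in(seconds, cpu_workers, mem_workers, size_bytes):
    """Run CPU and memory workers side by side; return total memory mismatches."""
    with mp.Pool(cpu_workers + mem_workers) as pool:
        cpu_jobs = [pool.apply_async(cpu_burn, (seconds,))
                    for _ in range(cpu_workers)]
        mem_jobs = [pool.apply_async(memory_burn, (seconds, size_bytes))
                    for _ in range(mem_workers)]
        for job in cpu_jobs:
            job.get()  # wait for the CPU workers to finish
        return sum(job.get() for job in mem_jobs)
```

Call burn_in from under an "if __name__ == '__main__':" guard, with
something like one CPU worker per core and a buffer size well beyond
the CPU cache.  A nonzero return value would point at bad reads, but
on flaky hardware a crash or lockup during the run is the more likely
symptom.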

Most errors show up either in the first 2 minutes of running a memory
test or one of the prime number calculations, or they take a day or
more to show up.

Rob
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
