On 05/26/2010 11:07 AM, Bill Davidsen wrote:
> Doug Ledford wrote:
>> On 05/20/2010 06:38 PM, Neil Brown wrote:
>>
>>> On Thu, 20 May 2010 17:29:37 -0500
>>> Trey Scarborough <treys@xxxxxxxxxxxxxx> wrote:
>>>
>>>> Neil Brown wrote:
>>>>
>>>>> On Thu, 20 May 2010 12:02:23 -0500
>>>>> Trey Scarborough <treys@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> I have a raid 5 array with 9 disks and I have a mismatch_cnt that
>>>>>> keeps growing. This is causing file corruption on the underlaying
>>>>>> file systems as well. I can copy a group of 100 100mb files and
>>>>>> then do a md5sum on them and 1-3 will be corrupt. If this is a
>>>>>> drive that is bad is there anyway to run a report on the count per
>>>>>> drive that these mismatches occur. I have run smarttools test and
>>>>>> do not see one drive that stands out to be causing errors. Could
>>>>>> something else be causing these errors?
>>>>>>
>>
>> While a bad drive is certainly a possibility here, this is precisely the
>> type of failure scenario that would make me suspect bad RAM,
>> motherboard, or CPU. So I wouldn't rule those out as possibilities
>> either.
>>
>
> I have the same thought, I would remove half the RAM from the system and
> test again, then swap to the "other" half and repeat. Of course running
> memtest first is a good idea, but I have seen failures which only happen
> on disk access.

Indeed, I've seen lots of failures that only happen with disk access and
not with memory testers. Hence why I have a shell script on my web page
in my sig that uses disk access to test memory.

> If the system is O/C obviously the first step is to cut the speed back...
>

-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
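
The copy-then-md5sum check described in the original report can be scripted so that it runs repeatedly. What follows is a minimal Python sketch of that idea, not the shell script referenced above; the function names, file names, counts, and sizes are all hypothetical choices made for illustration. It writes files of random data, records the MD5 of each buffer before it is written, then reads the files back and lists any whose digest no longer matches.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_test_files(directory, count, size):
    """Write `count` files of `size` random bytes each.

    Returns a dict mapping each path to the MD5 of the data as it was
    generated in memory, before it went to disk.
    """
    expected = {}
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(directory, f"testfile_{i:03d}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        expected[path] = hashlib.md5(data).hexdigest()
    return expected


def find_corrupt_files(expected):
    """Re-read every file and return the sorted paths whose MD5 changed."""
    return sorted(
        path for path, digest in expected.items()
        if md5_of_file(path) != digest
    )


if __name__ == "__main__":
    # Hypothetical small run in a temporary directory; point `directory`
    # at the array under test and raise count/size for a real check.
    with tempfile.TemporaryDirectory() as directory:
        expected = write_test_files(directory, count=4, size=1 << 20)
        bad = find_corrupt_files(expected)
        print(f"{len(bad)} of {len(expected)} files failed verification")
        for path in bad:
            print("  mismatch:", path)
```

Note that a read-back shortly after the write may be served from the page cache rather than from the disks, so the total data written should be well above the amount of installed RAM if the goal is to exercise the full path through memory, controller, and drives.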