On 05/20/2010 06:38 PM, Neil Brown wrote:
> On Thu, 20 May 2010 17:29:37 -0500
> Trey Scarborough <treys@xxxxxxxxxxxxxx> wrote:
>
>> Neil Brown wrote:
>>> On Thu, 20 May 2010 12:02:23 -0500
>>> Trey Scarborough <treys@xxxxxxxxxxxxxx> wrote:
>>>
>>>
>>>> I have a raid 5 array with 9 disks and I have a mismatch_cnt that keeps
>>>> growing. This is causing file corruption on the underlying file systems
>>>> as well. I can copy a group of 100 100 MB files and then do a md5sum on
>>>> them and 1-3 will be corrupt. If this is a drive that is bad is there
>>>> any way to run a report on the count per drive that these mismatches
>>>> occur. I have run smarttools test and do not see one drive that stands
>>>> out to be causing errors. Could something else be causing these errors?
>>>>

While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM, motherboard,
or CPU. So I wouldn't rule those out as possibilities either.

>>>
>>> When RAID5 detects an inconsistency there is no way to know which device was
>>> wrong.
>>> SMART only detects some errors, not all.
>>> I have had hard drives before which appeared to have a single-bit error in
>>> their internal buffer. No error would be reported, but data you read would
>>> sometimes be wrong.
>>> RAID5 cannot help you with this sort of error.
>>>
>>> I would suggest backing up all your data (if it isn't already too late),
>>> breaking the array, and testing each device individually.
>>> e.g. create a filesystem on the device and try copying data on and reading it
>>> off.
>>>
>>> NeilBrown
>>>
>> That's what I was afraid of. The problem I have is if I back it up
>> knowing what data is bad. Luckily it appears to be a write error because
>> once written and correct I can do sums on all the files and I do not see
>> any more errors. I was thinking that there might be a way of doing a resync
>> and turning up the debug somehow so that it would log the mismatches
>> with both the drives that it was reading from at the time. I could then
>> take that information and considering there are 9 drives in the array
>> the one that comes out having the most should be the culprit. I could
>> then remove that drive from the array and test it leaving the rest in a
>> state that could be rebuilt and the data being consistent because the
>> drive with the bad write errors would be removed. Is this something that
>> might be possible?
>
> To detect a mismatch, raid5 reads from all drives in parallel, calculates the
> parity across the data blocks and compares that to the parity block.
> So no: something like that is not possible.
>
> Only thing I can suggest:
>
>  - add a write-intent bitmap so you can remove/re-add devices fairly cheaply
>  - create a very large file.
>  - write random data to the file without truncating it. (use dd of=file
>    conv=notrunc) then read it back and see if it matches. If it does, then
>    this approach doesn't help. If it doesn't:
>
>    1 by 1, fail/remove a drive from the array. Write new random data to the
>    same file and read it back and compare. Then --re-add the missing device.
>    I'm hoping that you will get an error every time except when the 'bad'
>    device has been removed.
>
> NeilBrown

--
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
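
For the write-and-read-back step Neil suggests, a rough Python sketch
follows. It is only an illustration under some assumptions: the function
name write_and_verify and the path passed to it are made up for this
example, SHA-256 stands in for md5sum, and the posix_fadvise call is
merely a hint to the kernel to drop cached pages, so the read may still
be served from memory rather than from the array. Failing, removing and
re-adding each drive between runs is left to mdadm and is not covered
here.

```python
import hashlib
import os


def write_and_verify(path, size_bytes, chunk_bytes=1 << 20):
    """Overwrite the first size_bytes of path with random data in place,
    read the same range back, and report whether the digests match.

    Hypothetical helper for illustration; not part of any RAID tooling.
    """
    written = hashlib.sha256()
    # Open without O_TRUNC so the file keeps its blocks (like dd conv=notrunc).
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        remaining = size_bytes
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            written.update(chunk)
            view = memoryview(chunk)
            while view:  # os.write may write fewer bytes than requested
                n = os.write(fd, view)
                view = view[n:]
            remaining -= len(chunk)

        # Push the data to the device before reading it back.
        os.fsync(fd)

        # Assumption: ask the kernel to drop cached pages for this file.
        # This is advisory only and exists on POSIX platforms such as Linux.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass

        os.lseek(fd, 0, os.SEEK_SET)
        read_back = hashlib.sha256()
        remaining = size_bytes
        while remaining > 0:
            data = os.read(fd, min(chunk_bytes, remaining))
            if not data:
                break  # short file: digests will differ
            read_back.update(data)
            remaining -= len(data)
    finally:
        os.close(fd)

    return written.hexdigest() == read_back.hexdigest()
```

Calling write_and_verify on a large file on the array once with all
drives present, and then once per removed drive, would give the
pass/fail pattern described above. A False result only says that the
data read back differed; it does not by itself say which component is
at fault.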