} -----Original Message-----
} From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of James
} Sent: Thursday, December 30, 2010 8:48 PM
} To: Neil Brown
} Cc: linux-raid@xxxxxxxxxxxxxxx
} Subject: Re: read errors corrected
}
} Neil,
}
} I'm running 2.6.35.
}
} Although an expensive route, the only thing I can think to do to
} determine 100% whether the issue is software or hardware (and, if
} hardware, whether SATA controller or the drives) is to swap the
} drives out.
}
} Ouch!
}
} Any other ideas, however, would be appreciated before I drop a few
} hundred bucks. :)

Just swap out 1 for now? :) I believe your drives are fine because
your SMART stats don't reflect the number of errors you see in the
logs (see the smartctl commands at the bottom of this message).

}
} -james
}
} On Thu, Dec 30, 2010 at 23:12, Neil Brown <neilb@xxxxxxx> wrote:
} > On Thu, 30 Dec 2010 11:35:59 -0500 James <jtp@xxxxxxxxx> wrote:
} >
} >> Sorry Neil, I meant to reply-all.
} >>
} >> -james
} >>
} >> On Thu, Dec 30, 2010 at 11:35, James <jtp@xxxxxxxxx> wrote:
} >> > Inline.
} >> >
} >> > On Thu, Dec 30, 2010 at 04:15, Neil Brown <neilb@xxxxxxx> wrote:
} >> >> On Thu, 30 Dec 2010 03:20:48 +0000 James <jtp@xxxxxxxxx> wrote:
} >> >>
} >> >>> All,
} >> >>>
} >> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
} >> >>> system and am seeing some errors in my logs as follows:
} >> >>>
} >> >>> # cat messages | grep "read erro"
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262528 on sda4)
} >> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
} >> >>> sectors at 974262536 on sda4)
} >> >> .....
} >> >>
} >> >>> I've Googled the heck out of this error message but am not seeing a
} >> >>> clear and concise answer: is this benign? What would cause these
} >> >>> errors? Should I be concerned?
} >> >>>
} >> >>> There is an error message (read error corrected) on each of the
} >> >>> drives in the array. They all seem to be functioning properly. The
} >> >>> I/O on the drives is pretty heavy for some parts of the day.
} >> >>>
} >> >>> # cat /proc/mdstat
} >> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
} >> >>> [raid4] [multipath]
} >> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
} >> >>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
} >> >>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
} >> >>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
} >> >>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
} >> >>>
} >> >>> unused devices: <none>
} >> >>>
} >> >>> I have a really hard time believing there's something wrong with all
} >> >>> of the drives in the array, although admittedly they're the same
} >> >>> model from the same manufacturer.
} >> >>>
} >> >>> Can someone point me in the right direction?
} >> >>> (a) what causes these errors precisely?
} >> >>
} >> >> When md/raid6 tries to read from a device and gets a read error, it
} >> >> tries to read from the other devices. When that succeeds, it computes
} >> >> the data that it had tried to read and writes it back to the original
} >> >> drive. If this succeeds, it assumes that the read error has been
} >> >> corrected by the write, and prints the message that you see.
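
To spell Neil's point out for future Googlers: md can be told to walk
this same read/repair path over the whole array on demand. A minimal
sketch, assuming the usual md sysfs interface (present in 2.6.35) and
using md4 from the mdstat output above:

# Start a scrub: md re-reads every sector of md4 and rewrites,
# from redundancy, any sector that fails to read.
echo check > /sys/block/md4/md/sync_action
# Watch progress in the usual place.
cat /proc/mdstat
# Parity mismatches found by the scrub; ideally 0.
cat /sys/block/md4/md/mismatch_cnt

If a scrub like this produces a burst of "read error corrected"
messages on one drive only, that drive is the suspect; bursts spread
evenly across all four point back at the controller or driver.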
} >> >>
} >> >>> (b) is the error benign? How can I determine if it is *likely* a
} >> >>> hardware problem? (I imagine it's probably impossible to tell if
} >> >>> it's HW until it's too late)
} >> >>
} >> >> A few occasional messages like this are fairly benign. They could be
} >> >> a sign that the drive surface is degrading. If you see lots of these
} >> >> messages, then you should seriously consider replacing the drive.
} >> >
} >> > Wow, this is hard for me to believe considering this is happening on
} >> > all the drives. It's not impossible, however, since the drives are
} >> > likely from the same batch.
} >> >
} >> >> As you are seeing these messages across all devices, it is possible
} >> >> that the problem is with the SATA controller rather than the disks.
} >> >> To know which, you should check the errors that are reported in
} >> >> dmesg. If you don't understand these messages, then post them to the
} >> >> list - feel free to post several hundred lines of logs - too much is
} >> >> much much better than not enough.
} >> >
} >> > I posted a few errors in my response to the thread a bit ago -- here's
} >> > another snippet:
} >> >
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 6a 00 00 80 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
} >> > driverbyte=0x06
} >
} > "Unhandled error code" sounds like it could be a driver problem...
} >
} > Try googling that error message...
} >
} > http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html
} >
} > "Also, please try the latest 2.6.34-rc kernel, as that has several fixes
} > for both pata_via and sata_via which did not make 2.6.33."
} >
} > What kernel are you running???
} >
} > NeilBrown
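
Worth checking which driver is actually bound to the controller before
blaming it. A quick sketch, assuming pciutils is installed (lspci -k
needs a reasonably recent version):

# The running kernel - James says 2.6.35.
uname -r
# The "Kernel driver in use:" line tells you whether the SATA
# controller is driven by sata_via, ahci, etc.
lspci -nnk | grep -iA3 sata
# libata-level errors name the port (ata1, ata2, ...), which maps
# a failure to a specific cable/controller slot.
dmesg | grep -i 'ata[0-9]'

If the driver is one of the VIA ones named in that link, the fixes
Neil quotes should already be in 2.6.35, assuming they were merged
for 2.6.34 final; a changelog check would confirm it.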
} >
} >
} >> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
} >> > 00 25 a2 a0 ea 00 00 38 00
} >> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923648 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923656 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923664 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923672 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923680 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923688 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923696 on sdb4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923520 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923528 on sdc4)
} >> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
} >> > sectors at 600923536 on sdc4)
} >> >
} >> > Is there a good way to determine if the issue is with the motherboard
} >> > (where the SATA controller is), or with the drives themselves?
} >> >
} >> >> NeilBrown
} >> >>
} >> >>
} >> >>> (c) are these errors expected in a RAID array that is heavily used?
} >> >>> (d) what kind of errors should I see regarding "read errors" that
} >> >>> *would* indicate an imminent hardware failure?
} >> >>>
} >> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
} >> >>> hefty discussion is thrown at this topic will help future Googlers
} >> >>> like me. :)
} >> >>>
} >> >>> -james
} >> >
} >
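
As promised above, the SMART angle: the drives keep their own error
counters independent of the kernel, so genuine media problems show up
there. A short sketch, assuming smartmontools is installed:

# SMART attributes - watch Reallocated_Sector_Ct,
# Current_Pending_Sector and UDMA_CRC_Error_Count.
smartctl -A /dev/sda
# The drive's internal error log.
smartctl -l error /dev/sda
# Kick off a full surface self-test; it runs inside the drive,
# and the result is read later with: smartctl -l selftest /dev/sda
smartctl -t long /dev/sda

Repeat for sdb, sdc and sdd. The useful split: rising reallocated or
pending sector counts implicate the drive surface, while CRC errors
(or four clean drives under the same kernel errors) point at the
cable, controller or driver instead - which, given "Unhandled error
code" on every disk, is where I'd look first.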