--- On Fri, 8/4/11, NeilBrown <neilb@xxxxxxx> wrote:

> From: NeilBrown <neilb@xxxxxxx>
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower" <gavinflower@xxxxxxxxx>
> Cc: linux-raid@xxxxxxxxxxxxxxx
> Date: Friday, 8 April, 2011, 23:50
>
> On Fri, 8 Apr 2011 02:59:52 -0700 (PDT) Gavin Flower <gavinflower@xxxxxxxxx> wrote:
>
> > --- On Fri, 8/4/11, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > > From: NeilBrown <neilb@xxxxxxx>
> > > Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> > [...]
> > > Obviously there is some sort of hardware issue - possibly a
> > > drive, possibly a bus problem - I really don't know.
> > >
> > > Apart from that things look normal.
> > >
> > > What exactly did you want explained?
> > >
> > > NeilBrown
> >
> > I guess I was surprised that the RAID system appeared normal and
> > that it did not register any errors. I was hoping to get an idea
> > as to which drive was problematic.
>
> sdc2 was reporting read error. md/raid6 computed the data from the other
> devices and wrote it back to sdc2. This appeared to work so md/raid6 assumed
> everything was fine again. It reported this:
>
> Apr 7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
>
> but didn't fail anything.
>
> > I get the feeling, from your reply, that this is not specifically
> > a RAID problem, that it just happens to affect a RAID array.
>
> No, it was clearly a disk-drive problem.
> e.g.
> Apr 7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to an 'ata' device failed. i.e. disk error.
>
> > I had thought that the RAID system should have been able to give
> > me better diagnostics, but possibly I am being (inadvertently)
> > unreasonable!
>
> Well.... it did tell you that it got a read error and corrected it.
>
> > Not sure what the significance of this mismatch is, and what I
> > should do about it.
> > # cat /sys/block/md2/md/mismatch_cnt
> > 28904
> > #
>
> I'm not sure if read errors end up counting as mismatches.. They seem to for
> raid1. The raid6 code is more complex and I don't feel like decoding it
> right now.
>
> In terms of "what to do about it" - the first thing must be to fix sdc.
> Maybe there is a loose cable or a broken cable. Maybe the device needs to be
> replaced.
>
> Once you have resolved that and are fairly sure your drives are all working,
>    echo check > /sys/block/md2/md/sync_action
>
> once that finishes mismatch_cnt should ideally be zero. If it isn't, try
>    echo repair > /sys/block/md2/md/sync_action
>
> but only do that if you are confident that your devices are good.
> This will result in the same mismatch_cnt. However a subsequent 'check'
> should then show zero.
>
> NeilBrown

Thanks, I followed your suggestions and all 'appears' to be fine now.

Reality was a wee bit more dramatic than I would have liked!

Machine refused to boot this morning, complaining about disk errors.
Fortunately, I had arranged for a hardware-capable friend to come around.
He adjusted the cable on the offending drive and I ran fsck twice (lots of
alarming messages the first time).

On rebooting, the system came up, but a video driver problem prevented the
desktop from working. Fortunately I was able to log in from another machine
and apply your suggested remedy.

After the repair, I rebooted and was able to get into my desktop. Subsequent
checks showed the mismatch counts to be all zero (I checked the failed RAID
array and the other 2).
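
For anyone repeating this later, the check/repair sequence Neil describes can be wrapped in a few small shell helpers. This is only a rough sketch: the function names and the MD_POLL_SECONDS variable are made up for illustration, and it assumes sync_action reads back "idle" once a pass has finished. The sysfs files are the ones named above.

```shell
#!/bin/sh
# Sketch only: function names and MD_POLL_SECONDS are illustrative, not part of md.
# DIR is an array's sysfs directory, e.g. /sys/block/md2/md.

md_start() {            # md_start DIR ACTION   (ACTION is "check" or "repair")
    case "$2" in
        check|repair) echo "$2" > "$1/sync_action" ;;
        *) echo "unsupported action: $2" >&2; return 1 ;;
    esac
}

md_wait_idle() {        # md_wait_idle DIR: poll until no check/repair is running
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${MD_POLL_SECONDS:-60}"
    done
}

md_mismatches() {       # md_mismatches DIR: print the current mismatch count
    cat "$1/mismatch_cnt"
}

# Typical sequence, run as root and only once the drive/cable problem is fixed:
#   md_start /sys/block/md2/md check && md_wait_idle /sys/block/md2/md
#   md_mismatches /sys/block/md2/md     # ideally 0; if not, repeat with "repair",
#                                       # then run one more "check"
```

The helpers take the directory as an argument so the same steps can be run against each array in turn (md1, md2, and so on).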