On Wed, 08 Sep 2010 06:16:16 +0000 "Michael Sallaway" <michael@xxxxxxxxxxxx> wrote:

>
> > -------Original Message-------
> > From: Neil Brown <neilb@xxxxxxx>
> > To: Michael Sallaway <michael@xxxxxxxxxxxx>
> > Cc: linux-raid@xxxxxxxxxxxxxxx
> > Subject: Re: 3-way mirrors
> > Sent: 08 Sep '10 06:02
> >
> > Hmm.... Drive B shouldn't be ejected from the array for a read error. md
> > should calculate the data for both A and B from the other devices and then
> > write that to A and B.
> > If the write fails, only then should it kick B from the array. Is that what
> > is happening?
> >
> > i.e. do you see messages like:
> >    read error corrected
> >    read error not correctable
> >    read error NOT corrected
> >
> > in the kernel logs??
> >
>
> The logs for the relevant section are below, at the bottom -- it's a "read
> error not correctable". So I'm guessing it's also failing a write, although
> I can't see the ATA error handling mentioning any writes -- it all looks
> like reads??

Yes, it is just reads.

It looks like you have an ancient kernel - older than April 2010 :-)

A patch went into 2.6.35 (and, I think, some 2.6.34.y) which fixed a bug that
caused md to drop devices in a degraded RAID6 when it could have fixed the
read error.
Commit 7b0bb5368a719

So a newer kernel might fix your problem for you.

> >
> > If the write is failing, then you want my bad-block-log patches - only they
> > aren't really finished yet and certainly aren't tested very well. I really
> > should get back to those.
>
> Interesting -- I'm not familiar with them, where would I find these patches?
> And what would they do -- just allow the bad blocks (even on writes), and
> keep the drive in the array? That's all I'm really after, in this case, I
> think.

I posted them to the list for review a few months ago and haven't got back to
them.

http://www.spinics.net/lists/raid/msg28813.html

I wouldn't recommend using them until they've seen more review and testing.

NeilBrown

>
> Thanks!
> Michael
>
>
>
> Syslog from the failure of the first drive:
>
> Sep 7 09:31:24 lechuck kernel: [51912.039892] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:24 lechuck kernel: [51912.048227] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:24 lechuck kernel: [51912.056685] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:24 lechuck kernel: [51912.065055] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep 7 09:31:24 lechuck kernel: [51912.065061] res 51/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:25 lechuck kernel: [51912.098113] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:25 lechuck kernel: [51912.106705] ata13.00: error: { UNC }
> Sep 7 09:31:25 lechuck kernel: [51912.128027] ata13.00: configured for UDMA/133
> Sep 7 09:31:25 lechuck kernel: [51912.128054] ata13: EH complete
> Sep 7 09:31:28 lechuck kernel: [51915.216232] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:28 lechuck kernel: [51915.224757] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:28 lechuck kernel: [51915.233283] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:28 lechuck kernel: [51915.241660] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep 7 09:31:28 lechuck kernel: [51915.241662] res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:28 lechuck kernel: [51915.275603] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:28 lechuck kernel: [51915.284267] ata13.00: error: { UNC }
> Sep 7 09:31:28 lechuck kernel: [51915.305722] ata13.00: configured for UDMA/133
> Sep 7 09:31:28 lechuck kernel: [51915.305746] ata13: EH complete
> Sep 7 09:31:30 lechuck kernel: [51917.992164] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:30 lechuck kernel: [51918.000791] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:30 lechuck kernel: [51918.009631] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:30 lechuck kernel: [51918.018303] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep 7 09:31:30 lechuck kernel: [51918.018305] res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:30 lechuck kernel: [51918.054117] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:30 lechuck kernel: [51918.062808] ata13.00: error: { UNC }
> Sep 7 09:31:30 lechuck kernel: [51918.084521] ata13.00: configured for UDMA/133
> Sep 7 09:31:30 lechuck kernel: [51918.084547] ata13: EH complete
> Sep 7 09:31:33 lechuck kernel: [51920.956122] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:33 lechuck kernel: [51920.964858] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:33 lechuck kernel: [51920.973829] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:33 lechuck kernel: [51920.982587] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep 7 09:31:33 lechuck kernel: [51920.982589] res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:33 lechuck kernel: [51921.017401] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:33 lechuck kernel: [51921.026134] ata13.00: error: { UNC }
> Sep 7 09:31:33 lechuck kernel: [51921.048656] ata13.00: configured for UDMA/133
> Sep 7 09:31:33 lechuck kernel: [51921.048680] ata13: EH complete
> Sep 7 09:31:37 lechuck kernel: [51924.153414] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:37 lechuck kernel: [51924.162178] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:37 lechuck kernel: [51924.162182] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:37 lechuck kernel: [51924.162189] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
> Sep 7 09:31:37 lechuck kernel: [51924.162190] res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:37 lechuck kernel: [51924.162193] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:37 lechuck kernel: [51924.162195] ata13.00: error: { UNC }
> Sep 7 09:31:37 lechuck kernel: [51924.175348] ata13.00: configured for UDMA/133
> Sep 7 09:31:37 lechuck kernel: [51924.175374] ata13: EH complete
> Sep 7 09:31:39 lechuck kernel: [51927.005666] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
> Sep 7 09:31:39 lechuck kernel: [51927.014384] ata13.00: irq_stat 0x40000008
> Sep 7 09:31:39 lechuck kernel: [51927.023299] ata13.00: failed command: READ FPDMA QUEUED
> Sep 7 09:31:39 lechuck kernel: [51927.031949] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
> Sep 7 09:31:39 lechuck kernel: [51927.031951] res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
> Sep 7 09:31:39 lechuck kernel: [51927.066322] ata13.00: status: { DRDY ERR }
> Sep 7 09:31:39 lechuck kernel: [51927.074946] ata13.00: error: { UNC }
> Sep 7 09:31:40 lechuck kernel: [51927.096349] ata13.00: configured for UDMA/133
> Sep 7 09:31:40 lechuck kernel: [51927.096393] sd 12:0:0:0: [sdm] Unhandled sense code
> Sep 7 09:31:40 lechuck kernel: [51927.096396] sd 12:0:0:0: [sdm] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Sep 7 09:31:40 lechuck kernel: [51927.096401] sd 12:0:0:0: [sdm] Sense Key : Medium Error [current] [descriptor]
> Sep 7 09:31:40 lechuck kernel: [51927.096406] Descriptor sense data with sense descriptors (in hex):
> Sep 7 09:31:40 lechuck kernel: [51927.096409] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Sep 7 09:31:40 lechuck kernel: [51927.096420] 5d d9 20 a3
> Sep 7 09:31:40 lechuck kernel: [51927.096425] sd 12:0:0:0: [sdm] Add. Sense: Unrecovered read error - auto reallocate failed
> Sep 7 09:31:40 lechuck kernel: [51927.096431] sd 12:0:0:0: [sdm] CDB: Read(10): 28 00 5d d9 20 00 00 00 d8 00
> Sep 7 09:31:40 lechuck kernel: [51927.096442] end_request: I/O error, dev sdm, sector 1574510755
> Sep 7 09:31:40 lechuck kernel: [51927.104975] raid5:md10: read error not correctable (sector 1574510752 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.104985] raid5: Disk failure on sdm, disabling device.
> Sep 7 09:31:40 lechuck kernel: [51927.104989] raid5: Operation continuing on 10 devices.
> Sep 7 09:31:40 lechuck kernel: [51927.122210] raid5:md10: read error not correctable (sector 1574510760 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122214] raid5:md10: read error not correctable (sector 1574510768 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122218] raid5:md10: read error not correctable (sector 1574510776 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122222] raid5:md10: read error not correctable (sector 1574510784 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122225] raid5:md10: read error not correctable (sector 1574510792 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122229] raid5:md10: read error not correctable (sector 1574510800 on sdm).
> Sep 7 09:31:40 lechuck kernel: [51927.122242] ata13: EH complete
> Sep 7 09:31:40 lechuck kernel: [51927.142926] md: md10: recovery done.
> Sep 7 09:31:40 lechuck mdadm[3840]: Fail event detected on md device /dev/md10, component device /dev/sdm
> Sep 7 09:31:40 lechuck kernel: [51927.344026] RAID5 conf printout:
> Sep 7 09:31:40 lechuck kernel: [51927.344031] --- rd:12 wd:10
> Sep 7 09:31:40 lechuck kernel: [51927.344034] disk 0, o:1, dev:sdf
> Sep 7 09:31:40 lechuck kernel: [51927.344037] disk 1, o:1, dev:sdb
> Sep 7 09:31:40 lechuck kernel: [51927.344039] disk 2, o:1, dev:sda
> Sep 7 09:31:40 lechuck kernel: [51927.344042] disk 3, o:1, dev:sdc
> Sep 7 09:31:40 lechuck kernel: [51927.344044] disk 4, o:1, dev:sdj
> Sep 7 09:31:40 lechuck kernel: [51927.344047] disk 5, o:1, dev:sdi
> Sep 7 09:31:40 lechuck kernel: [51927.344049] disk 6, o:1, dev:sdp
> Sep 7 09:31:40 lechuck kernel: [51927.344052] disk 7, o:1, dev:sdn
> Sep 7 09:31:40 lechuck kernel: [51927.344054] disk 8, o:1, dev:sdo
> Sep 7 09:31:40 lechuck kernel: [51927.344057] disk 9, o:0, dev:sdm
> Sep 7 09:31:40 lechuck kernel: [51927.344059] disk 10, o:1, dev:sdk
> Sep 7 09:31:40 lechuck kernel: [51927.344062] disk 11, o:1, dev:sdl
> Sep 7 09:31:40 lechuck kernel: [51927.344064] RAID5 conf printout:
> Sep 7 09:31:40 lechuck kernel: [51927.344066] --- rd:12 wd:10
> Sep 7 09:31:40 lechuck kernel: [51927.344068] disk 0, o:1, dev:sdf
> Sep 7 09:31:40 lechuck kernel: [51927.344070] disk 1, o:1, dev:sdb
> Sep 7 09:31:40 lechuck kernel: [51927.344073] disk 2, o:1, dev:sda
> Sep 7 09:31:40 lechuck kernel: [51927.344075] disk 3, o:1, dev:sdc
> Sep 7 09:31:40 lechuck kernel: [51927.344077] disk 4, o:1, dev:sdj
> Sep 7 09:31:40 lechuck kernel: [51927.344080] disk 5, o:1, dev:sdi
> Sep 7 09:31:40 lechuck kernel: [51927.344082] disk 6, o:1, dev:sdp
> Sep 7 09:31:40 lechuck kernel: [51927.344084] disk 7, o:1, dev:sdn
> Sep 7 09:31:40 lechuck kernel: [51927.344087] disk 8, o:1, dev:sdo
> Sep 7 09:31:40 lechuck kernel: [51927.344089] disk 9, o:0, dev:sdm
> Sep 7 09:31:40 lechuck kernel: [51927.344091] disk 10, o:1, dev:sdk
> Sep 7 09:31:40 lechuck kernel: [51927.344093] disk 11, o:1, dev:sdl
> Sep 7 09:31:40 lechuck kernel: [51927.344095] RAID5 conf printout:
> Sep 7 09:31:40 lechuck kernel: [51927.344097] --- rd:12 wd:10
> Sep 7 09:31:40 lechuck kernel: [51927.344100] disk 0, o:1, dev:sdf
> Sep 7 09:31:40 lechuck kernel: [51927.344102] disk 1, o:1, dev:sdb
> Sep 7 09:31:40 lechuck kernel: [51927.344104] disk 2, o:1, dev:sda
> Sep 7 09:31:40 lechuck kernel: [51927.344106] disk 3, o:1, dev:sdc
> Sep 7 09:31:40 lechuck kernel: [51927.344109] disk 4, o:1, dev:sdj
> Sep 7 09:31:40 lechuck kernel: [51927.344111] disk 5, o:1, dev:sdi
> Sep 7 09:31:40 lechuck kernel: [51927.344113] disk 6, o:1, dev:sdp
> Sep 7 09:31:40 lechuck kernel: [51927.344116] disk 7, o:1, dev:sdn
> Sep 7 09:31:40 lechuck kernel: [51927.344118] disk 8, o:1, dev:sdo
> Sep 7 09:31:40 lechuck kernel: [51927.344120] disk 9, o:0, dev:sdm
> Sep 7 09:31:40 lechuck kernel: [51927.344122] disk 10, o:1, dev:sdk
> Sep 7 09:31:40 lechuck kernel: [51927.344125] disk 11, o:1, dev:sdl
> Sep 7 09:31:40 lechuck kernel: [51927.400014] RAID5 conf printout:
> Sep 7 09:31:40 lechuck kernel: [51927.400017] --- rd:12 wd:10
> Sep 7 09:31:40 lechuck kernel: [51927.400020] disk 0, o:1, dev:sdf
> Sep 7 09:31:40 lechuck kernel: [51927.400022] disk 1, o:1, dev:sdb
> Sep 7 09:31:40 lechuck kernel: [51927.400025] disk 2, o:1, dev:sda
> Sep 7 09:31:40 lechuck kernel: [51927.400027] disk 3, o:1, dev:sdc
> Sep 7 09:31:40 lechuck kernel: [51927.400029] disk 4, o:1, dev:sdj
> Sep 7 09:31:40 lechuck kernel: [51927.400032] disk 5, o:1, dev:sdi
> Sep 7 09:31:40 lechuck kernel: [51927.400034] disk 6, o:1, dev:sdp
> Sep 7 09:31:40 lechuck kernel: [51927.400036] disk 7, o:1, dev:sdn
> Sep 7 09:31:40 lechuck kernel: [51927.400039] disk 8, o:1, dev:sdo
> Sep 7 09:31:40 lechuck kernel: [51927.400041] disk 10, o:1, dev:sdk
> Sep 7 09:31:40 lechuck kernel: [51927.400043] disk 11, o:1, dev:sdl
> Sep 7 09:31:40 lechuck kernel: [51927.400138] md: recovery of RAID array md10
> Sep 7 09:31:40 lechuck kernel: [51927.400141] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Sep 7 09:31:40 lechuck kernel: [51927.400145] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Sep 7 09:31:40 lechuck kernel: [51927.400155] md: using 128k window, over a total of 1465138496 blocks.
> Sep 7 09:31:40 lechuck kernel: [51927.400159] md: resuming recovery of md10 from checkpoint.
> Sep 7 09:31:40 lechuck mdadm[3840]: RebuildFinished event detected on md device /dev/md10, component device mismatches found: 477544
> Sep 7 09:31:40 lechuck mdadm[3840]: RebuildStarted event detected on md device /dev/md10
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
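
For anyone wanting to run the check suggested in this thread (looking in the kernel log for "read error corrected", "read error not correctable" or "read error NOT corrected"), here is a minimal Python sketch. It is not part of md or mdadm. The function name classify_md_read_errors is made up for illustration, and the "(sector N on DEV)" parsing assumes the message format seen in the quoted syslog; other kernels may word it differently.

```python
import re
from collections import Counter

# The three md message fragments named in the thread.
MD_READ_ERROR = re.compile(r"read error (corrected|not correctable|NOT corrected)")
# Optional "(sector N on DEV)" suffix; format assumed from the quoted syslog only.
SECTOR_ON_DEV = re.compile(r"\(sector (\d+) on (\w+)\)")


def classify_md_read_errors(lines):
    """Count md read-error outcomes and list (device, sector) for uncorrectable ones."""
    counts = Counter()
    uncorrectable = []
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match is None:
            continue  # not an md read-error message
        outcome = match.group(1)
        counts[outcome] += 1
        if outcome == "not correctable":
            location = SECTOR_ON_DEV.search(line)
            if location:
                uncorrectable.append((location.group(2), int(location.group(1))))
    return counts, uncorrectable


# Two lines copied from the quoted syslog, plus one unrelated line.
sample = [
    "kernel: [51927.104975] raid5:md10: read error not correctable (sector 1574510752 on sdm).",
    "kernel: [51927.104985] raid5: Disk failure on sdm, disabling device.",
    "kernel: [51927.122210] raid5:md10: read error not correctable (sector 1574510760 on sdm).",
]
counts, uncorrectable = classify_md_read_errors(sample)
for outcome, n in sorted(counts.items()):
    print(f"{n} x read error {outcome}")
for dev, sector in uncorrectable:
    print(f"not correctable: {dev} sector {sector}")
```

To run it over a real log, pass an open file object instead of the sample list, for example classify_md_read_errors(open(path)) where path is wherever your distribution keeps kernel messages. A log showing only "not correctable" entries and no "corrected" ones would match the situation described above.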