On Mon, 21 Oct 2013 19:01:33 +0400 Michael Tokarev <mjt@xxxxxxxxxx> wrote:

> Hello.
>
> I've a raid1 array (composed of 4 drives, so it is a 4-fold
> copy of data), and one of the drives has an unreadable (bad)
> sector in the partition belonging to this array.
>
> When I run md 'repair' action, it hits the error place, the
> kernel clearly returns an error, but md does not do anything
> with it. For example:
>
> Oct 21 18:43:55 mother kernel: [190018.073098] md: requested-resync of RAID array md1
> Oct 21 18:43:55 mother kernel: [190018.093910] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Oct 21 18:43:55 mother kernel: [190018.114765] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
> Oct 21 18:43:55 mother kernel: [190018.136459] md: using 128k window, over a total of 2096064k.
> Oct 21 18:45:11 mother kernel: [190094.091974] ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0
> Oct 21 18:45:11 mother kernel: [190094.114093] ata6.00: irq_stat 0x40000008
> Oct 21 18:45:11 mother kernel: [190094.135906] ata6.00: failed command: READ FPDMA QUEUED
> Oct 21 18:45:11 mother kernel: [190094.157710] ata6.00: cmd 60/00:00:00:3b:3e/04:00:00:00:00/40 tag 0 ncq 524288 in
> Oct 21 18:45:11 mother kernel: [190094.157710] res 41/40:00:29:3e:3e/00:00:00:00:00/40 Emask 0x409 (media error) <F>
> Oct 21 18:45:11 mother kernel: [190094.202315] ata6.00: status: { DRDY ERR }
> Oct 21 18:45:11 mother kernel: [190094.224517] ata6.00: error: { UNC }
> Oct 21 18:45:11 mother kernel: [190094.248920] ata6.00: configured for UDMA/133
> Oct 21 18:45:11 mother kernel: [190094.271003] sd 5:0:0:0: [sdc] Unhandled sense code
> Oct 21 18:45:11 mother kernel: [190094.293044] sd 5:0:0:0: [sdc]
> Oct 21 18:45:11 mother kernel: [190094.314654] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Oct 21 18:45:11 mother kernel: [190094.336483] sd 5:0:0:0: [sdc]
> Oct 21 18:45:11 mother kernel: [190094.357966] Sense Key : Medium Error [current] [descriptor]
> Oct 21 18:45:11 mother kernel: [190094.379808] Descriptor sense data with sense descriptors (in hex):
> Oct 21 18:45:11 mother kernel: [190094.402024] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Oct 21 18:45:11 mother kernel: [190094.424502] 00 3e 3e 29
> Oct 21 18:45:11 mother kernel: [190094.446338] sd 5:0:0:0: [sdc]
> Oct 21 18:45:11 mother kernel: [190094.467995] Add. Sense: Unrecovered read error - auto reallocate failed
> Oct 21 18:45:11 mother kernel: [190094.490075] sd 5:0:0:0: [sdc] CDB:
> Oct 21 18:45:11 mother kernel: [190094.511870] Read(10): 28 00 00 3e 3b 00 00 04 00 00
> Oct 21 18:45:11 mother kernel: [190094.533829] end_request: I/O error, dev sdc, sector 4079145
> Oct 21 18:45:11 mother kernel: [190094.555800] ata6: EH complete
> Oct 21 18:45:22 mother kernel: [190105.602687] md: md1: requested-resync done.
>
> There's no indication that raid code tried to re-write the bad spot,
> and the bad block remains bad in the drive, so next read (direct from
> the drive) return the same I/O error with the same kernel messages.
>
> Shouldn't `repair' action re-write the problem place?

Yes it should.

When end_sync_read() notices that BIO_UPTODATE isn't set, it refuses to set
R1BIO_Uptodate.
When sync_request_write() notices that R1BIO_Uptodate isn't set, it calls
fix_sync_read_error().

fix_sync_read_error() then calls sync_page_io() for each page in the region,
and if that fails (as you would expect), it goes on to the next disk, and the
next, until a working one is found. Then that block is written back to all
those that failed.

fix_sync_read_error() doesn't report any success, but as it re-reads the
failing device you should see the SCSI read error reported a second time at
least.

Are you able to add some tracing, recompile the kernel, and see if you can
find out what is happening?
e.g.
  if end_sync_read doesn't see BIO_UPTODATE, print something.
  if sync_request_write doesn't see R1BIO_Uptodate, print something.
  when fix_sync_read_error calls sync_page_io, print something.
??

Thanks,
NeilBrown

>
> This is kernel 3.10.15.
>
> Thank you!
>
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
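
For reference, the recovery sequence described in the reply above (re-read the failing device, move on to the next mirror until one read succeeds, then write that block back to every mirror that failed) can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the code in drivers/md/raid1.c: struct mirror, mirror_read(), mirror_write() and fix_read_error_model() are hypothetical names, and the model assumes that rewriting a block clears the bad sector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NMIRRORS 4   /* hypothetical: a 4-way mirror as in the report */
#define BLOCK    8   /* hypothetical tiny block size for the model */

/* Hypothetical stand-in for one member device of the mirror. */
struct mirror {
    unsigned char data[BLOCK];
    bool unreadable;     /* reads fail until the block is rewritten */
    int read_attempts;   /* lets a caller see that a re-read happened */
    int writes;
};

/* Model of a read: fails on a bad sector, otherwise copies the block out. */
bool mirror_read(struct mirror *m, unsigned char *buf)
{
    m->read_attempts++;
    if (m->unreadable)
        return false;
    memcpy(buf, m->data, BLOCK);
    return true;
}

/* Model of a write. Assumption: a rewrite makes the sector readable again. */
void mirror_write(struct mirror *m, const unsigned char *buf)
{
    memcpy(m->data, buf, BLOCK);
    m->unreadable = false;
    m->writes++;
}

/*
 * Start at the mirror whose read failed, try each mirror in turn until one
 * read succeeds, then write the good block back to every mirror that failed.
 * Returns the number of mirrors rewritten, or -1 if no mirror was readable.
 */
int fix_read_error_model(struct mirror *mirrors, size_t n, size_t first)
{
    unsigned char buf[BLOCK];
    bool failed[NMIRRORS] = { false };
    size_t i, d = first;
    int fixed = 0;

    assert(n <= NMIRRORS && first < n);

    for (i = 0; i < n; i++) {
        if (mirror_read(&mirrors[d], buf))
            break;
        failed[d] = true;
        d = (d + 1) % n;
    }
    if (i == n)
        return -1;       /* nothing readable, nothing to write back */

    for (i = 0; i < n; i++) {
        if (failed[i]) {
            mirror_write(&mirrors[i], buf);
            fixed++;
        }
    }
    return fixed;
}
```

In this model the failing mirror is read once more before any write-back. That matches the expectation stated above that the read error should show up a second time if the repair path is actually reached.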