Re: RAID1 scrub ignoring read errors?

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Mon, 3 Dec 2018 20:24:01 -0700

On Sun, Dec 2, 2018 at 10:51 AM Niklas Hambüchen <mail@xxxxxx> wrote:
>
> Hello,
>
> today I got alerted by mdadm via email that a disk on one of my servers failed.
>
> On the machine, I see /dev/sda1 as faulty:
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       17        1      active sync   /dev/sdb1
>
>        0       8        1        -      faulty   /dev/sda1
>
> and in dmesg:
>
>     ata1.00: configured for UDMA/133
>     sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>     sd 0:0:0:0: [sda] tag#18 Sense Key : Illegal Request [current] [descriptor]
>     sd 0:0:0:0: [sda] tag#18 Add. Sense: Logical block address out of range
>     sd 0:0:0:0: [sda] tag#18 CDB: Write(16) 8a 00 00 00 00 00 00 06 40 10 00 00 00 08 00 00
>     blk_update_request: I/O error, dev sda, sector 409616
>     md: super_written gets error=-5
>     md/raid1:md0: Disk failure on sda1, disabling device.
>     md/raid1:md0: Operation continuing on 1 devices.
>
> Note this is a Write(16) error.
> However, scrolling up in dmesg, I see lots of Read(16) errors for *both* /dev/sda and /dev/sdb:
>
> For sdb, at [7723679.793801]:
>
>     ata3.00: exception Emask 0x0 SAct 0x7c SErr 0x0 action 0x0
>     ata3.00: irq_stat 0x40000008
>     ata3.00: failed command: READ FPDMA QUEUED
>     ata3.00: cmd 60/00:10:00:6e:e4/0a:00:00:00:00/40 tag 2 ncq 1310720 in
>              res 41/40:00:30:73:e4/00:00:00:00:00/40 Emask 0x409 (media error) <F>
>     ata3.00: status: { DRDY ERR }
>     ata3.00: error: { UNC }
>     ata3.00: configured for UDMA/133
>     sd 2:0:0:0: [sdb] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>     sd 2:0:0:0: [sdb] tag#2 Sense Key : Medium Error [current] [descriptor]
>     sd 2:0:0:0: [sdb] tag#2 Add. Sense: Unrecovered read error - auto reallocate failed
>     sd 2:0:0:0: [sdb] tag#2 CDB: Read(16) 88 00 00 00 00 00 00 e4 6e 00 00 00 0a 00 00 00
>     blk_update_request: I/O error, dev sdb, sector 14971696
>     ata3: EH complete
>
> For sda, at [7723688.533758]:
>
>     ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
>     ata1.00: irq_stat 0x40000008
>     ata1.00: failed command: READ FPDMA QUEUED
>     ata1.00: cmd 60/80:18:80:d4:e5/00:00:00:00:00/40 tag 3 ncq 65536 in
>              res 41/40:00:b8:d4:e5/00:00:00:00:00/40 Emask 0x409 (media error) <F>
>     ata1.00: status: { DRDY ERR }
>     ata1.00: error: { UNC }
>     ata1.00: configured for UDMA/133
>     sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>     sd 0:0:0:0: [sda] tag#3 Sense Key : Medium Error [current] [descriptor]
>     sd 0:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
>     sd 0:0:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 00 00 e5 d4 80 00 00 00 80 00 00
>     blk_update_request: I/O error, dev sda, sector 15062200
>     ata1: EH complete
>
> Why is it that only sda1 is marked as faulty when both sda and sdb had unrecovered read errors earlier?
> Does md consider only write failures real failures?
> How does the logic work?
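
As a side check on the quoted logs, the sector numbers that
blk_update_request prints can be cross-checked against the CDB bytes.
Below is a minimal Python sketch, assuming the standard 16-byte
READ(16)/WRITE(16) layout (opcode in byte 0, 8-byte big-endian LBA in
bytes 2-9, 4-byte transfer length in bytes 10-13). The function name
decode_cdb16 is made up for illustration.

```python
def decode_cdb16(cdb_hex):
    """Decode opcode, LBA and transfer length from a 16-byte SCSI CDB.

    cdb_hex is the space-separated hex string as printed in dmesg.
    """
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16:
        raise ValueError("expected a 16-byte CDB")
    # Only the two opcodes that appear in the quoted logs are named here.
    names = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    opcode = names.get(raw[0], hex(raw[0]))
    lba = int.from_bytes(raw[2:10], "big")      # starting sector
    length = int.from_bytes(raw[10:14], "big")  # number of sectors
    return opcode, lba, length


# The failed write on sda from the quoted dmesg output.
print(decode_cdb16("8a 00 00 00 00 00 00 06 40 10 00 00 00 08 00 00"))
```

For the write on sda this gives LBA 409616, the same sector
blk_update_request reports. For the two reads, the reported sector lies
inside the range the CDB describes rather than at its start.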

The md driver treats a single write failure as fatal for that member,
unless a bad-block list is configured. Offhand, I'm not sure how many
read errors per unit time are permitted before a device is considered
faulty. But as long as each read error results in a successful read
from the other device and a successful overwrite of the affected
sector on the device that reported the error, it's a fixup. That
shouldn't be a problem, and so the device isn't faulty (the
manufacturer would say it is working as designed).
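
To make that distinction concrete, here is a toy Python model of the
policy as described above. It is not the md driver's code. The class,
field, and function names are all hypothetical, and what happens when
the mirror read also fails is left open here, since that case is not
covered above.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Toy stand-in for one RAID1 member device (hypothetical)."""
    name: str
    use_bad_block_list: bool = False
    faulty: bool = False
    bad_blocks: set = field(default_factory=set)
    corrected_reads: int = 0


def on_write_error(dev, sector):
    """A write failure is fatal unless a bad-block list can record it."""
    if dev.use_bad_block_list:
        dev.bad_blocks.add(sector)
        return "recorded"
    dev.faulty = True
    return "faulty"


def on_read_error(dev, sector, mirror_read_ok, rewrite_ok):
    """A read error is a fixup if the mirror read and the rewrite succeed."""
    if not mirror_read_ok:
        # Assumption: not covered by the description, left undecided.
        return "unrecovered"
    if rewrite_ok:
        dev.corrected_reads += 1
        return "fixup"
    # The overwrite failed, so this is now a write failure.
    return on_write_error(dev, sector)


sda1 = Member("sda1")
sdb1 = Member("sdb1")
print(on_read_error(sdb1, 14971696, mirror_read_ok=True, rewrite_ok=True))
print(on_write_error(sda1, 409616))
```

With the sector numbers from the quoted logs, the read error on sdb
comes out as a fixup and the device stays in the array, while the
write error on sda marks that member faulty.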

-- 
Chris Murphy