Re: md resync ignoring unreadable sectors

On 08/02/15 08:47, Roman Mamedov wrote:
Hello,

I've got some bad sectors on one drive:

dd: reading `/dev/sdh1': Input/output error
260200+0 records in
260200+0 records out
133222400 bytes (133 MB) copied, 2.97188 s, 44.8 MB/s

[ 3908.350331] ata9.00: exception Emask 0x0 SAct 0x40000 SErr 0x0 action 0x0
[ 3908.350385] ata9.00: irq_stat 0x40000008
[ 3908.350427] ata9.00: failed command: READ FPDMA QUEUED
[ 3908.350474] ata9.00: cmd 60/06:90:6a:00:04/00:00:00:00:00/40 tag 18 ncq 3072 in
[ 3908.350474]          res 51/40:06:6a:00:04/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[ 3908.350628] ata9.00: status: { DRDY ERR }
[ 3908.350669] ata9.00: error: { UNC }
[ 3908.354643] ata9.00: configured for UDMA/133
[ 3908.354664] sd 8:0:0:0: [sdh] Unhandled sense code
[ 3908.354668] sd 8:0:0:0: [sdh]
[ 3908.354671] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3908.354674] sd 8:0:0:0: [sdh]
[ 3908.354677] Sense Key : Medium Error [current] [descriptor]
[ 3908.354681] Descriptor sense data with sense descriptors (in hex):
[ 3908.354683]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 3908.354695]         00 04 00 6a
[ 3908.354701] sd 8:0:0:0: [sdh]
[ 3908.354705] Add. Sense: Unrecovered read error - auto reallocate failed
[ 3908.354708] sd 8:0:0:0: [sdh] CDB:
[ 3908.354710] Read(10): 28 00 00 04 00 6a 00 00 06 00
[ 3908.354721] end_request: I/O error, dev sdh, sector 262250
[ 3908.354773] Buffer I/O error on device sdh1, logical block 260202
[ 3908.354825] Buffer I/O error on device sdh1, logical block 260203
[ 3908.354891] Buffer I/O error on device sdh1, logical block 260204
[ 3908.354942] Buffer I/O error on device sdh1, logical block 260205
[ 3908.354992] Buffer I/O error on device sdh1, logical block 260206
[ 3908.355042] Buffer I/O error on device sdh1, logical block 260207
[ 3908.355125] ata9: EH complete
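The sector numbers in the log above can be cross-checked: the whole-disk number ("dev sdh, sector 262250") and the partition-relative one ("sdh1, logical block 260202") should differ by exactly the partition's start offset. A quick sketch, assuming sdh1 starts at sector 2048 (the common 1 MiB alignment; this offset is an assumption, not from the log, so verify with `fdisk -l /dev/sdh`):

```shell
# Cross-check the kernel's numbers: whole-disk sector vs partition-relative block.
PART_START=2048                  # first sector of sdh1 on sdh (ASSUMED; check fdisk -l)
DISK_SECTOR=262250               # "end_request: I/O error, dev sdh, sector 262250"
PART_BLOCK=$((DISK_SECTOR - PART_START))
echo "partition-relative block: $PART_BLOCK"   # should match "logical block 260202"
```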

Generally I believe these should go away once overwritten, but how do I
overwrite them? The drive is an md RAID1 member:

/dev/md4:
         Version : 1.2
   Creation Time : Mon May 26 13:40:18 2014
      Raid Level : raid1
      Array Size : 1953379936 (1862.89 GiB 2000.26 GB)
   Used Dev Size : 1953379936 (1862.89 GiB 2000.26 GB)
    Raid Devices : 2
   Total Devices : 2
     Persistence : Superblock is persistent

   Intent Bitmap : Internal

     Update Time : Sun Feb  8 02:39:58 2015
           State : active
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            Name : natsu.romanrm.net:4  (local to host natsu.romanrm.net)
            UUID : 3b8c3166:073249b5:e1384bd6:4611df90
          Events : 50426

     Number   Major   Minor   RaidDevice State
        0       8       49        0      active sync   /dev/sdd1
        1       8      113        1      active sync   /dev/sdh1

I thought I would run a 'check' or 'repair': that should read from both drives,
fail to read from sdh, then overwrite the affected areas on sdh from the good
copy. But nope:

# echo 0 > /sys/block/md4/md/sync_min
# echo check > /sys/block/md4/md/sync_action

[ 4059.451036] md: data-check of RAID array md4
[ 4059.451040] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[ 4059.451042] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[ 4059.451046] md: using 128k window, over a total of 1953379936k.

This happily proceeds through the supposedly unreadable area:

md4 : active raid1 sdd1[0] sdh1[1]
       1953379936 blocks super 1.2 [2/2] [UU]
       [>....................]  check =  0.0% (1479680/1953379936) finish=1116.8min speed=29128K/sec
       bitmap: 2/8 pages [8KB], 131072KB chunk

It's at 1.5 GB already, while the unreadable sectors are at ~133 MB, and no new
ATA errors appear in dmesg. How is this possible?

If I retry the 'dd' command right now, it fails exactly in the same way as
before (and ATA errors do indeed appear).
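A narrower way to confirm the extent of the bad region is to aim dd at just the failing blocks, using the "logical block" numbers from the kernel log (512-byte blocks). A dry-run sketch; `iflag=direct` bypasses the page cache so the drive is actually hit:

```shell
# Target only the 6 reportedly-bad blocks of sdh1 (260202..260207 per the log).
BAD_START=260202   # first failing logical block
BAD_COUNT=6        # blocks 260202..260207 were reported
echo dd if=/dev/sdh1 bs=512 skip=$BAD_START count=$BAD_COUNT iflag=direct of=/dev/null
# (echo shown here as a dry run; drop the leading "echo" to actually issue the reads)
```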

Hi,

I had a similar situation. In my case the bad sectors fell in an unused control
area, part of the md header, which is normally not read (or written) by md, not
even during a sync.

The error did not show up during normal operation (or during scrub), only during the smartctl long test.
What triggered the error for you?

I looked up the sizes of the different parts of the RAID layout to arrive at
that conclusion. Dumping the sectors around the bad area also showed them to be
all zeroes.
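That "unused header area" theory can be checked numerically for the array above. With 1.2 metadata the data area begins at the Data Offset that `mdadm --examine /dev/sdh1` reports; sectors below it are never touched by a check/repair. The 262144-sector (128 MiB) value below is an assumed illustration of a common default on arrays this size, not taken from the thread, so substitute the real --examine value:

```shell
# Does the bad sector fall below the md data offset, i.e. in the header/bitmap
# gap that a resync never reads?
DATA_OFFSET=262144     # ASSUMED: read the real "Data Offset : N sectors" from mdadm --examine
BAD_SECTOR=260202      # partition-relative sector of the first bad block
if [ "$BAD_SECTOR" -lt "$DATA_OFFSET" ]; then
    echo "bad sector is in the metadata gap: a check/repair will never read it"
else
    echo "bad sector is inside the data area"
fi
```

With these numbers the bad region (~133 MB into the partition) sits just below a 128 MiB data offset, which would explain why the check sailed past it.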

I ended up directly zeroing the bad sectors (hdparm --repair-sector ...).
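For reference, a dry-run sketch of that repair loop over the six LBAs from the kernel log. `--repair-sector` is hdparm's alias for `--write-sector`, it destroys the sector's contents and requires the `--yes-i-know-what-i-am-doing` flag, so only run it after confirming the area is genuinely unused:

```shell
# Overwrite the bad sectors one at a time so the drive reallocates them.
# Sector numbers are whole-disk LBAs (262250..262255 per the kernel log).
FIRST=262250
LAST=262255
for lba in $(seq $FIRST $LAST); do
    echo hdparm --yes-i-know-what-i-am-doing --repair-sector $lba /dev/sdh
done
# (echo shown as a dry run; remove it to actually rewrite the sectors)
```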

YMMV

--
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



