Re: An oddity: UNC error while re-adding/resyncing

Michael Evans <mjevans1983@xxxxxxxxx> · Thu, 25 Mar 2010 20:50:41 -0700

On Thu, Mar 25, 2010 at 5:43 PM, John Robinson
<john.robinson@xxxxxxxxxxxxxxxx> wrote:
> I did `mdadm --add /dev/md1 /dev/sdd2` and got the following in my kernel
> log:
>
> Mar 25 23:56:21 beast kernel: md: bind<sdd2>
> Mar 25 23:56:21 beast kernel: RAID5 conf printout:
> Mar 25 23:56:21 beast kernel:  --- rd:3 wd:2 fd:1
> Mar 25 23:56:21 beast kernel:  disk 0, o:1, dev:sda2
> Mar 25 23:56:21 beast kernel:  disk 1, o:1, dev:sdb2
> Mar 25 23:56:21 beast kernel:  disk 2, o:1, dev:sdd2
> Mar 25 23:56:21 beast kernel: md: syncing RAID array md1
> Mar 25 23:56:21 beast kernel: md: minimum _guaranteed_ reconstruction speed:
> 1000 KB/sec/disc.
> Mar 25 23:56:21 beast kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
> Mar 25 23:56:21 beast kernel: md: using 128k window, over a total of
> 976655360 blocks.
> Mar 25 23:56:22 beast kernel: ata3.00: exception Emask 0x0 SAct 0x3 SErr 0x0
> action 0x0
> Mar 25 23:56:22 beast kernel: ata3.00: irq_stat 0x40000008
> Mar 25 23:56:22 beast kernel: ata3.00: cmd
> 60/00:00:a5:3f:03/04:00:00:00:00/40 tag 0 ncq 524288
> in
> Mar 25 23:56:25 beast kernel:          res 41/40:00:a0:41:03/8c:00:00:00:00/40 Emask 0x409 (media error) <F>
> Mar 25 23:56:25 beast kernel: ata3.00: status: { DRDY ERR }
> Mar 25 23:56:26 beast kernel: ata3.00: error: { UNC }
> Mar 25 23:56:26 beast kernel: ata3.00: configured for UDMA/133
> Mar 25 23:56:26 beast kernel: ata3: EH complete
> Mar 25 23:56:26 beast kernel: SCSI device sda: 1953525168 512-byte hdwr
> sectors (1000205 MB)
> Mar 25 23:56:26 beast kernel: sda: Write Protect is off
> Mar 25 23:56:27 beast kernel: SCSI device sda: drive cache: write back
> Mar 25 23:56:27 beast kernel: ata3.00: exception Emask 0x0 SAct 0x3 SErr 0x0
> action 0x0
> Mar 25 23:56:28 beast kernel: ata3.00: irq_stat 0x40000008
> Mar 25 23:56:28 beast kernel: ata3.00: cmd
> 60/00:08:a5:3f:03/04:00:00:00:00/40 tag 1 ncq 524288
> in
> Mar 25 23:56:28 beast kernel:          res 41/40:00:a2:41:03/8c:00:00:00:00/40 Emask 0x409 (media error) <F>
> Mar 25 23:56:28 beast kernel: ata3.00: status: { DRDY ERR }
> Mar 25 23:56:28 beast kernel: ata3.00: error: { UNC }
> Mar 25 23:56:29 beast kernel: ata3.00: configured for UDMA/133
> Mar 25 23:56:29 beast kernel: ata3: EH complete
> Mar 25 23:56:29 beast kernel: SCSI device sda: 1953525168 512-byte hdwr
> sectors (1000205 MB)
> Mar 25 23:56:29 beast kernel: sda: Write Protect is off
> Mar 25 23:56:29 beast kernel: SCSI device sda: drive cache: write back
> Mar 25 23:56:34 beast kernel: md: md1: sync done.
> Mar 25 23:56:34 beast kernel: RAID5 conf printout:
> Mar 25 23:56:34 beast kernel:  --- rd:3 wd:3 fd:0
> Mar 25 23:56:34 beast kernel:  disk 0, o:1, dev:sda2
> Mar 25 23:56:34 beast kernel:  disk 1, o:1, dev:sdb2
> Mar 25 23:56:34 beast kernel:  disk 2, o:1, dev:sdd2
>
> i.e. a brief whinge about another of the discs in the RAID, while doing the
> resync. And this is repeatable. Now, is this simply a sign that I need a new
> disc, or is there something else funny going on? It's not as if either of
> the discs (the one I was re-adding or the one that had the UNC during the
> resync) is getting dropped from the array. But the one with the UNC does
> have one offline uncorrectable and two current pending sectors, according to
> smartctl.
>
> NB CentOS 5, 2.6.18-128.4.1.el5 kernel, mdadm 2.6.4. Probably time I updated
> a few packages.
>
> Cheers,
>
> John.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Neil, I'm not sure whether this is good advice, since the data is the
same and so may be cached.

My first thought was to resync the device (to validate that the reads
are good), but scratch that: it's RAID 5 and doesn't know to assign
lesser trust to slower drives.  Instead I propose:

1) Unmount the filesystem in question (boot from a recovery CD or USB
drive, whatever).
2) Determine your DATA stripe size.  In this case it appears to be
either 128K per drive (so 256K per stripe) or 128K per stripe; I'm
not sure which.
3) badblocks -b $((256*1024)) -n /dev/whatever
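
For step 2, here is a rough sketch of the arithmetic, assuming RAID 5
where one chunk per stripe holds parity.  data_stripe_bytes is a made-up
helper name, and the 128K chunk size is only my guess from above, not
something read from the array:

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm or badblocks): data bytes per
# stripe for RAID 5, i.e. chunk size times the number of non-parity members.
#   $1 = chunk size in KiB, $2 = number of raid devices
data_stripe_bytes() {
    chunk_kib=$1
    raid_disks=$2
    echo $(( chunk_kib * 1024 * (raid_disks - 1) ))
}

# The real chunk size can be read with something like:
#   mdadm --detail /dev/md1 | grep 'Chunk Size'
# The 128 below is an assumption taken from this thread, not a measured value.
data_stripe_bytes 128 3
```

With three devices and a 128K chunk this gives the same number as
$((256*1024)) in step 3.  If the chunk size turns out to be different,
the -b value would need to change with it.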

-n is the non-destructive read-write mode, which should cause the entire
device contents to be read and safely rewritten to the drives.  That
should in turn cause any pending sectors to be replaced.

This is less optimal than just performing the desired operation on the
segment in question, but a LOT safer, since it takes effort to make a
mistake with these tools.