Re: raid1 issue after disk failure: both disks of the array are still active

Robin Hill <robin@xxxxxxxxxxxxxxx> · Thu, 13 Sep 2012 11:34:32 +0100

On Thu Sep 13, 2012 at 12:01:59PM +0200, Niccolò Belli wrote:

> Hi,
> I have a raid1 array with two disks, distro is Squeeze amd64. /dev/sda 
> is slowly dying, here is a snippet of "smartctl -a /dev/sda":
> 
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
> 
> The bad sector is in the second half-MB of the disk. Reading it with
> "dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1" produces errors
> in /var/log/syslog:
> 
> root@asterisk:~# dd if=/dev/sda1 of=/dev/null bs=524228 count=1 skip=1
> 0+1 records in
> 0+1 records out
> 430140 bytes (430 kB) copied, 11.7265 s, 36.7 kB/s
> 
<- snip dmesg output ->
> 
> *Why doesn't it fail the first hard disk of the array!!??*
> 
Has anything actually attempted to read from that part of the array?
Even if so, it may just have happened to read from the working disk
anyway. md can only detect the error when it tries to read/write that
sector of that disk.
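
To see whether md has noticed anything on a given member, you can read
the per-device attributes under sysfs. A minimal sketch, assuming the
array is md0 and the member is sda1 (adjust to your setup).
md_member_status is a hypothetical helper name, and the optional third
argument is only there so the function can be pointed at a different
sysfs root:

```shell
# Hypothetical helper: print md's view of one array member.
# Usage: md_member_status ARRAY MEMBER [SYSFS_ROOT]
md_member_status() {
    dev_dir="${3:-/sys/block}/$1/md/dev-$2"
    if [ ! -d "$dev_dir" ]; then
        echo "no such member: $1/$2" >&2
        return 1
    fi
    # state: md's current view of the member, e.g. "in_sync" or "faulty"
    echo "state: $(cat "$dev_dir/state")"
    # errors: approximate count of read errors md has seen on this member
    # without evicting it from the array
    echo "errors: $(cat "$dev_dir/errors")"
}
```

For example, "md_member_status md0 sda1". If the error count is still 0,
md has most likely never hit the bad sector on that disk.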

Your best bet now is to do an array check:
    echo check > /sys/block/md0/md/sync_action

This will force a read of all disks in the array. This should trigger
the read error, causing md to re-write the faulty block from the good
disk, which in turn should cause the drive to remap the bad sector
(assuming the write to the original sector fails). This should also be
scheduled to run regularly for all arrays, in order to pick up this sort
of issue before it causes major problems during a rebuild.
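
If you want to wrap the check in a script, something like the following
would do as a starting point. It is only a sketch: md_start_check and
md_check_status are hypothetical helper names, md0 is assumed as the
array name, and the optional second argument is only there so the
functions can be pointed at a different sysfs root:

```shell
# Hypothetical helper: start a check on one md array.
# Usage: md_start_check ARRAY [SYSFS_ROOT]
md_start_check() {
    md_dir="${2:-/sys/block}/$1/md"
    if [ ! -w "$md_dir/sync_action" ]; then
        echo "cannot write $md_dir/sync_action" >&2
        return 1
    fi
    echo check > "$md_dir/sync_action"
}

# Hypothetical helper: report the current action and mismatch count.
# Usage: md_check_status ARRAY [SYSFS_ROOT]
md_check_status() {
    md_dir="${2:-/sys/block}/$1/md"
    # sync_action reads "check" while the check runs, "idle" once it is done
    echo "sync_action: $(cat "$md_dir/sync_action")"
    # mismatch_cnt: count of inconsistencies found by the last check
    echo "mismatch_cnt: $(cat "$md_dir/mismatch_cnt")"
}
```

Run "md_start_check md0" as root, then "md_check_status md0" (or
"cat /proc/mdstat") to follow progress. On Debian the mdadm package
normally ships a monthly cron job for this in /etc/cron.d/mdadm, so it
is worth checking that it is present and enabled before adding your own.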

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |