Re: Lose two disks during Raid 10 rebuild

NeilBrown <neilb@xxxxxxx> · Fri, 24 Aug 2012 07:07:18 +1000

On Thu, 23 Aug 2012 19:28:27 +0000 Steven La <Steven.La@xxxxxxxxxxxx> wrote:

> Hello all,
> 
> Got the following messages from syslog during Raid 10 rebuild cycle.
> 
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Unhandled sense code
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Sense Key : Medium Error
> [current]

"Medium Error" normally means that the recording medium (magnetic regions) is
corrupt in some way and a valid data block cannot be extracted.

> Aug  3 01:48:11 oak-sh283 kernel: Info fld=0x3ae0f43c
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Add. Sense: Unrecovered
> read error
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 3a e0
> f3 ab 00 01 00 00
> Aug  3 01:48:11 oak-sh283 kernel: end_request: I/O error, dev sda, sector
> 987821116
> Aug  3 01:48:11 oak-sh283 kernel: md/raid10:md7: Disk failure on sda8,
> disabling device.
> Aug  3 01:48:11 oak-sh283 kernel: md/raid10:md7: Operation continuing on 2
> devices.
> Aug  3 01:48:11 oak-sh283 kernel: md: md7: recovery done.
> Aug  3 01:48:11 oak-sh283 kernel: md/raid10:md7: Disk failure on sdc8,
> disabling device.

Presumably md7 was trying to recover sdc8 from sda8.  It got a read error on
sda8, so it could not finish recovering sdc8, and both devices were marked as
failed.
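
To see why losing sda8 and sdc8 together is fatal, here is a rough sketch of
how the md "near" layout places copies. It assumes the usual rule that the
copies of a chunk sit on consecutive device slots, and it assumes md7 used the
same slot order as the healthy arrays in the mdstat output quoted further down
(sda[0] sdc[1] sde[2] sdg[3]). The function name is made up for illustration.

```python
# Sketch of md raid10 "near" chunk placement; names are illustrative only.
def near_copy_slots(chunk, raid_disks=4, near_copies=2):
    """Return the device slots that hold the copies of one chunk."""
    first = (chunk * near_copies) % raid_disks
    return [(first + j) % raid_disks for j in range(near_copies)]

# Assumed slot order, copied from the healthy arrays in /proc/mdstat.
slots = ["sda8", "sdc8", "sde8", "sdg8"]
failed = {"sda8", "sdc8"}

for chunk in range(4):
    holders = [slots[s] for s in near_copy_slots(chunk)]
    alive = [d for d in holders if d not in failed]
    print(chunk, holders, "ok" if alive else "lost")
```

Under those assumptions every even-numbered chunk lives only on slots 0 and 1,
which is the pair shown as missing in [__UU].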

> Aug  3 01:48:11 oak-sh283 kernel: md/raid10:md7: Operation continuing on 2
> devices.
> Aug  3 01:48:14 oak-sh283 kernel: md: unbind<sdc8>
> Aug  3 01:48:14 oak-sh283 kernel: md: export_rdev(sdc8)
> Aug  3 01:48:14 oak-sh283 kernel: md: unbind<sda8>
> Aug  3 01:48:14 oak-sh283 kernel: md: export_rdev(sda8)
> Aug  3 01:48:16 oak-sh283 raid_rebuild: Sending sighup to hald[22152] for event
> RebuildFinished for /dev/md7
> 
> 
> [admin@oak-sh283 ~]# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10]
> md5 : active raid10 sdc9[1] sde9[2] sdg9[3] sda9[0]
>       562997760 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md7 : active raid10 sde8[2] sdg8[3]
>       562997760 blocks 64K chunks 2 near-copies [4/2] [__UU]
> 
> md6 : active raid10 sdc7[1] sde7[2] sdg7[3] sda7[0]
>       562997760 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md3 : active raid10 sdc6[1] sde6[2] sdg6[3] sda6[0]
>       52435968 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md0 : active raid10 sdc2[1] sde2[2] sdg2[3] sda2[0]
>       10490240 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md4 : active raid10 sdb3[0] sdh3[3] sdf3[2] sdd3[1]
>       19518720 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md2 : active raid10 sdc3[1] sde3[2] sdg3[3] sda3[0]
>       67119360 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> md1 : active raid10 sdc5[1] sde5[2] sdg5[3] sda5[0]
>       134222848 blocks 64K chunks 2 near-copies [4/4] [UUUU]
> 
> From the error message below (also shown above), the block that cannot be read from sda
> has lba=0x3ae0f3ab.
> 
> Aug  3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 3a e0
> f3 ab 00 01 00 00
> 
> [admin@oak-sh283 ~]# fdisk -s /dev/sda
> 976762584

This number is in kilobytes. 1 TB.
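
As a quick check of the units, here is the conversion written out. It assumes
`fdisk -s` counts 1024-byte blocks and that the drive uses 512-byte sectors;
the variable names are only for illustration.

```python
# Convert the `fdisk -s` figure (assumed 1024-byte blocks) into bytes and
# into 512-byte sectors, then compare with the sector from the kernel log.
size_kib = 976762584
size_bytes = size_kib * 1024
total_sectors = size_bytes // 512

failing_sector = 987821116  # from the end_request line in the quoted syslog

print(size_bytes)
print(total_sectors)
print(failing_sector < total_sectors)
```

The device holds about twice as many sectors as the `fdisk -s` figure
suggests, so the failing sector is inside the drive.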

> 
> 
> 
> The last block on the drive is 0x3a3836d8

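The LBA is a sector number (512-byte units), while the fdisk figure above is
in kilobytes.  976762584 sectors is only 500102443008 bytes into the device,
about half way, so the failing LBA is well within the drive.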
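
The figures in the log can be cross-checked against the quoted Read(10) CDB.
This sketch assumes the standard 10-byte layout (opcode, flags, 4-byte
big-endian LBA, group number, 2-byte transfer length, control); the variable
names are mine, not the kernel's.

```python
# Decode the Read(10) CDB quoted in the syslog (layout assumed as described).
cdb = bytes.fromhex("28 00 3a e0 f3 ab 00 01 00 00")

opcode = cdb[0]                            # 0x28 is READ(10)
start_lba = int.from_bytes(cdb[2:6], "big")
length = int.from_bytes(cdb[7:9], "big")   # number of blocks requested

info_fld = 0x3ae0f43c                      # "Info fld" from the sense data

print(hex(opcode), start_lba, length)
print(info_fld, info_fld - start_lba)
```

The Info field lands inside the requested range and matches the sector in the
end_request line, so 0x3ae0f3ab is the start of the request and 0x3ae0f43c is
the sector that actually failed.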

You can probably correct the bad sector (987821116 according to the
end_request line) by
 dd if=/dev/zero of=/dev/sda seek=987821116 count=1 oflag=direct

I would try to read from the address first to make sure it is in error:

 dd of=/dev/null if=/dev/sda skip=987821116 count=1 iflag=direct

Then read the entire device to ensure there are no other media errors.
Then stop the array and re-assemble with --force.
Then try the recovery again.
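
The "read the entire device" step could be sketched roughly as below. This is
only an illustration: the function name is hypothetical, it assumes 512-byte
sectors, it needs root to open a real disk, and it reads through the page
cache rather than using direct I/O as the dd commands above do.

```python
import os

def find_unreadable(path, chunk_bytes=1 << 20, sector=512):
    """Read path sequentially and return the starting sector of every
    chunk whose read fails with an I/O error."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_bytes, size - offset)
            try:
                os.pread(fd, want, offset)
            except OSError:
                # Record where the failing chunk starts and keep going.
                bad.append(offset // sector)
            offset += want
    finally:
        os.close(fd)
    return bad

# Example (as root): print(find_unreadable("/dev/sda"))
```

A failed chunk only narrows the error down to chunk_bytes; re-reading that
range one sector at a time would pin down the exact sectors.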

NeilBrown

> 
> 
> 
> (gdb) p/x 976762584
> $1 = 0x3a3836d8
> (gdb) p 0x3ae0f3ab
> $2 = 987820971
> 
> So, it seems like the lba number used in the Read(10) command has exceeded the last block of the drive.
> Has anyone had this problem before? What else can I look at?
> 
> Relevant info is shown below:
> 
> [admin@oak-sh283 ~]# mdadm -V
> mdadm - v2.6.4 - 19th October 2007
> 
> [admin@oak-sh283 ~]# uname -a
> Linux oak-sh283 2.6.32 #1 SMP Wed Aug 1 01:38:35 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux
> 
> Thanks and regards,
> --Steven
> 
> 
> 
> 
> 
