Bad drive discovered during raid5 reshape

Hi,
I bought two new hard drives today to expand my RAID array, and
unfortunately one of them appears to be bad. The problem didn't show up
until after I attempted to grow the array from 6 to 8 drives. I added
both new drives with mdadm --add (each completed), then ran mdadm
--grow to expand to 8 raid-devices. The reshape passed the critical
section and began the grow process.
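This is the exact sequence I ran, in order:

# mdadm --add /dev/md1 /dev/sdb1
# mdadm --add /dev/md1 /dev/sdc1
# mdadm --grow /dev/md1 --raid-devices=8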

After a few minutes I started to hear unusual sounds from inside the
case. Fearing the worst, I tried cat /proc/mdstat, which produced no
output, so I checked dmesg, which showed that /dev/sdb1 was not working
correctly. After several minutes dmesg indicated that md had given up
and the grow process had stopped. After googling around I tried the
solutions that seemed most likely to work: I removed the new drives
with mdadm --remove --force /dev/md1 /dev/sd[bc]1, rebooted, and then
ran mdadm -Af /dev/md1. The grow process restarted, then failed almost
immediately. Trying to mount the array gives me a reiserfs replay
failure and a suggestion to run fsck. I don't dare fsck the array since
I've already messed it up so badly. Is there any way to go back to the
original working 6-disk configuration with minimal data loss? Here's
where I'm at right now; please let me know if I need to include any
additional information.
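If it would help, I can also post the superblock state of each member
device; I'm assuming something like this is the right way to collect
it (covering every partition that has ever been in the array):

# mdadm --examine /dev/hdb1 /dev/sd[a-g]1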

# uname -a
Linux nas 2.6.22-gentoo-r5 #1 SMP Thu Aug 23 16:59:47 MDT 2007 x86_64 AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD GNU/Linux

# mdadm --version
mdadm - v2.6.2 - 21st May 2007

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 hdb1[0] sdb1[8](F) sda1[5] sdf1[4] sde1[3] sdg1[2] sdd1[1]
      1220979520 blocks super 0.91 level 5, 64k chunk, algorithm 2 [8/6] [UUUUUU__]

unused devices: <none>

# mdadm --detail --verbose /dev/md1
/dev/md1:
        Version : 00.91.03
  Creation Time : Sun Apr  8 19:48:01 2007
     Raid Level : raid5
     Array Size : 1220979520 (1164.42 GiB 1250.28 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 8
  Total Devices : 7
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 29 00:53:21 2007
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Delta Devices : 2, (6->8)

           UUID : 56e7724e:9a5d0949:ff52889f:ac229049
         Events : 0.487460

    Number   Major   Minor   RaidDevice State
       0       3       65        0      active sync   /dev/hdb1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8        1        5      active sync   /dev/sda1
       6       0        0        6      removed
       8       8       17        7      faulty spare rebuilding   /dev/sdb1

# dmesg
<snip>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: kicking non-fresh sdc1 from array!
md: unbind<sdc1>
md: export_rdev(sdc1)
raid5: reshape will continue
raid5: device hdb1 operational as raid disk 0
raid5: device sdb1 operational as raid disk 7
raid5: device sda1 operational as raid disk 5
raid5: device sdf1 operational as raid disk 4
raid5: device sde1 operational as raid disk 3
raid5: device sdg1 operational as raid disk 2
raid5: device sdd1 operational as raid disk 1
raid5: allocated 8462kB for md1
raid5: raid level 5 set md1 active with 7 out of 8 devices, algorithm 2
RAID5 conf printout:
 --- rd:8 wd:7
 disk 0, o:1, dev:hdb1
 disk 1, o:1, dev:sdd1
 disk 2, o:1, dev:sdg1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1
 disk 5, o:1, dev:sda1
 disk 7, o:1, dev:sdb1
...ok start reshape thread
md: reshape of RAID array md1
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
md: using 128k window, over a total of 244195904 blocks.
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd 35/00:00:3f:42:02/00:04:00:00:00/e0 tag 0 cdb 0x0 data 524288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2: port is slow to respond, please be patient (Status 0xd8)
ata2: device not ready (errno=-16), forcing hardreset
ata2: hard resetting port
<repeats 4 more times>
ata2: reset failed, giving up
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 148031
raid5: Disk failure on sdb1, disabling device. Operation continuing on 6 devices
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 149055
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 149439
md: md1: reshape done.