RAID5 - Disk failed during re-shape

Hi All, 

Hoping you can help recover my data!

I have (had?) a software RAID 5 volume, created on Ubuntu 10.04 a few years
back, consisting of 4 x 1500GB drives.  It was running great until the
motherboard died last week.  I purchased a new motherboard, CPU & RAM,
installed Ubuntu 12.04, got everything assembled fine, and it ran happily
for around 48 hours.

After that I added a 2000GB drive to increase capacity and ran mdadm --add
/dev/md0 /dev/sdf (the full command sequence is sketched after the log
below).  The reshape started to run, and at around 11.4% I saw that the
server had logged some errors:
Aug  8 22:17:41 nas kernel: [ 5927.453434] Buffer I/O error on device md0,
logical block 715013760
Aug  8 22:17:41 nas kernel: [ 5927.453439] EXT4-fs warning (device md0):
ext4_end_bio:251: I/O error writing to inode 224003641 (offset 157810688
size 4096 starting block 715013760)
Aug  8 22:17:41 nas kernel: [ 5927.453448] JBD2: Detected IO errors while
flushing file data on md0-8
Aug  8 22:17:41 nas kernel: [ 5927.453467] Aborting journal on device md0-8.
Aug  8 22:17:41 nas kernel: [ 5927.453642] Buffer I/O error on device md0,
logical block 548962304
Aug  8 22:17:41 nas kernel: [ 5927.453643] lost page write due to I/O error
on md0
Aug  8 22:17:41 nas kernel: [ 5927.453656] JBD2: I/O error detected when
updating journal superblock for md0-8.
Aug  8 22:17:41 nas kernel: [ 5927.453688] Buffer I/O error on device md0,
logical block 0
Aug  8 22:17:41 nas kernel: [ 5927.453690] lost page write due to I/O error
on md0
Aug  8 22:17:41 nas kernel: [ 5927.453697] EXT4-fs error (device md0):
ext4_journal_start_sb:327: Detected aborted journal
Aug  8 22:17:41 nas kernel: [ 5927.453700] EXT4-fs (md0): Remounting
filesystem read-only
Aug  8 22:17:41 nas kernel: [ 5927.453703] EXT4-fs (md0): previous I/O error
to superblock detected
Aug  8 22:17:41 nas kernel: [ 5927.453826] Buffer I/O error on device md0,
logical block 715013760
Aug  8 22:17:41 nas kernel: [ 5927.453828] lost page write due to I/O error
on md0
Aug  8 22:17:41 nas kernel: [ 5927.453842] JBD2: Detected IO errors while
flushing file data on md0-8
Aug  8 22:17:41 nas kernel: [ 5927.453848] Buffer I/O error on device md0,
logical block 0
Aug  8 22:17:41 nas kernel: [ 5927.453850] lost page write due to I/O error
on md0
Aug  8 22:20:54 nas kernel: [ 6120.964129] INFO: task md0_reshape:297
blocked for more than 120 seconds.
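
For reference, mdadm --add on its own only attaches the new disk as a
spare; the reshape itself is kicked off by a grow, so I believe the full
sequence I ran was along these lines (exact flags reconstructed from
memory):

# attach the new 2000GB disk as a spare
mdadm --add /dev/md0 /dev/sdf
# grow the array onto 5 devices - this is the step that actually starts
# the reshape (assumed; reconstructing from memory)
mdadm --grow /dev/md0 --raid-devices=5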

On checking the progress in /proc/mdstat, I found that two drives were
listed as failed (__UUU), and the estimated finish time was growing by
hundreds of minutes at a time.
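
For completeness, I was watching the progress with nothing fancier than:

cat /proc/mdstat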

I was able to browse some data on the RAID set (including my home folder),
but couldn't browse some other sections - the shell simply hung when I
tried to issue "ls /raidmount".  I tried to add one of the failed disks
back in, but got the response that there was no superblock on it.  I
rebooted at that point.
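
From memory, the re-add attempt was something like this (exact device name
uncertain):

mdadm /dev/md0 --re-add /dev/sde   # device name assumed, from memory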

During boot I was given the option to manually recover or to skip
mounting; I chose the latter.

Now that the system is running, I have tried to assemble the array, but it
keeps failing.  I have tried:
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
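
I assume any half-assembled remnant has to be stopped before each retry,
i.e. something like:

mdadm --stop /dev/md0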

I am able to see all the drives, but the UUID has been zeroed and the Raid
Level states -unknown-, as below... does this mean the data can't be
recovered?

root@nas:/var/log$ mdadm --examine /dev/sd[b-f]
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
Active Devices : 0
Working Devices : 5
Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e863 - correct
         Events : 1


      Number   Major   Minor   RaidDevice State
this     0       8       16        0      spare   /dev/sdb

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
Active Devices : 0
Working Devices : 5
Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e87b - correct
         Events : 1


      Number   Major   Minor   RaidDevice State
this     4       8       32        4      spare   /dev/sdc

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdd:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
Active Devices : 0
Working Devices : 5
Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e885 - correct
         Events : 1


      Number   Major   Minor   RaidDevice State
this     1       8       48        1      spare   /dev/sdd

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sde:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
Active Devices : 0
Working Devices : 5
Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e899 - correct
         Events : 1


      Number   Major   Minor   RaidDevice State
this     3       8       64        3      spare   /dev/sde

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdf:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
Active Devices : 0
Working Devices : 5
Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e8a7 - correct
         Events : 1


      Number   Major   Minor   RaidDevice State
this     2       8       80        2      spare   /dev/sdf

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc

According to syslog, the only drive failure I had was /dev/sde, but I
guess the reshape has caused things to go awry:
syslog.1:Aug  8 22:17:41 nas mdadm[1029]: Fail event detected on md device
/dev/md0, component device /dev/sde
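
I found that by grepping the rotated logs, e.g.:

grep 'Fail event' /var/log/syslog*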

I tried removing the /etc/mdadm/mdadm.conf file and re-running the scan,
which gave:
root@nas:/var/log$ sudo mdadm --assemble --scan -f -vv
mdadm: looking for devices for further assembly
mdadm: cannot open device /dev/sda5: Device or resource busy
mdadm: no RAID superblock on /dev/sda2
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sdf is identified as a member of /dev/md/0_0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md/0_0, slot 3.
mdadm: /dev/sdd is identified as a member of /dev/md/0_0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md/0_0, slot 4.
mdadm: /dev/sdb is identified as a member of /dev/md/0_0, slot 0.
mdadm: failed to add /dev/sdd to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdf to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sde to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdc to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdb to /dev/md/0_0: Invalid argument
mdadm: /dev/md/0_0 assembled from -1 drives and 1 spare - not enough to
start the array.
mdadm: looking for devices for further assembly
mdadm: No arrays found in config file or automatically

I guess the 'invalid argument' comes from the -unknown- raid level, but
it's only a guess.
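
Next time I run the assemble I can check whether the kernel logs a more
specific reason for the 'Invalid argument', e.g.:

mdadm --assemble --force /dev/md0 /dev/sd[b-f]
dmesg | tail -n 30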

I'm at the limit of my knowledge - I would appreciate some expert
assistance in recovering this array, if it's possible!

Many thanks, 
Sam

