Re: RAID5 - Disk failed during re-shape

Hi Sam,

On 08/09/2012 04:38 AM, Sam Clark wrote:
> Hi All, 
> 
> Hoping you can help recover my data!
> 
> I have (had?) a software RAID 5 volume, created on Ubuntu 10.04 a few years
> back consisting of 4 x 1500GB drives.  Was running great until the
> motherboard died last week.   Purchased new motherboard, CPU & RAM,
> installed Ubuntu 12.04, and got everything assembled fine, and working for
> around 48 hours.  

Uh-oh.  Stock 12.04 has a buggy kernel.  See here:
http://neil.brown.name/blog/20120615073245
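
A quick way to check whether the box is still on one of the affected kernels
(assuming a stock Ubuntu install) is:

    uname -r                            # running kernel version
    dpkg -l "linux-image-$(uname -r)"   # exact package revision, to compare
                                        # against the versions in the blog post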

> After that I added a 2000GB drive to increase capacity, and ran mdadm --add
> /dev/md0 /dev/sdf.  The reconfiguration started to run, and at around
> 11.4% of the reshape I saw that the server had some errors:

And you reshaped and got media errors ...

> Aug  8 22:17:41 nas kernel: [ 5927.453434] Buffer I/O error on device md0,
> logical block 715013760
> Aug  8 22:17:41 nas kernel: [ 5927.453439] EXT4-fs warning (device md0):
> ext4_end_bio:251: I/O error writing to inode 224003641 (offset 157810688
> size 4096 starting block 715013760)
> Aug  8 22:17:41 nas kernel: [ 5927.453448] JBD2: Detected IO errors while
> flushing file data on md0-8
> Aug  8 22:17:41 nas kernel: [ 5927.453467] Aborting journal on device md0-8.
> Aug  8 22:17:41 nas kernel: [ 5927.453642] Buffer I/O error on device md0,
> logical block 548962304
> Aug  8 22:17:41 nas kernel: [ 5927.453643] lost page write due to I/O error
> on md0
> Aug  8 22:17:41 nas kernel: [ 5927.453656] JBD2: I/O error detected when
> updating journal superblock for md0-8.
> Aug  8 22:17:41 nas kernel: [ 5927.453688] Buffer I/O error on device md0,
> logical block 0
> Aug  8 22:17:41 nas kernel: [ 5927.453690] lost page write due to I/O error
> on md0
> Aug  8 22:17:41 nas kernel: [ 5927.453697] EXT4-fs error (device md0):
> ext4_journal_start_sb:327: Detected aborted journal
> Aug  8 22:17:41 nas kernel: [ 5927.453700] EXT4-fs (md0): Remounting
> filesystem read-only
> Aug  8 22:17:41 nas kernel: [ 5927.453703] EXT4-fs (md0): previous I/O error
> to superblock detected
> Aug  8 22:17:41 nas kernel: [ 5927.453826] Buffer I/O error on device md0,
> logical block 715013760
> Aug  8 22:17:41 nas kernel: [ 5927.453828] lost page write due to I/O error
> on md0
> Aug  8 22:17:41 nas kernel: [ 5927.453842] JBD2: Detected IO errors while
> flushing file data on md0-8
> Aug  8 22:17:41 nas kernel: [ 5927.453848] Buffer I/O error on device md0,
> logical block 0
> Aug  8 22:17:41 nas kernel: [ 5927.453850] lost page write due to I/O error
> on md0
> Aug  8 22:20:54 nas kernel: [ 6120.964129] INFO: task md0_reshape:297
> blocked for more than 120 seconds.
> 
> On checking the progress of /proc/mdstat, I found that 2 drives were listed
> as failed (__UUU), and the finish time was simply growing by hundreds of
> minutes at a time.
> 
> I was able to browse some data on the Raid set (incl my Home folder), but
> couldn't browse some other sections - shell simply hung when I tried to
> issue "ls /raidmount".  I tied to add one of the failed disks back in, but
> got the response that there was no superblock on it.  rebooted it at that
> time.

Poof.  The bug wiped your active device's metadata.
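
That is easy to confirm - mdadm --examine only reads, so it is safe to run on
every member (device names below are copied from your assemble attempt):

    for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
        echo "== $d"
        mdadm --examine "$d"
    done

Members whose superblock survived will dump their full metadata; the wiped
ones just report that no md superblock was found.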

> During boot I was given the option to manually recover, or skip mounting - I
> chose the latter. 

Good instincts, but probably not any help.

> Now that the system is running, I tried to assemble, but it keeps failing.
> I have tried:
> mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
> /dev/sdf
> 
> I am able to see all the drives, but can see the UUID is incorrect and the
> Raid Level states -unknown-, as below... does this mean the data can't be
> recovered?  

If you weren't in the middle of a reshape, you could recover using the
instructions in the blog entry above.

[trim /]

> I guess the 'invalid argument' is the -unknown- in the raid level.. but it's
> only a guess. 
> 
> I'm at the extent of my knowledge - would appreciate some expert assistance
> in recovering this array, if it's possible!

I think you are toast, as I saw nothing in the metadata that would give
you a precise reshape restart position, even if you got Neil to work up
a custom mdadm that could use it.  The 11.4% could be converted into an
approximate restart position, perhaps.
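
Back of the envelope, assuming the component size is roughly the full 1500 GB
(the real metadata would pin it down):

    # crude conversion: 11.4% of ~1500 GB per member, in 512-byte sectors
    echo $(( 1500 * 1000000000 / 512 * 114 / 1000 ))
    # => 333984375, i.e. roughly 334 million sectors (~171 GB) into each member

so the best you could hope for is a landing zone, not an exact checkpoint.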

Neil, is there any way to do some combination of "create --assume-clean",
start a reshape held at zero, and then skip ahead to 11.4%?

Phil