Hi Brian and Stefan,
Thanks for your reply. I checked the status of the array after the rebuild (and before the reset).
md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUUUUU_]
However, since I have never had any problems with mdadm rebuilds before, I did not think to check the data before rebooting. Note that the array is still in this state. Before the reboot I tried to run a smartctl check on the failed drives, but it could not read them. When I rebooted I did not actually replace any drives; I just power cycled to see if I could re-access the drives that were thrown out of the array. According to smartctl they are completely fine.
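For anyone reading the mdstat output above: in `[7/6] [UUUUUU_]` the first number is how many devices the array expects, the second is how many are currently active, and each `_` marks a slot with no working device. A minimal sketch of reading that status line mechanically (the function name and the returned dictionary keys are my own, purely illustrative):

```python
import re

def parse_md_status(line):
    """Extract the device counts and per-slot flags from an mdstat status line."""
    m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if m is None:
        raise ValueError("no [n/m] [U_...] status found in line")
    expected = int(m.group(1))   # devices the array is configured for
    active = int(m.group(2))     # devices currently in use
    flags = m.group(3)           # 'U' = up, '_' = missing/failed
    missing_slots = [i for i, f in enumerate(flags) if f == "_"]
    return {
        "expected": expected,
        "active": active,
        "degraded_by": expected - active,
        "missing_slots": missing_slots,
    }

status = parse_md_status(
    "14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUUUUU_]"
)
print(status)
```

For the line above this reports 7 expected devices, 6 active, and slot 6 missing, i.e. the RAID6 is running degraded by one device.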
I guess there is no way I can re-add the old drives and remove the newly synced drive? Even though I immediately kicked all users off the system when I got the mdadm alert, it's possible a small amount of data was written to the array during the resync.
It looks like the filesystem was not unmounted properly before reboot:
Jan 06 09:11:54 server systemd[1]: Failed unmounting /export/data.
Jan 06 09:11:54 server systemd[1]: Shutting down.
Here are the mount errors in the log after rebooting:
Jan 06 09:15:17 server kernel: XFS (md0): Mounting Filesystem
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: block 0x400 ("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 09:15:17 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
Jan 06 09:15:17 server kernel: XFS (md0): failed to read root inode
xfs_repair -n -L also complains about a bad magic number.
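As a side note on the log lines above: error 117 is EUCLEAN ("Structure needs cleaning") on Linux, which XFS uses as its EFSCORRUPTED code, so these are corruption reports rather than device read failures. A small sketch for pulling the block number and error code out of such a line; the regular expression, the function name and the lookup table are my own illustration, not part of any XFS tool:

```python
import re

# Linux errno value; XFS defines EFSCORRUPTED as EUCLEAN.
XFS_ERRNO_NAMES = {117: "EUCLEAN"}

LINE_RE = re.compile(
    r'metadata I/O error: block (0x[0-9a-fA-F]+) \("([^"]+)"\) '
    r"error (\d+) numblks (\d+)"
)

def parse_xfs_io_error(line):
    """Return the fields of an XFS 'metadata I/O error' line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    errno_value = int(m.group(3))
    return {
        "block": int(m.group(1), 16),   # block number as printed by the kernel
        "caller": m.group(2),           # function that reported the error
        "errno": errno_value,
        "errno_name": XFS_ERRNO_NAMES.get(errno_value, "unknown"),
        "numblks": int(m.group(4)),
    }

info = parse_xfs_io_error(
    'Jan 06 09:15:17 server kernel: XFS (md0): metadata I/O error: '
    'block 0x400 ("xfs_trans_read_buf_map") error 117 numblks 16'
)
print(info)
```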
Unfortunately, this 15TB RAID was part of a 45TB GlusterFS distributed volume. It was only ever meant to be a scratch drive for intermediate scientific results; inevitably, however, most users used it to store lots of data. Oh well.
Thanks again,
Dave
On 6 January 2015 at 23:47, Brian Foster <bfoster@xxxxxxxxxx> wrote:
On Tue, Jan 06, 2015 at 05:12:14PM +1100, David Raffelt wrote:
> Hi again,
> Some more information.... the kernel log show the following errors were
> occurring after the RAID recovery, but before I reset the server.
>
By after the raid recovery, you mean after the two drives had failed out
and 1 hot spare was activated and resync completed? It certainly seems
like something went wrong in this process. The output below looks like
it's failing to read in some inodes. Is there any stack trace output
that accompanies these error messages to confirm?
I suppose I would try to verify that the array configuration looks sane,
but after the hot spare resync and then one or two other drive
replacements (was the hot spare ultimately replaced?), it's hard to say
whether it might be recoverable.
Brian
> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
> run xfs_repair
> Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
> 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
> Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
> xfs_trans_read_buf() returned error 117.
>
>
> Thanks,
> Dave
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
David Raffelt (PhD)
Postdoctoral Fellow
The Florey Institute of Neuroscience and Mental Health
Melbourne Brain Centre - Austin Campus
245 Burgundy Street
Heidelberg Vic 3084
Ph: +61 3 9035 7024