To me, things do not look good for a quick fix. It kinda looks like you
killed it. Any info about the details of how things died, and exactly
what you did after things started going south? What are you using for a
controller? It sounds like it is ready for the dump. Any messages from
the controller itself?
b-
Kyler Laird wrote:
Recently a drive failed on one of our file servers. The machine has
three RAID6 arrays (15 1TB drives each, plus spares). I let the spare rebuild
and then started the process of replacing the drive.
Unfortunately I'd misplaced the list of drive IDs so I generated a new
list in order to identify the failed drive. I used "smartctl" and made
a quick script to scan all 48 drives and generate pretty output. That
was a mistake. After running it a couple times one of the controllers
failed and several disks in the first array were failed.
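For reference, a drive-ID list of the kind described above can be built
from saved "smartctl -i" output. The following is only a minimal sketch,
not the script that was actually run: the function names are hypothetical,
the serial number in the usage note is made up, and it assumes the usual
"Serial Number:" line in the output. It only parses text that has already
been collected and does nothing about whatever upset the controller.

```python
import re

# Matches "Serial Number:" (ATA) and "Serial number:" (SCSI) lines.
SERIAL_RE = re.compile(r"^Serial number:\s*(\S+)", re.IGNORECASE | re.MULTILINE)

def serial_from_info(info_text):
    """Return the serial number found in one `smartctl -i` dump, or None."""
    match = SERIAL_RE.search(info_text)
    return match.group(1) if match else None

def build_drive_list(info_by_device):
    """Map each device name to its serial number, in device-name order."""
    return {dev: serial_from_info(text)
            for dev, text in sorted(info_by_device.items())}
```

Calling build_drive_list({"sda": "Serial Number:    EXAMPLE123\n"}) gives
a dict with one entry for sda; a device whose output lacks the line maps
to None.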
I worked on the machine for a while. (It has an NFS root.) I got some
information from it before it rebooted (via watchdog). I've dumped all
of the information here.
http://lairds.us/temp/ucmeng_md/
In mdstat_0 you can see the status of the arrays right after the
controller failure. mdstat_1 shows the status after reboot.
sys_block shows a listing of the block devices so you can see that the
problem drives are on controller 1.
The examine_sd?1 files show -E output from each drive in md0. Note that
the Events count is different for the drives on the problem controller.
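The Events comparison can be done mechanically. Below is a rough sketch
that assumes each examine_sd?1 file holds the plain text of "mdadm -E" for
one drive and contains an "Events : <value>" line. The function names are
hypothetical, and the value is kept as a string because its exact format
depends on the superblock and mdadm version.

```python
import re
from collections import defaultdict

# Matches the "Events : <value>" line in `mdadm -E` output.
EVENTS_RE = re.compile(r"^\s*Events\s*:\s*(\S+)", re.MULTILINE)

def events_count(examine_text):
    """Return the Events value from one `mdadm -E` dump, or None if absent."""
    match = EVENTS_RE.search(examine_text)
    return match.group(1) if match else None

def group_by_events(dumps):
    """Map each Events value to the sorted list of devices reporting it."""
    groups = defaultdict(list)
    for dev, text in dumps.items():
        groups[events_count(text)].append(dev)
    return {events: sorted(devs) for events, devs in groups.items()}
```

Given a dict of device name to file contents, drives that dropped out
together should end up in the same group, separate from the drives that
kept running.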
I'd like to know if this is something I can recover. I do have backups
but it's a huge pain to recover this much data.
Thank you.
--kyler
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html