Hi, I'm writing to tell you about an experience I recently had with my
RAID 5 array. No data was lost, but the way I had to recover my data
seemed a bit overly complicated.
Here's what happened:
1) 6-drive RAID 5 array
2) One drive failed.
3) I added a spare drive to the array and the resync process started.
4) The resync process bugged out at 90% or so because one of the other
drives had developed a read error (since the array is big, such things
easily go unnoticed).
5) The RAID array dropped the drive with read errors and was left in a
broken state; the new drive was not fully resynced yet...
How I recovered it:
6) Re-added the drive with read errors (I had to stop the array and
re-assemble it with the --force option; it can't be done directly; see
the sketch after this list).
7) Copied everything on the array to some other array.
8) When it bugged out again, I noted the file that caused the problem,
repeated step 6 and copied all the rest.
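Roughly, the re-assembly in step 6 looked like the commands below
(device names are just examples here, not my actual setup):

    # stop the broken array
    mdadm --stop /dev/md0
    # re-assemble it, including the drive with read errors; --force
    # accepts the out-of-date event count on that drive
    mdadm --assemble --force /dev/md0 /dev/sd[abcdef]1
    # if the partially-resynced spare got kicked out too, re-add it
    # so the resync can start again
    mdadm --add /dev/md0 /dev/sdg1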
This seems to be a bit of a roundabout way to do it. I would have much
preferred being able to force the resync process to continue despite
some block being unreadable (and have it just log which blocks were
causing problems when forced in that way). I could then figure out
which files were affected, and replace the drive with read errors as
well.
Perhaps I missed something... I was considering just using dd to
overwrite the block causing problems (and hopefully get it remapped),
but I'm not 100% sure how the LBA block numbers reported by
S.M.A.R.T. relate to the block numbers dd uses.
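If the S.M.A.R.T. LBA is simply an offset in 512-byte sectors on the
whole disk (that's my assumption; please correct me if it's wrong),
then something like this should do it (device name and LBA are just
placeholders):

    # read the suspect sector first, to confirm it really is unreadable
    dd if=/dev/sdc of=/dev/null bs=512 skip=123456789 count=1
    # overwrite it with zeros so the drive can remap it
    dd if=/dev/zero of=/dev/sdc bs=512 seek=123456789 count=1 oflag=direct

That would of course leave zeros in that sector, so the affected file
would still have to be restored afterwards.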
If I could have handled it better, let me know :)