Re: Spares and partitioning huge disks

Derek Piper <derek.piper@xxxxxxxxx> · Fri, 14 Jan 2005 14:14:47 -0500

Ah, that does sound much better, I agree ... having just been bitten by
the 'oh dear, one bit was out of place, bye bye disk' problem
myself.

Even if it only 'failed' a 'chunk', it would be an improvement. I'll
take 64K over 60GB any day. The read for the chunk could then be
calculated using parity and a notification sent upwards saying
something to this effect: 'uh, hey, I'm having to regenerate data from
disk N at area X on-the-fly (i.e. I'm 'degraded') but all disks are
still with us and the other data is not in harm's way, you might want
to think about backups and possibly a new disk'. If the chunk/sector
(choose how much you want to fail) can then be read again, clear the
'alert'. Of course, if two chunks in the same stripe misread,
you're screwed. Probably less screwed than if it were a whole disk,
though.
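
To make that concrete, here is a rough Python sketch of the idea, not
anything md actually does. It models a single RAID5 stripe, flags one
chunk as unreadable instead of failing the whole disk, and regenerates
that chunk from the others by XOR. The Stripe class and its mark_bad /
mark_good / read methods are names made up for illustration.

```python
from functools import reduce

def xor_chunks(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

class Stripe:
    """One RAID5 stripe: N data chunks plus one parity chunk (hypothetical model)."""

    def __init__(self, data_chunks):
        # The last "disk" holds the parity, the XOR of all data chunks.
        self.chunks = list(data_chunks) + [xor_chunks(data_chunks)]
        self.bad = set()  # disks whose chunk in this stripe is unreadable

    @property
    def degraded(self):
        # The 'alert': true while any chunk has to be regenerated.
        return bool(self.bad)

    def mark_bad(self, disk):
        self.bad.add(disk)

    def mark_good(self, disk):
        # The chunk could be read again, so clear the alert for it.
        self.bad.discard(disk)

    def read(self, disk):
        if disk not in self.bad:
            return self.chunks[disk]
        others = [i for i in range(len(self.chunks)) if i != disk]
        if self.bad.intersection(others):
            raise IOError("double fault: two chunks unreadable in one stripe")
        # Regenerate the missing chunk on the fly from the surviving ones.
        return xor_chunks([self.chunks[i] for i in others])

stripe = Stripe([b"abcd", b"efgh", b"ijkl"])
stripe.mark_bad(1)           # disk 1 misreads this chunk only
recovered = stripe.read(1)   # rebuilt from disks 0, 2 and parity
```

Only the one stripe is degraded here; every other stripe on "disk 1"
would still be read normally. Marking a second chunk bad in the same
stripe makes read() raise, which is the 'you're screwed' case.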

Derek

On Fri, 14 Jan 2005 18:46:54 +0100, maarten <maarten@xxxxxxxxxxxx> wrote:
> 
> Mod parent "+5 Insightful".
> 
> Very well thought out and said, Dieter.
> 
> Maarten
> 
> On Friday 14 January 2005 18:29, Dieter Stueken wrote:
> > Frank van Maarseveen wrote:
> 
> > > I did not intend to cut it out but simplified the situation a bit: if
> > > you have all the RAID5 disks even with a bunch of errors spread out over
> > > all of them then yes, you basically still have the data.  Nothing is
> > > lost provided there's no double fault and disks are not dead yet. But
> > > there are not many technical people I would trust for recovering from
> > > this situation. And I wouldn't trust myself without a significant
> > > coffee intake either :)
> >
> > I think read errors are to be handled very differently compared to disk
> > failures. In particular the affected disk should not be kicked out
> > incautiously. If done so, you waste the real power of the RAID5 system
> > immediately! As long as any other part of the disk can still be read,
> > this data must be preserved by all means. As long as only parts of a disk
> > (even of different disks) can't be read, it is not a fatal problem, as long
> > as the data can still be read from another disk of the array. There is no
> > reason to kill any disk in advance.
> >
> > What I'm missing is some improved concept of replacing a disk:
> > Kicking off some disk at first and starting to resync to a spare
> > disk thereafter is a very dangerous approach. Instead some "presync"
> > should be possible: After a decision to replace some disk, the new
> > (spare) disk should be prepared in advance, while all other disks are still
> > running. After the spare disk was successfully prepared, the disk to
> > replace may be disabled.
> >
> > This sounds a bit like RAID6, but it is much simpler. The complicated part
> > may be the phase where I have one additional disk. A simple solution would
> > be to perform a resync offline, while no write takes place. This may even
> > be performed by a userland utility. If I want to perform the "presync"
> > online, I have to carry out writes to both disks simultaneously, while the
> > presync takes place.
> >
> > Dieter.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Derek Piper - derek.piper@xxxxxxxxx
http://doofer.org/