Re: SMART, RAID and real world experience of failures.

Steven Haigh <netwiz@xxxxxxxxx> · Fri, 06 Jan 2012 22:40:28 +1100

On 6/01/2012 10:22 PM, Peter Grandi wrote:
[ ... ]

I got a SMART error email yesterday from my home server with a
4 x 1TB RAID6. [ ... ]

That's an (euphemism alert) imaginative setup. Why not a 4 drive
RAID10? In general there are vanishingly few cases in which
RAID6 makes sense, and in the 4 drive case a RAID10 makes even
more sense than usual. Especially with the really cool setup
options that MD RAID10 offers.

The main reason is the easy ability to grow the RAID6 to an extra drive
when I need the space. I've just about allocated all of the array to
various VMs and file storage. Once that's full, it's easier to add another
1TB drive, grow the RAID, grow the PV and then either add more LVs or
grow the ones that need it. Sadly, I don't have the cash flow to just
replace the 1TB drives with 2TB drives or whatever the flavour of the
month is after 2 years.
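
To put rough numbers on the capacity side of this trade-off, here is a
small sketch in Python. It assumes equal-size drives, RAID10 keeping two
copies of every block, and no metadata overhead; the function names are
made up for illustration.

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of RAID6: two drives' worth of space holds parity."""
    if drives < 3:
        raise ValueError("RAID6 needs more than two drives to hold any data")
    return (drives - 2) * drive_tb


def raid10_usable_tb(drives: int, drive_tb: float, copies: int = 2) -> float:
    """Usable capacity of RAID10 keeping `copies` copies of each block."""
    return drives * drive_tb / copies


# Compare the two layouts for 4, 5 and 6 drives of 1TB each.
for n in (4, 5, 6):
    print(n, raid6_usable_tb(n, 1.0), raid10_usable_tb(n, 1.0))
```

With 4 drives both layouts give 2TB. Under these assumptions each extra
1TB drive adds a full 1TB to the RAID6 but only 0.5TB to a two-copy
RAID10.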

This makes me ponder. Has the drive recovered? Has the sector
with the read failure been remapped and hidden from view? Is
it still (more?) likely to fail in the near future?

Uhmmm, slightly naive questions. A 1TB drive has almost 2
billion sectors, so "bad" sectors should be common.
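
The sector count is easy to check. A quick sketch, assuming a decimal
"1TB" (10^12 bytes) and 512-byte logical sectors; a drive reporting
4096-byte sectors would have one eighth as many.

```python
DRIVE_BYTES = 10**12   # "1TB" as drive vendors count it (assumption)
SECTOR_BYTES = 512     # assumed logical sector size

sectors = DRIVE_BYTES // SECTOR_BYTES
print(f"{sectors:,} sectors")  # 1,953,125,000 sectors
```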

But the main point is that what is a "bad" sector is a messy
story, and most "bad" sectors are really marginal (and an
argument can be made that most sectors are marginal or else PRML
encoding would not be necessary). So many things can go wrong,
and not all fatally. For example, when some "bad" sectors were
written the drive may have been vibrating a bit more, so the head
was a little bit off, etc.

Writing over some marginal sectors often refreshes the
recording, and it is no longer marginal, and otherwise as you
guessed the drive can substitute the sector with a spare
(something that it cannot really do on reading of course).

This is what I was wondering... The drive has been running for about 1.9
years - pretty much 24/7. From checking the Seagate web site, it's still
under warranty until the end of 2012.

I guess the best thing to do is monitor the drive as I
have been doing and see if it's a one-off or becomes a regular
occurrence. My system does a check of the RAID every week as part of the
cron setup, so I'd hope things like this get picked up before it starts
losing any redundancy.
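
For reference, a weekly check like this can be driven through the md
sysfs files. The sketch below is only an illustration, not the actual
cron script: it assumes the array is md0, and the helper names are
invented. The sync_action and mismatch_cnt files are the standard md
sysfs interface.

```python
from pathlib import Path

MD_SYSFS = Path("/sys/block/md0/md")  # assumed array name


def request_check(md_dir: Path = MD_SYSFS) -> None:
    """Ask md to start a consistency check of the array (needs root)."""
    (md_dir / "sync_action").write_text("check\n")


def current_action(md_dir: Path = MD_SYSFS) -> str:
    """Return what md is doing right now, e.g. 'idle' or 'check'."""
    return (md_dir / "sync_action").read_text().strip()


def mismatch_count(md_dir: Path = MD_SYSFS) -> int:
    """Return the mismatch counter recorded by the most recent check."""
    return int((md_dir / "mismatch_cnt").read_text().strip())
```

A cron job would call request_check(), wait until current_action()
returns to "idle", and then mail out a warning if mismatch_count() is
non-zero. The check reads every sector, so a marginal one should show up
there rather than during a rebuild.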

--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
