Re: upgrade advice

Bill Davidsen <davidsen@xxxxxxx> · Mon, 12 Jan 2009 22:00:16 -0500

Martin K. Petersen wrote:
> People always seem to assume that hardware is what's making the
> difference between consumer and enterprise.  It's not.  The physical
> hardware differs mostly due to capacity vs. RPM trade-offs.  Most
> vendors these days have big platters for high-capacity drives and
> smaller platters for high RPM/higher IOPS class drives.  On top of
> that, head/platter count may vary in capacity classes within a series.
>
> But the important difference between consumer and enterprise drives is
> not mechanical.  It's the firmware.  Consumer drive firmware is about
> squeezing out the most capacity/$ and nothing else.

That's simply not the case. Cost is one of the issues, but the typical 
use of the drive is one of the most important factors shaping the firmware. 
With consumer drives, it's likely that this is the one and only drive 
holding the data, so clever retries in case of error are important. With 
server-grade drives, it's likely they are in a RAID, so returning the 
error quickly, so that the controller or OS can compensate, is what 
matters. That's been discussed here before; some drives even give you a 
choice of "don't hang up" vs. "try like hell" on errors, through jumpers 
or firmware.
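To make the trade-off concrete, here is a minimal simulation (not any vendor's firmware) in which the only difference between "try like hell" and "don't hang up" is a time budget for retries. The names `RecoveryMode` and `read_with_recovery`, and the budget values, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RecoveryMode:
    """Hypothetical error-recovery profile: how long the drive may keep retrying."""
    name: str
    budget_s: float  # total time allowed for retries before reporting an error


# Illustrative budgets only; real drives expose their own settings.
TRY_LIKE_HELL = RecoveryMode("consumer (try like hell)", budget_s=60.0)
DONT_HANG_UP = RecoveryMode("RAID (don't hang up)", budget_s=7.0)


def read_with_recovery(mode: RecoveryMode,
                       succeeds_on_attempt: Optional[int],
                       attempt_cost_s: float) -> Tuple[str, float]:
    """Simulate retrying a bad sector under a time budget.

    succeeds_on_attempt: 1-based attempt on which the read would finally
    succeed, or None if it never does. attempt_cost_s: time per attempt.
    Returns ("ok", elapsed) or ("error", elapsed).
    """
    elapsed = 0.0
    attempt = 0
    # Only start another attempt if it fits inside the remaining budget.
    while elapsed + attempt_cost_s <= mode.budget_s:
        attempt += 1
        elapsed += attempt_cost_s
        if succeeds_on_attempt is not None and attempt >= succeeds_on_attempt:
            return "ok", elapsed
    # Budget exhausted: report the error so a RAID layer can rebuild the
    # data from redundancy instead of waiting on this drive.
    return "error", elapsed


if __name__ == "__main__":
    # A marginal sector that would read on the 30th try, 0.5 s per try.
    print(read_with_recovery(TRY_LIKE_HELL, 30, 0.5))
    print(read_with_recovery(DONT_HANG_UP, 30, 0.5))
```

The same marginal sector is recovered by the patient profile after 15 seconds, while the RAID profile gives up after 7 seconds and leaves recovery to the array, which is exactly the behaviour a standalone desktop drive should avoid.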

> Enterprise drives trade capacity for reliability by way of the
> firmware.  That includes many things like using more space for track
> info (gap/sync), much better ECC, better tolerance for rotational
> vibration, etc.
>
> Most of the errors you see on drives are a result of media errors that
> are big enough that the drive ECC can't correct them.  Errors are
> often caused by head misses due to bad tracking, vibration from other
> drives in the enclosure, the user kicking the cabinet at an
> inopportune moment, etc.  I.e. external interference.  Other errors
> are due to real imperfections of the media itself.

I would be surprised if a consumer-grade drive doing more retries over 
several seconds rather than several rotations wasn't better able to 
correct for most of the transient problems you mention. So your comments 
about transient mechanical issues aren't telling me much, other than that 
server drives are more likely to suffer vibration from neighboring drives.
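For scale: a given sector passes under the head once per revolution, so the number of rotations in a time window is a rough upper bound on plain re-read attempts (seeks and recalibration cost more). The RPM values below are common drive classes picked for illustration.

```python
def rotation_period_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def rotations_in(seconds: float, rpm: float) -> float:
    """Revolutions (i.e. chances to re-read one sector) that fit in a window."""
    return seconds * rpm / 60.0


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000, 15000):
        print(f"{rpm:>5} RPM: {rotation_period_ms(rpm):5.2f} ms/rev, "
              f"{rotations_in(1.0, rpm):6.1f} revs per second")
```

At 7200 RPM one revolution takes about 8.3 ms, so "several seconds" of retrying allows hundreds of passes over the sector, versus a handful for "several rotations".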

> Enterprise drive firmware is about being more resistant to outside
> factors as well as real media defects.  That firmware cost more to
> develop than the consumer ditto.  And the vendors charge a premium for
> it.

Other than possibly having more ECC bits, there isn't much difference. As 
several people here have noted, you don't want the drive to hang for 
several seconds trying this and that in a server environment. And given 
that there is a very small number of things to be done on error, like 
reread, seek away and back, recalibrate, etc., I would be amazed if 
vendors didn't just put all the code in the firmware and use a little 
table to determine which actions to take, in what order, and how many 
times. The idea of some vast and complex code just doesn't fly; there 
aren't that many things to try.
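The "little table" idea can be sketched as follows. This is a hypothetical illustration of the design being speculated about, not a description of real drive firmware: the action names come from the paragraph above, while the table contents, repeat counts, and the `recover` helper are made up.

```python
from typing import Callable, List, Tuple

# Hypothetical recovery table: (action, how many times to try it), in order.
RecoveryTable = List[Tuple[str, int]]

CONSUMER_TABLE: RecoveryTable = [
    ("reread", 20),
    ("seek_away_and_back", 10),
    ("recalibrate", 3),
]
RAID_TABLE: RecoveryTable = [
    ("reread", 3),
    ("seek_away_and_back", 1),
]


def recover(table: RecoveryTable,
            try_action: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Walk the table in order, stopping at the first action that succeeds.

    try_action(name) -> bool stands in for the drive's low-level primitives.
    Returns (recovered, log of every action attempted).
    """
    log: List[str] = []
    for action, times in table:
        for _ in range(times):
            log.append(action)
            if try_action(action):
                return True, log
    return False, log


if __name__ == "__main__":
    # A sector that only reads back after a recalibrate.
    needs_recal = lambda action: action == "recalibrate"
    for name, table in (("consumer", CONSUMER_TABLE), ("raid", RAID_TABLE)):
        ok, log = recover(table, needs_recal)
        print(name, ok, len(log))
```

The recovery code is identical for both drive classes; only the table differs. Against a sector that needs a recalibrate, the consumer table succeeds after 31 attempts, while the RAID table gives up after 4 and lets the array handle it.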

--
Bill Davidsen <davidsen@xxxxxxx>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismarck
