>>>>> "Bill" == Bill Davidsen <davidsen@xxxxxxx> writes:

>> Most of the errors you see on drives are a result of media errors
>> that are big enough that the drive ECC can't correct them. Errors
>> are often caused by head misses due to bad tracking, vibration from
>> other drives in the enclosure, the user kicking the cabinet at an
>> inopportune moment, etc. I.e. external interference. Other errors
>> are due to real imperfections of the media itself.

Bill> I would be surprised if a consumer grade drive doing more retries
Bill> over several seconds rather than several rotations wasn't better
Bill> able to correct for most of the transient problems you mention.

Not all the problems I mentioned are of a transient nature. Several
common corruption scenarios are caused by transient external factors
*at write time*. No amount of retrying is going to fix something that
was badly written to begin with. It doesn't even have to be the sector
in question; it could be adjacent tracks that got clobbered.

Bill> Other than possibly having more ECC bits there isn't much
Bill> difference,

I mentioned better tracking/multiple sync marks as another crucial
difference. That's a pretty big deal in my book.

Nearline drive firmware also devotes resources to predicting impending
failure. They have the ability to throttle the I/O pipeline if there's
an increased risk of write error due to excessive seeking, overheating,
etc. That means that under load, performance can be choppy. That is
unacceptable behavior in the consumer/interactivity
benchmarketing-focused market, whereas making sure you write things
correctly is an absolute must in the enterprise space. And the
non-deterministic performance characteristics are not such a big deal
when the drives are sitting behind an array head with non-volatile
cache.

Bill> as several people here have noted you don't want the drive to
Bill> hang for several seconds trying this and that in a server
Bill> environment.
Bill> And given that there are a very small number of things to be
Bill> done on error, like reread, seek away and back, recalibrate,
Bill> etc,

Again, you are talking about behavior when a transient read error is
detected. My focus is the due diligence done by the firmware during
write operations.

It is correct that one of the defining characteristics of nearline
vs. consumer drives is the retry behavior. But that's not the point I
was trying to make. What I was trying to convey was that:

1. Contrary to popular belief there is no inherent mechanical
   difference between consumer and nearline drives. Same heads, arms,
   motors, etc. The premium you pay is not for "mechanical
   ruggedness". That's what most people assume when they are charged
   more(*).

2. The difference is largely in how the firmware encodes stuff on the
   physical platters in the drive, the internal housekeeping overhead.
   That difference between consumer and nearline is getting bigger
   with each generation of drives.

That said, I'm also sure you can appreciate that media defect
tolerances are likely to be different between nearline and consumer
kit despite coming off the same assembly line.

(*) Seagate recently put out some SAS nearline drives that have a
different logic board than their SATA cousins. So there's actually a
real hardware difference in that series. The fatter PCB with dual
processors enables even better integrity protection (on par with
"real" enterprise drives), albeit at lower duty cycles.

--
Martin K. Petersen	Oracle Linux Engineering

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
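[Editor's illustration] The bounded-retry behavior discussed in this thread — reread a few times, then fail fast so the RAID layer can rebuild the sector from redundancy instead of stalling for seconds — can be sketched as a toy model. This is not real drive firmware or any kernel interface; `read_sector` and the attempt cap are purely illustrative:

```python
def read_with_bounded_recovery(read_sector, lba, max_attempts=3):
    """Toy model of nearline-style error recovery: retry a failed read
    only a small, bounded number of times, then report the error
    promptly so the layer above (e.g. md) can reconstruct the data
    from redundancy rather than waiting out a long in-drive retry loop.

    read_sector(lba) returns bytes on success, or None on a media error.
    """
    for _attempt in range(max_attempts):
        data = read_sector(lba)
        if data is not None:
            return data  # recovered within the retry budget
    return None  # give up fast; the caller falls back to parity/mirror


# Example: a flaky sector that only reads cleanly on the third attempt.
state = {"reads": 0}

def flaky_sector(lba):
    state["reads"] += 1
    return b"\x00" * 512 if state["reads"] >= 3 else None

recovered = read_with_bounded_recovery(flaky_sector, lba=1234)

state["reads"] = 0
gave_up = read_with_bounded_recovery(flaky_sector, lba=1234, max_attempts=2)
```

With the default budget of 3 attempts the flaky sector is recovered; with a budget of 2 the function returns `None` quickly, modeling a drive that reports the error to the array instead of hanging — which, per the thread, is the behavior you want in a server environment.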