>>>>> "Bill" == Bill Davidsen <davidsen@xxxxxxx> writes:

>> Most of the errors you see on drives are a result of media errors
>> that are big enough that the drive ECC can't correct them. Errors
>> are often caused by head misses due to bad tracking, vibration from
>> other drives in the enclosure, the user kicking the cabinet at an
>> inopportune moment, etc. I.e. external interference. Other errors
>> are due to real imperfections of the media itself.

Bill> I would be surprised if a consumer grade drive doing more retries
Bill> over several seconds rather than several rotations wasn't better
Bill> able to correct for most of the transient problems you mention.

Not all the problems I mentioned are of a transient nature. Several
common corruption scenarios are caused by transient external factors
*at write time*. No amount of retrying is going to fix something that
was badly written to begin with. It doesn't even have to be the sector
in question; it could be adjacent tracks that got clobbered.

Bill> Other than possibly having more ECC bits there isn't much
Bill> difference,

I mentioned better tracking/multiple sync marks as another crucial
difference. That's a pretty big deal in my book.

Nearline drive firmware also devotes resources to predicting impending
failure. They have the ability to throttle the I/O pipeline if there's
an increased risk of write error due to excessive seeking, overheating,
etc. That means that under load, performance can be choppy. That is
unacceptable behavior in the consumer/interactivity
benchmarketing-focused market, whereas making sure you write things
correctly is an absolute must in the enterprise space. And the
non-deterministic performance characteristics are not such a big deal
when the drives are sitting behind an array head with non-volatile
cache.

Bill> as several people here have noted you don't want the drive to
Bill> hang for several seconds trying this and that in a server
Bill> environment.
Bill> And given that there are a very small number of things to be
Bill> done on error, like reread, seek away and back, recalibrate,
Bill> etc,

Again, you are talking about behavior when a transient read error is
detected. My focus is the due diligence done by the firmware during
write operations.

It is correct that one of the defining characteristics of nearline
vs. consumer drives is the retry behavior. But that's not the point I
was trying to make. What I was trying to convey was that:

1. Contrary to popular belief there is no inherent mechanical
   difference between consumer and nearline drives. Same heads, arms,
   motors, etc. The premium you pay is not for "mechanical
   ruggedness". That's what most people assume when they are charged
   more(*).

2. The difference is largely in how the firmware encodes stuff on the
   physical platters in the drive, the internal housekeeping overhead.
   That difference between consumer and nearline is getting bigger
   with each generation of drives.

That said, I'm also sure you can appreciate that media defect
tolerances are likely to be different between nearline and consumer
kit despite coming off the same assembly line.

(*) Seagate recently put out some SAS nearline drives that have a
different logic board than their SATA cousins. So there's actually a
real hardware difference in that series. The fatter PCB with dual
processors enables even better integrity protection (on par with
"real" enterprise drives), albeit at lower duty cycles.

--
Martin K. Petersen	Oracle Linux Engineering

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
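[Editor's illustration] The bounded-retry behavior discussed in this thread — reread a few times, then fail fast so the RAID layer can rebuild the sector from redundancy instead of stalling for seconds — can be sketched as a toy model. This is not real drive firmware or any kernel interface; `read_sector` and the attempt cap are purely illustrative:

```python
def read_with_bounded_recovery(read_sector, lba, max_attempts=3):
    """Toy model of nearline-style error recovery: retry a failed read
    only a small, bounded number of times, then report the error
    promptly so the layer above (e.g. md) can reconstruct the data
    from redundancy rather than waiting out a long in-drive retry loop.

    read_sector(lba) returns bytes on success, or None on a media error.
    """
    for _attempt in range(max_attempts):
        data = read_sector(lba)
        if data is not None:
            return data  # recovered within the retry budget
    return None  # give up fast; the caller falls back to parity/mirror


# Example: a flaky sector that only reads cleanly on the third attempt.
state = {"reads": 0}

def flaky_sector(lba):
    state["reads"] += 1
    return b"\x00" * 512 if state["reads"] >= 3 else None

recovered = read_with_bounded_recovery(flaky_sector, lba=1234)

state["reads"] = 0
gave_up = read_with_bounded_recovery(flaky_sector, lba=1234, max_attempts=2)
```

With the default budget of 3 attempts the flaky sector is recovered; with a budget of 2 the function returns `None` quickly, modeling a drive that reports the error to the array instead of hanging — which, per the thread, is the behavior you want in a server environment.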