On 5/4/2011 6:02 PM, Greg Smith wrote:
> On 05/04/2011 03:24 PM, David Boreham wrote:
>> So if someone says that SSDs have "failed", I'll assume that they
>> suffered from Flash cell wear-out unless there is compelling proof
>> to the contrary.
>
> I've been involved in four recovery situations similar to the one
> described in that coding horror article, and zero of them were flash
> wear-out issues. The telling sign is that the device should fail to
> read-only mode if it wears out. That's not what I've seen happen
> though; what reports from the field are saying is that sudden,
> complete failures are the more likely event.

Sorry to harp on this (last time I promise), but I somewhat do know what
I'm talking about, and I'm quite motivated to get to the bottom of this
"SSDs fail, but not for the reason you'd suspect" syndrome (because we
want to deploy SSDs in production soon).
Here's my best theory at present: the failures ARE caused by cell
wear-out, but the SSD firmware is buggy insofar as it fails to boot up
and respond to host commands due to the wear-out state. So rather than
the expected outcome (SSD responds but has read-only behavior), it
appears to be (and is) dead. At least to my mind, this is a more
plausible explanation for the reported failures vs. the alternative (SSD
vendors are uniquely clueless at making basic electronics
subassemblies), especially considering the difficulty in testing the
firmware under all possible wear-out conditions.
One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so, what failure cause
was reported)?

> The environment inside a PC of any sort, desktop or particularly
> portable, is not a predictable environment. Just because the drives
> should be less prone to heat and vibration issues doesn't mean
> individual components can't slide out of spec because of them. And
> hard drive manufacturers have a giant head start at working out
> reliability bugs in that area. You can't design that sort of issue
> out of a new product in advance; all you can do is analyze returns
> from the field, see what you screwed up, and do another design rev to
> address it.

That's not really how it works (I've been the guy responsible for this
for 10 years in a prior career, so I feel somewhat qualified to argue
about this). The technology and manufacturing processes are common
across many different types of product. They either all work, or they
all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on
the exact same production lines as regular disk drives, DRAM modules,
and so on (manufacturing tends to be contracted to high volume factories
that make all kinds of things on the same lines). The only different
thing about SSDs vs. any other electronics you'd come across is the
Flash devices themselves. However, those are used in extraordinarily high
volumes all over the place and if there were a failure mode with the
incidence suggested by these stories, I suspect we'd be reading about it
on the front page of the WSJ.

> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%. Typical measured AFRs for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards. I suspect that Intel's numbers are actually much better
> than the other manufacturers here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

Hmm, this is speculation I don't support (that non-Intel vendors have a
10x worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them?
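
To put the AFR figures quoted above in perspective, here is a rough
sketch of what they would mean for a fleet of drives. The fleet size,
the "10x Intel" rate, and the assumption that failures are independent
with a constant annual rate are all illustrative assumptions, not
measured data; only the 0.6%, 2% and 5% figures come from this thread.

```python
# Rough sketch: expected annual drive failures for a given AFR.
# Assumes independent failures and a constant rate over the year.

def expected_failures(fleet_size: int, afr: float) -> float:
    """Expected number of drives failing in one year."""
    return fleet_size * afr


def prob_at_least_one_failure(fleet_size: int, afr: float) -> float:
    """Probability that at least one drive in the fleet fails in a year."""
    return 1.0 - (1.0 - afr) ** fleet_size


INTEL_SSD_AFR = 0.006  # Intel's claimed figure, as quoted in the thread

AFR_BY_DRIVE = {
    "Intel SSD (claimed)": INTEL_SSD_AFR,
    "mechanical, first year": 0.02,   # as quoted in the thread
    "mechanical, later years": 0.05,  # as quoted in the thread
    # Hypothetical rate only, to illustrate the "10x worse" speculation.
    "hypothetical 10x Intel": 10 * INTEL_SSD_AFR,
}

if __name__ == "__main__":
    fleet = 100  # hypothetical fleet size
    for label, afr in AFR_BY_DRIVE.items():
        print(
            f"{label:26s} AFR={afr:.1%}  "
            f"expected failures/year={expected_failures(fleet, afr):.1f}  "
            f"P(at least one)={prob_at_least_one_failure(fleet, afr):.1%}"
        )
```

If the other vendors really were 10x worse, that would put them above
the quoted rate for mechanical drives past their first year, which is
the kind of difference that would be hard to miss in field returns.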
For the benefit of anyone reading this who may have a failed SSD: all
the tier 1 manufacturers have departments dedicated to the analysis of
product that fails in the field. With some persistence, you can usually
get them to take a failed unit and put it through the FA process (and
tell you why it failed). For example, here's a job posting for someone
who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345
I'd encourage you to at least try to get your failed devices into the
failure analysis pile. If units are not returned, the manufacturer never
finds out what broke, and therefore can't fix the problem.
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)