Re: Fwd: Re: SSDD reliability

David Boreham <david_list@xxxxxxxxxxx> · Wed, 04 May 2011 18:31:55 -0600

On 5/4/2011 6:02 PM, Greg Smith wrote:
> On 05/04/2011 03:24 PM, David Boreham wrote:
>> So if someone says that SSDs have "failed", I'll assume that they
>> suffered from Flash cell wear-out unless there is compelling proof
>> to the contrary.

> I've been involved in four recovery situations similar to the one
> described in that coding horror article, and zero of them were flash
> wear-out issues.  The telling sign is that the device should fail to
> read-only mode if it wears out.  That's not what I've seen happen
> though; what reports from the field are saying is that sudden,
> complete failures are the more likely event.

Sorry to harp on this (last time I promise), but I somewhat do know what 
I'm talking about, and I'm quite motivated to get to the bottom of this 
"SSDs fail, but not for the reason you'd suspect" syndrome (because we 
want to deploy SSDs in production soon).

Here's my best theory at present: the failures ARE caused by cell
wear-out, but the SSD firmware is buggy insofar as it fails to boot up
and respond to host commands once the device is in the wear-out state.
So rather than the expected outcome (SSD responds but has read-only
behavior), it appears to be (and is) dead. At least to my mind, this is
a more plausible explanation for the reported failures than the
alternative (SSD vendors are uniquely clueless at making basic
electronics subassemblies), especially considering the difficulty of
testing the firmware under all possible wear-out conditions.

One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so, what failure cause
was reported)?

> The environment inside a PC of any sort, desktop or particularly
> portable, is not a predictable environment.  Just because the drives
> should be less prone to heat and vibration issues doesn't mean
> individual components can't slide out of spec because of them.  And
> hard drive manufacturers have a giant head start at working out
> reliability bugs in that area.  You can't design that sort of issue
> out of a new product in advance; all you can do is analyze returns
> from the field, see what you screwed up, and do another design rev to
> address it.

That's not really how it works (I was the guy responsible for this for
10 years in a prior career, so I feel somewhat qualified to argue about
it). The technology and manufacturing processes are common across many
different types of product. They either all work, or they all fail. In
fact, I'll eat my keyboard if SSDs are not manufactured on the exact
same production lines as regular disk drives, DRAM modules, and so on
(manufacturing tends to be contracted to high volume factories that make
all kinds of things on the same lines). The only thing that is different
about SSDs vs. any other electronics you'd come across is the Flash
devices themselves. However, those are used in extraordinarily high
volumes all over the place, and if there were a failure mode with the
incidence suggested by these stories, I suspect we'd be reading about it
on the front page of the WSJ.

> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for
> mechanical drives is around 2% during their first year, spiking to 5%
> afterwards.  I suspect that Intel's numbers are actually much better
> than the other manufacturers here, so a SSD from anyone else can
> easily be less reliable than a regular hard drive still.

Hmm, this is speculation I don't support (that non-Intel vendors have a
10x worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them?
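
To put rough numbers on that, here is a small sketch of the arithmetic
using only the AFR figures quoted above. The fleet size of 1000 drives
and the function names are my own assumptions, picked purely for
illustration; nothing here is measured data.

```python
# AFR figures quoted in this thread, expressed as fractions per year.
INTEL_SSD_AFR = 0.006   # Intel's claimed AFR in IT deployments
HDD_AFR_YEAR1 = 0.02    # typical mechanical drive, first year
HDD_AFR_LATER = 0.05    # typical mechanical drive, after the first year


def expected_failures(drives: int, afr: float) -> float:
    """Expected number of drives failing in one year, given an AFR."""
    return drives * afr


def multiplier_to_match(target_afr: float, base_afr: float = INTEL_SSD_AFR) -> float:
    """How many times worse than base_afr a vendor must be to hit target_afr."""
    return target_afr / base_afr


if __name__ == "__main__":
    fleet = 1000  # hypothetical fleet size, for illustration only
    rates = [
        ("Intel SSD (claimed)", INTEL_SSD_AFR),
        ("HDD, first year", HDD_AFR_YEAR1),
        ("HDD, later years", HDD_AFR_LATER),
    ]
    for label, afr in rates:
        print(f"{label}: ~{expected_failures(fleet, afr):.0f} failures per {fleet} drives per year")

    # How much worse than Intel's claimed figure would another vendor
    # have to be before its SSDs merely matched a mechanical drive?
    print(f"to match HDD first year: {multiplier_to_match(HDD_AFR_YEAR1):.1f}x")
    print(f"to match HDD later years: {multiplier_to_match(HDD_AFR_LATER):.1f}x")
```

In other words, another vendor would need to be a bit over 3x worse than
Intel's claimed figure just to match a mechanical drive in its first
year, and over 8x worse to match one after that.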

For the benefit of anyone reading this who may have a failed SSD: all
the tier 1 manufacturers have departments dedicated to the analysis of
product that fails in the field. With some persistence, you can usually
get them to take a failed unit and put it through the FA process (and
tell you why it failed). For example, here's a job posting for someone
who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345
I'd encourage you to at least try to get your failed devices into the 
failure analysis pile. If units are not returned, the manufacturer never 
finds out what broke, and therefore can't fix the problem.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)