
Re: Fwd: Re: SSDD reliability

On 5/4/2011 6:02 PM, Greg Smith wrote:
On 05/04/2011 03:24 PM, David Boreham wrote:
So if someone says that SSDs have "failed", I'll assume that they suffered from Flash cell
wear-out unless there is compelling proof to the contrary.

I've been involved in four recovery situations similar to the one described in that Coding Horror article, and zero of them were flash wear-out issues. The telling sign is that the device should fail to read-only mode if it wears out. That's not what I've seen happen, though; reports from the field say that sudden, complete failures are the more likely event.

Sorry to harp on this (last time, I promise), but I do somewhat know what I'm talking about, and I'm quite motivated to get to the bottom of this "SSDs fail, but not for the reason you'd suspect" syndrome (because we want to deploy SSDs in production soon).

Here's my best theory at present: the failures ARE caused by cell wear-out, but the SSD firmware is buggy insofar as it fails to boot up and respond to host commands due to the wear-out state. So rather than the expected outcome (SSD responds but has read-only behavior), it appears to be (and is) dead. At least to my mind, this is a more plausible explanation for the reported failures than the alternative (SSD vendors are uniquely clueless at making basic electronics subassemblies), especially considering the difficulty of testing the firmware under all possible wear-out conditions.

One question worth asking: in the cases you were involved in, was manufacturer failure analysis performed (and if so, what failure cause was reported)?

The environment inside a PC of any sort, desktop or particularly portable, is not a predictable environment. Just because the drives should be less prone to heat and vibration issues doesn't mean individual components can't slide out of spec because of them. And hard drive manufacturers have a giant head start at working out reliability bugs in that area. You can't design that sort of issue out of a new product in advance; all you can do is analyze returns from the field, see what you screwed up, and do another design rev to address it.

That's not really how it works (I've been the guy responsible for this for 10 years in a prior career, so I feel somewhat qualified to argue about this). The technology and manufacturing processes are common across many different types of product. They either all work, or they all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on the exact same production lines as regular disk drives, DRAM modules, and so on (manufacturing tends to be contracted to high-volume factories that make all kinds of things on the same lines). The only thing that differs between SSDs and any other electronics you'd come across is the Flash devices themselves. However, those are used in extraordinarily high volumes all over the place, and if there were a failure mode with the incidence suggested by these stories, I suspect we'd be reading about it on the front page of the WSJ.


Intel claims the Annual Failure Rate (AFR) of their SSDs in IT deployments (not OEM ones) is 0.6%. Typical measured AFRs for mechanical drives are around 2% during their first year, spiking to 5% afterwards. I suspect that Intel's numbers are actually much better than the other manufacturers' here, so an SSD from anyone else can still easily be less reliable than a regular hard drive.

Hmm, this is speculation I don't support (that non-Intel vendors have a 10x worse early failure rate). The entire industry uses very similar processes (often the same factories). One rogue vendor with a bad process... sure, but all of them?
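
To put the quoted AFR figures in perspective, here is a rough sketch in Python of what they would mean for a fleet of drives. It assumes a constant failure rate over the year and independent failures, which is a simplification; the function names and the fleet size of 100 drives are made up for illustration, and only the 0.6%, 2% and 5% figures come from the text quoted above.

```python
def expected_failures(afr, drives):
    """Expected number of failed drives in one year (assumes constant AFR)."""
    return afr * drives


def prob_at_least_one_failure(afr, drives):
    """Chance that at least one of `drives` fails within a year.

    Assumes failures are independent, which real deployments may violate
    (shared firmware bugs, shared power events, and so on).
    """
    return 1.0 - (1.0 - afr) ** drives


# AFR figures as quoted in the thread; the fleet size is hypothetical.
QUOTED_AFRS = [
    ("Intel SSD (claimed)", 0.006),
    ("Mechanical drive, first year", 0.02),
    ("Mechanical drive, later years", 0.05),
]

FLEET_SIZE = 100  # hypothetical

for label, afr in QUOTED_AFRS:
    print(
        f"{label}: expect {expected_failures(afr, FLEET_SIZE):.1f} failures/year, "
        f"P(at least one) = {prob_at_least_one_failure(afr, FLEET_SIZE):.2f}"
    )
```

The point of the sketch is only that the quoted numbers differ by less than an order of magnitude, so a claim of a 10x gap between vendors needs data behind it.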

For the benefit of anyone reading this who may have a failed SSD: all the tier 1 manufacturers have departments dedicated to the analysis of product that fails in the field. With some persistence, you can usually get them to take a failed unit and put it through the failure analysis (FA) process (and tell you why it failed). For example, here's a job posting for someone who would do this work:
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345
I'd encourage you to at least try to get your failed devices into the failure analysis pile. If units are not returned, the manufacturer never finds out what broke, and therefore can't fix the problem.






--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

