On 05/04/2011 08:31 PM, David Boreham wrote:
Here's my best theory at present: the failures ARE caused by cell
wear-out, but the SSD firmware is buggy in so far as it fails to boot
up and respond to host commands due to the wear-out state. So rather
than the expected outcome (SSD responds but has read-only behavior),
it appears to be (and is) dead. At least to my mind, this is a more
plausible explanation for the reported failures vs. the alternative
(SSD vendors are uniquely clueless at making basic electronics
subassemblies), especially considering the difficulty in testing the
firmware under all possible wear-out conditions.
One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so, what failure cause
was reported)?
Unfortunately not. Many of the people I deal with, particularly the
ones with budgets to be early SSD adopters, are not the sort to return
things that have failed to the vendor. In some of these shops, if the
data can't be securely erased first, it doesn't leave the place. The
idea that some trivial fix at the hardware level might bring the drive
back to life, data intact, is terrifying to many businesses when drives
fail hard.
Your bigger point, that this could just as easily be software failures
due to unexpected corner cases rather than hardware issues, is both a
fair one to raise and even more scary.
Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%. Typical measured AFRs for
mechanical drives are around 2% during their first year, spiking to 5%
afterwards. I suspect that Intel's numbers are actually much better
than the other manufacturers here, so an SSD from anyone else can
easily be less reliable than a regular hard drive still.
Hmm, this is speculation I don't support (non-Intel vendors have a 10x
worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them??
I was postulating that you only have to be 4X as bad as Intel to reach
2.4%, and then be worse than a mechanical drive for early failures. If
you look at http://labs.google.com/papers/disk_failures.pdf you can see
there's a 5:1 ratio in first-year AFR just between light and heavy usage
on the drive. So a 4:1 ratio between best and worst manufacturer for
SSDs seemed possible. Plenty of us have seen particular drive models
that were much more than 4X as bad as average ones among regular hard
drives.
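
To make the arithmetic above concrete, here is a rough sketch in Python.
The AFR figures are the ones quoted in this thread (0.6% claimed by
Intel, roughly 2% first-year and 5% later for mechanical drives); the
function names and the list of multipliers are made up for illustration,
and nothing here is measured data about any particular vendor.

```python
# AFR figures as quoted in this thread, expressed as fractions.
INTEL_SSD_AFR = 0.006        # Intel's claimed AFR for IT deployments
HDD_FIRST_YEAR_AFR = 0.02    # typical first-year AFR, mechanical drives
HDD_LATER_AFR = 0.05         # typical AFR after the first year


def scaled_afr(base_afr, multiplier):
    """AFR of a hypothetical vendor that is `multiplier` times worse."""
    return base_afr * multiplier


def breakeven_multiplier(base_afr, target_afr):
    """How many times worse than `base_afr` a vendor must be to hit `target_afr`."""
    return target_afr / base_afr


if __name__ == "__main__":
    # Hypothetical "times worse than Intel" factors, chosen for illustration.
    for m in (1, 2, 4, 10):
        afr = scaled_afr(INTEL_SSD_AFR, m)
        verdict = "worse" if afr > HDD_FIRST_YEAR_AFR else "no worse"
        print(f"{m:>2}x Intel: AFR {afr:.1%} ({verdict} than a first-year hard drive)")

    print(f"Break-even vs. first-year hard drive: "
          f"{breakeven_multiplier(INTEL_SSD_AFR, HDD_FIRST_YEAR_AFR):.2f}x")
    print(f"Break-even vs. later-year hard drive: "
          f"{breakeven_multiplier(INTEL_SSD_AFR, HDD_LATER_AFR):.2f}x")
```

With those inputs, the break-even point against a first-year mechanical
drive is a bit over 3X Intel's claimed rate, which is why 4X is already
enough to come out behind.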
--
Greg Smith 2ndQuadrant US greg@xxxxxxxxxxxxxxx Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books