Re: Intel SSD (DC S3700) Power_Loss_Cap_Test failure

Hello,

As a follow-up, conclusion, and dire warning to all who happen to encounter
this failure mode:

The server with the SSD that had the failed power-loss capacitor had a
religious experience 2 days ago and needed a power cycle to revive it.

Now, in theory, the data should have been safe, as the drive had minutes to
scribble away its cache.

Alas, what happened is that the SSD bricked itself; it is no longer
accessible, and the only meaningful output from "smartctl -a" is:
"SMART overall-health self-assessment test result: FAILED!"

I'm trying to think of a failure mode where the capacitor would cause
something like this and am coming up blank, so my theories at this time
are:

1. Something more substantial was failing and the error was a symptom, not
the cause.

2. Intel's "we won't let you deal with potentially broken data" rule
strikes again (they brick SSDs that reach maximum wear-out levels) and a
failed power-loss capacitor triggers the same rule.


Either way, if you ever encounter this problem, get a replacement ASAP.
If the drive serves as a journal SSD, shut down all associated OSDs,
flush the journals, and replace it, as sketched below.
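
For filestore OSDs that boils down to something like the following sketch
(OSD id 12, the init commands and the device names are examples; adjust to
your deployment):
---
# Stop the OSD so the journal is quiescent
systemctl stop ceph-osd@12        # or: stop ceph-osd id=12 (upstart)

# Write out anything still sitting in the journal
ceph-osd -i 12 --flush-journal

# ...physically replace the failed SSD and recreate the journal
#    partition plus the journal symlink for this OSD...

# Initialize the new journal and bring the OSD back
ceph-osd -i 12 --mkjournal
systemctl start ceph-osd@12
---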

Christian

On Wed, 3 Aug 2016 21:15:22 +0900 Christian Balzer wrote:
> 
> Hello,
> 
> On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:
> 
> > Christian, can you post your values for Power_Loss_Cap_Test on the drive which is failing?
> >
> Sure:
> ---
> 175 Power_Loss_Cap_Test     0x0033   001   001   010    Pre-fail  Always   FAILING_NOW 1 (47 942)
> ---
> 
> Now, according to the Intel data sheet, that raw value of 1 means "failed",
> NOT the actual buffer time the attribute usually reports, as it does on
> the neighboring SSD:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       614 (47 944)
> ---
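> 
> (If you want to pull just that attribute out, something along these lines
> works; /dev/sdb below is a placeholder for the actual device:)
> ---
> # Print normalized value, threshold and raw value of attribute 175
> # (assumes the classic "smartctl -A" column layout shown above)
> smartctl -A /dev/sdb | awk '$1 == 175 {print $4, $6, $10}'
> ---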
> 
> And my 800GB DC S3610s have more than 10 times the endurance; my guess is
> that's a combination of a larger cache and slower writes:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       8390 (22 7948)
> ---
> 
> I'll definitely leave that "failing" SSD in place until it has done the
> next self-check.
> 
> Christian
> 
> > Thanks
> > Jan
> > 
> > > On 03 Aug 2016, at 13:33, Christian Balzer <chibi@xxxxxxx> wrote:
> > > 
> > > 
> > > Hello,
> > > 
> > > Yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > > seemed such an odd thing to fail (given that it's not a single capacitor).
> > > 
> > > As for your Reallocated_Sector_Ct, that's really odd and definitely an
> > > RMA-worthy issue.
> > > 
> > > For the record, Intel SSDs use up a few (typically 24) reallocated sectors
> > > during firmware upgrades, so this is a totally healthy 3610. ^o^
> > > ---
> > >  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -       24
> > > ---
> > > 
> > > Christian
> > > 
> > > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> > > 
> > >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> > >> properly supports this drive model. Some of the SMART attribute names
> > >> have changed and make more sense now (and there are no more "Unknowns"):
> > >> 
> > >> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >>  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
> > >>  9 Power_On_Hours          -O--CK   100   100   000    -    1067
> > >> 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> > >> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> > >> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> > >> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> > >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> > >> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> > >> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> > >> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> > >> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> > >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 194 Temperature_Internal    -O---K   100   100   000    -    30
> > >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> > >> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> > >> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> > >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> > >> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> > >> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> > >> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> > >> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> > >> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
> > >> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
> > >> 
> > >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> > >> seems to be holding steady.
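> > >> 
> > >> (A quick-and-dirty way to watch that trend; the device name and log
> > >> path are placeholders:)
> > >> ---
> > >> # Log the raw reallocated-sector count once an hour with a timestamp
> > >> while sleep 3600; do
> > >>     echo "$(date -Is) $(smartctl -A /dev/sdb | awk '$1 == 5 {print $NF}')"
> > >> done >> /var/log/sdb-realloc.log
> > >> ---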
> > >> 
> > >> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> > >> death. The drive simply disappeared from the controller one day, and
> > >> could no longer be detected.
> > >> 
> > >> On 03/08/16 12:15, Jan Schermer wrote:
> > >>> Make sure you are reading the right attribute and interpreting it right.
> > >>> update-smart-drivedb sometimes works wonders :)
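> > >>> 
> > >>> (That's the stock script shipped with smartmontools; after running it
> > >>> you can check which drivedb entry matched, device name again being a
> > >>> placeholder:)
> > >>> ---
> > >>> update-smart-drivedb           # refresh the drive database
> > >>> smartctl -P show /dev/sdb      # show the preset matched for this drive
> > >>> ---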
> > >>> 
> > >>> I wonder what the isdct tool would say the drive's life expectancy is under this workload? Are you really writing ~600TB/month?
> > >>> 
> > >>> Jan
> > >>> 
> > >> 
> > >> 
> > >> 
> > > 
> > > 
> > 
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


