Re: Intel SSD (DC S3700) Power_Loss_Cap_Test failure

Christian Balzer <chibi@xxxxxxx> · Wed, 3 Aug 2016 21:15:22 +0900

Hello,

On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:

> Christian, can you post your values for Power_Loss_Cap_Test on the drive which is failing?
>
Sure:
---
175 Power_Loss_Cap_Test     0x0033   001   001   010    Pre-fail  Always   FAILING_NOW 1 (47 942)
---

Now, according to the Intel data sheet, that value of 1 means the test
failed; it is NOT the actual buffer time the attribute usually reports,
like this on the neighboring SSD:
---
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       614 (47 944)
---

And my 800GB DC S3610s report more than 10 times that value (8390 vs. 614);
my guess is a combo of larger cache and slower writes:
---
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       8390 (22 7948)
---
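
For anyone who wants to check this across a batch of drives, here is a
minimal Python sketch that parses one attribute line in the old-style
"smartctl -A" column layout shown above. The names SmartAttr,
parse_attr_line, cap_test_failed and leading_raw are made up for this
example. It assumes a failure shows up either as FAILING_NOW or as a
normalized VALUE at or below a non-zero THRESH, and it leaves the numbers
in parentheses uninterpreted.

```python
import re
from typing import NamedTuple, Optional


class SmartAttr(NamedTuple):
    """One row of the old-style `smartctl -A` attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attr_line(line: str) -> Optional[SmartAttr]:
    """Return the parsed row, or None if the line is not an attribute row."""
    m = _LINE.match(line)
    if m is None:
        return None
    attr_id, name, value, worst, thresh, _type, _updated, when_failed, raw = m.groups()
    return SmartAttr(int(attr_id), name, int(value), int(worst), int(thresh),
                     when_failed, raw.strip())


def cap_test_failed(attr: SmartAttr) -> bool:
    """Assumption: failed if flagged FAILING_NOW or VALUE <= THRESH (THRESH > 0)."""
    if attr.when_failed == "FAILING_NOW":
        return True
    return attr.thresh > 0 and attr.value <= attr.thresh


def leading_raw(attr: SmartAttr) -> int:
    """First number of RAW_VALUE; anything in parentheses is ignored."""
    return int(attr.raw.split()[0])
```

Fed the two S3700 lines above, cap_test_failed() should flag only the
first one, and leading_raw() gives 1 and 614 respectively.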

I'll definitely leave that "failing" SSD in place until it has done the
next self-check.

Christian

> Thanks
> Jan
> 
> > On 03 Aug 2016, at 13:33, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > seemed to be such an odd thing to fail (given that it's not a single capacitor).
> > 
> > As for your Reallocated_Sector_Ct, that's really odd and definitely an
> > RMA-worthy issue.
> > 
> > For the record, Intel SSDs use some sectors (typically 24) when doing
> > firmware upgrades, so a count of 24 here is a totally healthy 3610. ^o^
> > ---
> >  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -       24
> > ---
> > 
> > Christian
> > 
> > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> > 
> >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> >> properly supports this drive model. Some of the smart attr names have
> >> changed, and make more sense now (and there are no more "Unknowns"):
> >> 
> >> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >>  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
> >>  9 Power_On_Hours          -O--CK   100   100   000    -    1067
> >> 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> >> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> >> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> >> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> >> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> >> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> >> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> >> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> > >> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> >> 194 Temperature_Internal    -O---K   100   100   000    -    30
> >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> >> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> >> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> >> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> >> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> >> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> >> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> >> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> >> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
> >> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
> >> 
> >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> >> seems to be holding steady.
> >> 
> >> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> >> death. The drive simply disappeared from the controller one day, and
> >> could no longer be detected.
> >> 
> >> On 03/08/16 12:15, Jan Schermer wrote:
> > >>> Make sure you are reading the right attribute and interpreting it correctly.
> > >>> update-smart-drivedb sometimes works wonders :)
> >>> 
> > >>> I wonder what the isdct tool would say the drive's life expectancy is with this workload? Are you really writing ~600TB/month??
> >>> 
> >>> Jan
> >>> 
> >> 
> >> 
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/