Re: Intel SSD (DC S3700) Power_Loss_Cap_Test failure

Daniel Swarbrick <daniel.swarbrick@xxxxxxxxxxxxxxxx> · Wed, 3 Aug 2016 11:12:47 +0200

Hi Christian,

Intel drives are good, but apparently not infallible. I'm watching a DC
S3610 480GB die from reallocated sectors.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
  9 Power_On_Hours          -O--CK   100   100   000    -    1065
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
194 Temperature_Celsius     -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
242 Total_LBAs_Read         -O--CK   100   100   000    -    92945

The Reallocated_Sector_Ct is increasing about once a minute. I'm not
sure how many reserved sectors the drive has, i.e., how soon before it
starts throwing write IO errors.
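
One rough way to keep an eye on this is to parse the attribute table and report how far each normalized VALUE sits above its THRESH. The sketch below is illustrative only: it assumes the column layout shown above (which looks like smartctl's brief attribute format), the helper names parse_brief_attributes and headroom are made up for this example, and whether the normalized value of attribute 232 really tracks the remaining reserve is an assumption about this drive's firmware.

```python
import re

# One attribute row: ID, NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW (leading integer only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

# A few rows copied from the table above, including the wrapped line for ID 190.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max
25/35)
232 Available_Reservd_Space PO--CK   084   084   010    -    0
"""


def parse_brief_attributes(text):
    """Return {attribute_id: dict} for every line that looks like an attribute row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or the tail of a wrapped row
        attr_id, name, flags, value, worst, thresh, fail, raw = m.groups()
        attrs[int(attr_id)] = {
            "name": name,
            "flags": flags,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "raw": int(raw),
        }
    return attrs


def headroom(attr):
    """Normalized points left before VALUE reaches THRESH."""
    return attr["value"] - attr["thresh"]


attrs = parse_brief_attributes(SAMPLE)
for attr_id in (5, 232):
    a = attrs[attr_id]
    print(attr_id, a["name"], "raw:", a["raw"], "headroom:", headroom(a))
```

Run from cron against fresh output, something like this would at least show whether the normalized values are moving towards their thresholds as the raw reallocation count climbs.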

It's a very young drive, with only 1065 hours on the clock, and has not
even done two full drive-writes:

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  1  =====  =                =  == General Statistics (rev 2) ==
  1  0x008  4                7  Lifetime Power-On Resets
  1  0x018  6       1319318736  Logical Sectors Written
  1  0x020  6        137121729  Number of Write Commands
  1  0x028  6       6091245600  Logical Sectors Read
  1  0x030  6        115252407  Number of Read Commands
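
For reference, the arithmetic behind the drive-writes figure can be sketched as below. It assumes 512-byte logical sectors and a nominal capacity of 480 * 10^9 bytes, neither of which is stated in the log output; full_drive_writes is just an illustrative helper name.

```python
SECTOR_BYTES = 512             # assumption: 512-byte logical sectors
CAPACITY_BYTES = 480 * 10**9   # assumption: nominal "480GB" in decimal bytes


def full_drive_writes(logical_sectors_written,
                      sector_bytes=SECTOR_BYTES,
                      capacity_bytes=CAPACITY_BYTES):
    """Host bytes written, expressed as multiples of the drive capacity."""
    return logical_sectors_written * sector_bytes / capacity_bytes


# "Logical Sectors Written" from the Device Statistics log above.
writes = full_drive_writes(1319318736)
print(f"{writes:.2f} full drive-writes")
```

That works out to roughly 675 GB written by the host, or about 1.4 times the drive's capacity.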

Fortunately this drive is not used as a Ceph journal. It's in an mdraid
RAID5 array :-|

Cheers,
Daniel

On 03/08/16 07:45, Christian Balzer wrote:
> 
> Hello,
> 
> not a Ceph-specific issue, but this is probably the largest sample of
> SSD users I'm familiar with. ^o^
> 
> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
> religious experience.
> 
> It turns out that the SMART check plugin I run, mostly to get an early
> wearout warning, detected a "Power_Loss_Cap_Test" failure in one of the
> 200GB DC S3700s used for journals.
> 
> While SMART is of the opinion that this drive is failing and will explode
> spectacularly any moment, that particular failure is of little concern to
> me, never mind that I'll eventually replace this unit.
> 
> What brings me here is that this is the first time in over 3 years that an
> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
> this particular failure has been seen by others.
> 
> That of course entails people actually monitoring for these things. ^o^
> 
> Thanks,
> 
> Christian
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com