Re: Intel SSD (DC S3700) Power_Loss_Cap_Test failure

Make sure you are reading the right attribute and interpreting it right.
update-smart-drivedb sometimes works wonders :)

I wonder what the isdct tool would say the drive's life expectancy is with this workload. Are you really writing ~600TB/month??
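
If you want to double-check what the drive itself reports, something like the sketch below is enough. It is rough and unpolished, and it assumes that attribute 241 counts 32 MiB units on these Intel DC drives; verify that against the GP log's "Logical Sectors Written" before trusting the number.

#!/usr/bin/env python3
# Rough sanity check of the usual suspects via smartctl.
# Assumption: attribute 241 (Total_LBAs_Written) is reported in 32 MiB
# units on these Intel DC drives; cross-check against the GP log
# ("Logical Sectors Written" x 512 B) before trusting it.
import subprocess, sys

dev = sys.argv[1]   # e.g. /dev/sda
out = subprocess.check_output(["smartctl", "-A", dev], text=True)

attrs = {}
for line in out.splitlines():
    f = line.split()
    if f and f[0].isdigit():
        attrs[int(f[0])] = f              # keyed by SMART attribute ID

print("Reallocated sectors (raw):       ", attrs[5][-1])
print("Available reserved space (norm.):", attrs[232][3])
print("Media wearout indicator (norm.): ", attrs[233][3])
print("Host writes: ~%.0f GiB" % (int(attrs[241][-1]) * 32 / 1024.0))

For the S3610 quoted below, 20131 x 32 MiB works out to roughly 629 GiB in total, which lines up with the GP log's sector count, so at least that counter looks consistent.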

Jan


> On 03 Aug 2016, at 12:06, Maxime Guyot <Maxime.Guyot@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I haven’t had problems with Power_Loss_Cap_Test so far. 
> 
> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the “Available Reserved Space” attribute (SMART ID: 232/E8h); the data sheet (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf) reads:
> "This attribute reports the number of reserve blocks remaining. The normalized value begins at 100 (64h), which corresponds to 100 percent availability of the reserved space. The threshold value for this attribute is 10 percent availability."
> 
> According to the SMART data you copied, that would be about 84% of the over-provisioning left? Since the drive is pretty young, it might be some form of defect?
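> 
> To put that in code form (a trivial sketch, assuming the normalized VALUE of attribute 232 maps one-to-one to the percentage of reserve remaining, as the data sheet describes):
> 
> def reserve_left(value, thresh=10):
>     # The normalized VALUE of attribute 232 starts at 100 (= 100% of
>     # the reserved blocks remaining); the attribute is considered
>     # failed once it drops to the 10% threshold.
>     return {"percent_left": value, "failed": value <= thresh}
> 
> print(reserve_left(84))   # -> {'percent_left': 84, 'failed': False}
> 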
> I have a number of S3610s with ~150 DW; all SMART counters are at their initial values (except for the temperature).
> 
> Cheers,
> Maxime
> 
> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of daniel.swarbrick@xxxxxxxxxxxxxxxx> wrote:
> 
>> Hi Christian,
>> 
>> Intel drives are good, but apparently not infallible. I'm watching a DC
>> S3610 480GB die from reallocated sectors.
>> 
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>> 5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>> 9 Power_On_Hours          -O--CK   100   100   000    -    1065
>> 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>> 
>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>> sure how many reserved sectors the drive has, i.e., how soon before it
>> starts throwing write IO errors.
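>> 
>> In case it's useful: a rough way to watch the decay is to poll the counters
>> periodically, something like the sketch below (device path and polling
>> interval are placeholders):
>> 
>> import subprocess, time
>> 
>> def attr(dev, attr_id, field=-1):
>>     # field=-1 -> raw value, field=3 -> normalized value
>>     out = subprocess.check_output(["smartctl", "-A", dev], text=True)
>>     for line in out.splitlines():
>>         f = line.split()
>>         if f and f[0] == str(attr_id):
>>             return int(f[field])
>> 
>> while True:
>>     realloc = attr("/dev/sdX", 5)        # reallocated sectors, raw value
>>     reserve = attr("/dev/sdX", 232, 3)   # % of reserved space left
>>     print(time.strftime("%Y-%m-%d %H:%M:%S"),
>>           "reallocated:", realloc, "reserve left:", reserve)
>>     time.sleep(600)                      # sample every 10 minutes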
>> 
>> It's a very young drive, with only 1065 hours on the clock, and has not
>> even done two full drive-writes:
>> 
>> Device Statistics (GP Log 0x04)
>> Page Offset Size         Value  Description
>> 1  =====  =                =  == General Statistics (rev 2) ==
>> 1  0x008  4                7  Lifetime Power-On Resets
>> 1  0x018  6       1319318736  Logical Sectors Written
>> 1  0x020  6        137121729  Number of Write Commands
>> 1  0x028  6       6091245600  Logical Sectors Read
>> 1  0x030  6        115252407  Number of Read Commands
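>> 
>> (Rough arithmetic on the above: 1,319,318,736 logical sectors x 512 B is
>> about 675 GB, i.e. roughly 1.4 full writes of the 480 GB capacity.)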
>> 
>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>> RAID5 array :-|
>> 
>> Cheers,
>> Daniel
>> 
>> On 03/08/16 07:45, Christian Balzer wrote:
>>> 
>>> Hello,
>>> 
>>> not a Ceph-specific issue, but this is probably the largest sample of
>>> SSD users I'm familiar with. ^o^
>>> 
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>> 
>>> It turns out that the SMART check plugin I run (mostly to get an early
>>> wear-out warning) detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700s used for journals.
>>> 
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment, that particular failure is of little worry to me,
>>> and I'll eventually replace the unit anyway.
>>> 
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>> 
>>> That of course entails people actually monitoring for these things. ^o^
>>> 
>>> Thanks,
>>> 
>>> Christian
>>> 
>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



