Hi,

I haven’t had problems with Power_Loss_Cap_Test so far.

Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check “Available
Reserved Space” (SMART ID: 232/E8h). The data sheet
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
reads:

"This attribute reports the number of reserve blocks remaining. The
normalized value begins at 100 (64h), which corresponds to 100 percent
availability of the reserved space. The threshold value for this attribute
is 10 percent availability."

According to the SMART data you copied, about 84% of the over-provisioning
should still be left. Since the drive is pretty young, this might be some
form of defect. I have a number of S3610s with ~150 drive writes, and all
their SMART counters are still at their initial values (except for the
temperature).
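For what it's worth, here is a rough sketch of how one could keep an eye on
those two attributes with smartmontools and a few lines of Python. The device
path, the warning threshold and the exact attribute names (copied from the
smartctl output quoted below) are assumptions, so treat it as a starting
point rather than a finished check:

#!/usr/bin/env python
# Rough sketch: read the two attributes discussed above via smartctl and
# warn when the reserved space runs low. Assumes smartmontools is installed
# and that the script has enough privileges to query the drive.
# The device path and the warning threshold are made-up examples.

import subprocess
import sys

DEVICE = "/dev/sdb"      # hypothetical device node, point it at the S3610
WARN_NORMALIZED = 20     # warn well before the documented threshold of 10

def smart_attributes(device):
    """Return {attribute_name: (normalized_value, raw_value_string)}."""
    out = subprocess.check_output(["smartctl", "-A", device],
                                  universal_newlines=True)
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the normalized VALUE is the
        # 4th column and the raw value is the last field (good enough for the
        # two plain counters we care about here).
        if len(fields) >= 6 and fields[0].isdigit() and fields[3].isdigit():
            attrs[fields[1]] = (int(fields[3]), fields[-1])
    return attrs

def main():
    attrs = smart_attributes(DEVICE)
    reserved = attrs.get("Available_Reservd_Space", (None, None))[0]
    realloc = attrs.get("Reallocated_Sector_Ct", (None, "0"))[1]

    print("Reallocated sectors: %s" % realloc)
    if reserved is not None:
        # The normalized value starts at 100 (100% of reserve blocks left),
        # per the data sheet; the vendor threshold is 10.
        print("Reserved space left: ~%d%%" % reserved)
        if reserved <= WARN_NORMALIZED:
            sys.exit("WARNING: reserved space down to %d%%" % reserved)

if __name__ == "__main__":
    main()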
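The "not even two full drive-writes" figure also checks out against the
device statistics quoted below; a throwaway calculation, assuming 512-byte
logical sectors and the 480 GB nominal capacity:

sectors_written = 1319318736   # GP Log 0x04, "Logical Sectors Written"
sector_size = 512              # bytes; assumed logical sector size
capacity_bytes = 480e9         # 480 GB usable capacity (decimal), assumed

bytes_written = sectors_written * sector_size
print("written: %.1f GB" % (bytes_written / 1e9))                     # ~675.5 GB
print("full drive writes: %.2f" % (bytes_written / capacity_bytes))   # ~1.41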
Cheers,
Maxime

On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of daniel.swarbrick@xxxxxxxxxxxxxxxx> wrote:

>Hi Christian,
>
>Intel drives are good, but apparently not infallible. I'm watching a DC
>S3610 480GB die from reallocated sectors.
>
>ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>  9 Power_On_Hours          -O--CK   100   100   000    -    1065
> 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>184 End-to-End_Error        PO--CK   100   100   090    -    0
>187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>194 Temperature_Celsius     -O---K   100   100   000    -    30
>197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>232 Available_Reservd_Space PO--CK   084   084   010    -    0
>233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>
>The Reallocated_Sector_Ct is increasing about once a minute. I'm not sure
>how many reserved sectors the drive has, i.e., how soon it will start
>throwing write I/O errors.
>
>It's a very young drive, with only 1065 hours on the clock, and it has not
>even done two full drive-writes:
>
>Device Statistics (GP Log 0x04)
>Page Offset Size        Value  Description
>  1  =====  =               =   == General Statistics (rev 2) ==
>  1  0x008  4               7   Lifetime Power-On Resets
>  1  0x018  6      1319318736   Logical Sectors Written
>  1  0x020  6       137121729   Number of Write Commands
>  1  0x028  6      6091245600   Logical Sectors Read
>  1  0x030  6       115252407   Number of Read Commands
>
>Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>RAID5 array :-|
>
>Cheers,
>Daniel
>
>On 03/08/16 07:45, Christian Balzer wrote:
>>
>> Hello,
>>
>> Not a Ceph-specific issue, but this is probably the largest sample size
>> of SSD users I'm familiar with. ^o^
>>
>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having
>> a religious experience.
>>
>> It turns out that the SMART check plugin I run mostly to get an early
>> wear-out warning detected a "Power_Loss_Cap_Test" failure in one of the
>> 200GB DC S3700s used for journals.
>>
>> While SMART is of the opinion that this drive is failing and will explode
>> spectacularly any moment, that particular failure is of little worry to
>> me, never mind that I'll eventually replace this unit.
>>
>> What brings me here is that this is the first time in over 3 years that
>> an Intel SSD has shown a (harmless, in this case) problem, so I'm
>> wondering if this particular failure has been seen by others.
>>
>> That of course entails people actually monitoring for these things. ^o^
>>
>> Thanks,
>>
>> Christian
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com