I'm a fool, I miscalculated the writes by a factor of 1000, of course :-)
600GB/month is not much for an S36xx at all, so it must be some sort of defect then...

Jan

> On 03 Aug 2016, at 12:15, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes works wonders :)
>
> I wonder what the isdct tool would say the drive's life expectancy is with this workload?
> Are you really writing ~600TB/month??
>
> Jan
>
>> On 03 Aug 2016, at 12:06, Maxime Guyot <Maxime.Guyot@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I haven't had problems with Power_Loss_Cap_Test so far.
>>
>> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check "Available Reserved Space"
>> (SMART ID: 232/E8h). The data sheet
>> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>> reads:
>>
>> "This attribute reports the number of reserve blocks remaining. The normalized value begins
>> at 100 (64h), which corresponds to 100 percent availability of the reserved space. The
>> threshold value for this attribute is 10 percent availability."
>>
>> According to the SMART data you copied, about 84% of the over-provisioning should be left.
>> Since the drive is pretty young, it might be some form of defect?
>> I have a number of S3610s with ~150 DW; all SMART counters are still at their initial values
>> (except for the temperature).
>>
>> Cheers,
>> Maxime
>>
>> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of daniel.swarbrick@xxxxxxxxxxxxxxxx> wrote:
>>
>>> Hi Christian,
>>>
>>> Intel drives are good, but apparently not infallible. I'm watching a DC
>>> S3610 480GB die from reallocated sectors.
>>>
>>> ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
>>>   5 Reallocated_Sector_Ct   -O--CK  081   081   000    -    756
>>>   9 Power_On_Hours          -O--CK  100   100   000    -    1065
>>>  12 Power_Cycle_Count       -O--CK  100   100   000    -    7
>>> 175 Program_Fail_Count_Chip PO--CK  100   100   010    -    17454078318
>>> 183 Runtime_Bad_Block       -O--CK  100   100   000    -    0
>>> 184 End-to-End_Error        PO--CK  100   100   090    -    0
>>> 187 Reported_Uncorrect      -O--CK  100   100   000    -    0
>>> 190 Airflow_Temperature_Cel -O---K  070   065   000    -    30 (Min/Max 25/35)
>>> 192 Power-Off_Retract_Count -O--CK  100   100   000    -    6
>>> 194 Temperature_Celsius     -O---K  100   100   000    -    30
>>> 197 Current_Pending_Sector  -O--C-  100   100   000    -    1288
>>> 199 UDMA_CRC_Error_Count    -OSRCK  100   100   000    -    0
>>> 228 Power-off_Retract_Count -O--CK  100   100   000    -    63889
>>> 232 Available_Reservd_Space PO--CK  084   084   010    -    0
>>> 233 Media_Wearout_Indicator -O--CK  100   100   000    -    0
>>> 241 Total_LBAs_Written      -O--CK  100   100   000    -    20131
>>> 242 Total_LBAs_Read         -O--CK  100   100   000    -    92945
>>>
>>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>>> sure how many reserved sectors the drive has, i.e. how soon it will
>>> start throwing write IO errors.
>>>
>>> It's a very young drive, with only 1065 hours on the clock, and it has not
>>> even done two full drive writes:
>>>
>>> Device Statistics (GP Log 0x04)
>>> Page Offset Size        Value  Description
>>>    1 =====  =               =  == General Statistics (rev 2) ==
>>>    1 0x008  4               7  Lifetime Power-On Resets
>>>    1 0x018  6      1319318736  Logical Sectors Written
>>>    1 0x020  6       137121729  Number of Write Commands
>>>    1 0x028  6      6091245600  Logical Sectors Read
>>>    1 0x030  6       115252407  Number of Read Commands
>>>
>>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>>> RAID5 array :-|
>>>
>>> Cheers,
>>> Daniel
>>>
>>> On 03/08/16 07:45, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> not a Ceph-specific issue, but this is probably the largest sample size of
>>>> SSD users I'm familiar with. ^o^
>>>>
>>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>>> religious experience.
>>>>
>>>> It turns out that the SMART check plugin I run, mostly to get an early
>>>> wear-out warning, detected a "Power_Loss_Cap_Test" failure in one of the
>>>> 200GB DC S3700s used for journals.
>>>>
>>>> While SMART is of the opinion that this drive is failing and will explode
>>>> spectacularly any moment, that particular failure is of little worry to
>>>> me, never mind that I'll eventually replace this unit.
>>>>
>>>> What brings me here is that this is the first time in over 3 years that an
>>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>>> this particular failure has been seen by others.
>>>>
>>>> That of course entails people actually monitoring for these things. ^o^
>>>>
>>>> Thanks,
>>>>
>>>> Christian
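
For anyone who wants to redo the write-volume arithmetic from the numbers quoted above (this is
where my factor-of-1000 slip came from), here is a quick back-of-envelope sketch. It assumes the
usual Intel convention that the Total_LBAs_Written (attribute 241) raw value is counted in 32 MiB
units; that unit is an assumption worth verifying against the drive's datasheet.

#!/usr/bin/env python3
# Back-of-envelope check of host writes on the DC S3610 quoted above.
# Assumption: attr 241 raw value is in 32 MiB units (verify for your model).

SECTOR_BYTES = 512                       # logical sector size
HOST_WRITE_UNIT = 32 * 1024 * 1024       # assumed unit of SMART attribute 241

logical_sectors_written = 1_319_318_736  # GP Log 0x04, "Logical Sectors Written"
attr_241_raw = 20_131                    # SMART 241 Total_LBAs_Written raw value
power_on_hours = 1_065                   # SMART 9 Power_On_Hours
capacity_bytes = 480 * 1000**3           # 480 GB drive

written_from_gplog = logical_sectors_written * SECTOR_BYTES
written_from_241 = attr_241_raw * HOST_WRITE_UNIT

print(f"GP log:   {written_from_gplog / 1e9:.0f} GB written")    # ~675 GB
print(f"attr 241: {written_from_241 / 1e9:.0f} GB written")      # ~675 GB, consistent with the unit assumption

drive_writes = written_from_gplog / capacity_bytes
print(f"full drive writes: {drive_writes:.2f}")                  # ~1.4, i.e. "not even two"

gb_per_month = written_from_gplog / 1e9 / (power_on_hours / (24 * 30))
print(f"~{gb_per_month:.0f} GB/month")                           # hundreds of GB/month, not TB

Both counters land at roughly 675 GB total, i.e. a few hundred GB per month, which is nothing for
an S36xx, so the reallocations really do look like a defect rather than wear.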
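
And since Christian's original point was about actually monitoring these things: below is a
minimal sketch of the kind of check a Nagios-style plugin can do for the two attributes discussed
in this thread, Reallocated_Sector_Ct and Available_Reservd_Space. It parses smartctl's brief
attribute table (smartctl -A -f brief, the format quoted above); the device path and the warning
threshold are placeholders, not anything smartctl itself defines.

#!/usr/bin/env python3
"""Minimal SMART check sketch for the attributes discussed in this thread."""
import subprocess
import sys

DEVICE = "/dev/sda"      # placeholder: the SSD to check
RESERVE_WARN = 90        # warn when Available_Reservd_Space (normalized) drops below this;
                         # the drive itself only fails the attribute at its own THRESH of 10

def read_attributes(device):
    """Return {attribute_name: (normalized_value, threshold, raw_value_string)}."""
    out = subprocess.run(
        ["smartctl", "-A", "-f", "brief", device],
        capture_output=True, text=True, check=False,
    ).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, then name, flags, value, worst, thresh, fail, raw.
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), int(fields[5]), " ".join(fields[7:]))
    return attrs

def main():
    attrs = read_attributes(DEVICE)
    problems = []

    realloc = attrs.get("Reallocated_Sector_Ct")
    if realloc and int(realloc[2].split()[0]) > 0:
        problems.append(f"Reallocated_Sector_Ct raw={realloc[2]}")

    reserve = attrs.get("Available_Reservd_Space")
    if reserve and reserve[0] < RESERVE_WARN:
        problems.append(f"Available_Reservd_Space at {reserve[0]}% (drive fails at {reserve[1]})")

    if problems:
        print("WARNING: " + "; ".join(problems))
        sys.exit(1)
    print("OK")

if __name__ == "__main__":
    main()

A failed Power_Loss_Cap_Test would surface the same way once smartctl knows the Intel-specific
attribute names for the drive, which is presumably what the update-smart-drivedb hint was about.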