I'm a fool, I miscalculated the writes by a factor of 1000, of course :-)
600GB/month is not much for an S36xx at all, so it must be some sort of defect then...

Jan

> On 03 Aug 2016, at 12:15, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes works wonders :)
>
> I wonder what the isdct tool would say the drive's life expectancy is with this workload?
> Are you really writing ~600TB/month??
>
> Jan
>
>> On 03 Aug 2016, at 12:06, Maxime Guyot <Maxime.Guyot@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I haven't had problems with Power_Loss_Cap_Test so far.
>>
>> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check "Available Reserved Space"
>> (SMART ID: 232/E8h). The data sheet
>> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>> reads:
>>
>> "This attribute reports the number of reserve blocks remaining. The normalized value begins
>> at 100 (64h), which corresponds to 100 percent availability of the reserved space. The
>> threshold value for this attribute is 10 percent availability."
>>
>> According to the SMART data you copied, about 84% of the over-provisioning should be left.
>> Since the drive is pretty young, it might be some form of defect?
>> I have a number of S3610s with ~150 DW; all SMART counters are still at their initial values
>> (except for the temperature).
>>
>> Cheers,
>> Maxime
>>
>> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of daniel.swarbrick@xxxxxxxxxxxxxxxx> wrote:
>>
>>> Hi Christian,
>>>
>>> Intel drives are good, but apparently not infallible. I'm watching a DC
>>> S3610 480GB die from reallocated sectors.
>>>
>>> ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
>>>   5 Reallocated_Sector_Ct   -O--CK  081   081   000    -    756
>>>   9 Power_On_Hours          -O--CK  100   100   000    -    1065
>>>  12 Power_Cycle_Count       -O--CK  100   100   000    -    7
>>> 175 Program_Fail_Count_Chip PO--CK  100   100   010    -    17454078318
>>> 183 Runtime_Bad_Block       -O--CK  100   100   000    -    0
>>> 184 End-to-End_Error        PO--CK  100   100   090    -    0
>>> 187 Reported_Uncorrect      -O--CK  100   100   000    -    0
>>> 190 Airflow_Temperature_Cel -O---K  070   065   000    -    30 (Min/Max 25/35)
>>> 192 Power-Off_Retract_Count -O--CK  100   100   000    -    6
>>> 194 Temperature_Celsius     -O---K  100   100   000    -    30
>>> 197 Current_Pending_Sector  -O--C-  100   100   000    -    1288
>>> 199 UDMA_CRC_Error_Count    -OSRCK  100   100   000    -    0
>>> 228 Power-off_Retract_Count -O--CK  100   100   000    -    63889
>>> 232 Available_Reservd_Space PO--CK  084   084   010    -    0
>>> 233 Media_Wearout_Indicator -O--CK  100   100   000    -    0
>>> 241 Total_LBAs_Written      -O--CK  100   100   000    -    20131
>>> 242 Total_LBAs_Read         -O--CK  100   100   000    -    92945
>>>
>>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>>> sure how many reserved sectors the drive has, i.e. how soon it will
>>> start throwing write IO errors.
>>>
>>> It's a very young drive, with only 1065 hours on the clock, and it has not
>>> even done two full drive writes:
>>>
>>> Device Statistics (GP Log 0x04)
>>> Page Offset Size        Value  Description
>>>    1 =====  =               =  == General Statistics (rev 2) ==
>>>    1 0x008  4               7  Lifetime Power-On Resets
>>>    1 0x018  6      1319318736  Logical Sectors Written
>>>    1 0x020  6       137121729  Number of Write Commands
>>>    1 0x028  6      6091245600  Logical Sectors Read
>>>    1 0x030  6       115252407  Number of Read Commands
>>>
>>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>>> RAID5 array :-|
>>>
>>> Cheers,
>>> Daniel
>>>
>>> On 03/08/16 07:45, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> not a Ceph-specific issue, but this is probably the largest sample size of
>>>> SSD users I'm familiar with. ^o^
>>>>
>>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>>> religious experience.
>>>>
>>>> It turns out that the SMART check plugin I run, mostly to get an early
>>>> wear-out warning, detected a "Power_Loss_Cap_Test" failure in one of the
>>>> 200GB DC S3700s used for journals.
>>>>
>>>> While SMART is of the opinion that this drive is failing and will explode
>>>> spectacularly any moment, that particular failure is of little worry to
>>>> me, never mind that I'll eventually replace this unit.
>>>>
>>>> What brings me here is that this is the first time in over 3 years that an
>>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>>> this particular failure has been seen by others.
>>>>
>>>> That of course entails people actually monitoring for these things. ^o^
>>>>
>>>> Thanks,
>>>>
>>>> Christian
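
For anyone who wants to redo the write-volume arithmetic from the numbers quoted above (this is
where my factor-of-1000 slip came from), here is a quick back-of-envelope sketch. It assumes the
usual Intel convention that the Total_LBAs_Written (attribute 241) raw value is counted in 32 MiB
units; that unit is an assumption worth verifying against the drive's datasheet.

#!/usr/bin/env python3
# Back-of-envelope check of host writes on the DC S3610 quoted above.
# Assumption: attr 241 raw value is in 32 MiB units (verify for your model).

SECTOR_BYTES = 512                       # logical sector size
HOST_WRITE_UNIT = 32 * 1024 * 1024       # assumed unit of SMART attribute 241

logical_sectors_written = 1_319_318_736  # GP Log 0x04, "Logical Sectors Written"
attr_241_raw = 20_131                    # SMART 241 Total_LBAs_Written raw value
power_on_hours = 1_065                   # SMART 9 Power_On_Hours
capacity_bytes = 480 * 1000**3           # 480 GB drive

written_from_gplog = logical_sectors_written * SECTOR_BYTES
written_from_241 = attr_241_raw * HOST_WRITE_UNIT

print(f"GP log:   {written_from_gplog / 1e9:.0f} GB written")    # ~675 GB
print(f"attr 241: {written_from_241 / 1e9:.0f} GB written")      # ~675 GB, consistent with the unit assumption

drive_writes = written_from_gplog / capacity_bytes
print(f"full drive writes: {drive_writes:.2f}")                  # ~1.4, i.e. "not even two"

gb_per_month = written_from_gplog / 1e9 / (power_on_hours / (24 * 30))
print(f"~{gb_per_month:.0f} GB/month")                           # hundreds of GB/month, not TB

Both counters land at roughly 675 GB total, i.e. a few hundred GB per month, which is nothing for
an S36xx, so the reallocations really do look like a defect rather than wear.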
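
And since Christian's original point was about actually monitoring these things: below is a
minimal sketch of the kind of check a Nagios-style plugin can do for the two attributes discussed
in this thread, Reallocated_Sector_Ct and Available_Reservd_Space. It parses smartctl's brief
attribute table (smartctl -A -f brief, the format quoted above); the device path and the warning
threshold are placeholders, not anything smartctl itself defines.

#!/usr/bin/env python3
"""Minimal SMART check sketch for the attributes discussed in this thread."""
import subprocess
import sys

DEVICE = "/dev/sda"      # placeholder: the SSD to check
RESERVE_WARN = 90        # warn when Available_Reservd_Space (normalized) drops below this;
                         # the drive itself only fails the attribute at its own THRESH of 10

def read_attributes(device):
    """Return {attribute_name: (normalized_value, threshold, raw_value_string)}."""
    out = subprocess.run(
        ["smartctl", "-A", "-f", "brief", device],
        capture_output=True, text=True, check=False,
    ).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, then name, flags, value, worst, thresh, fail, raw.
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = (int(fields[3]), int(fields[5]), " ".join(fields[7:]))
    return attrs

def main():
    attrs = read_attributes(DEVICE)
    problems = []

    realloc = attrs.get("Reallocated_Sector_Ct")
    if realloc and int(realloc[2].split()[0]) > 0:
        problems.append(f"Reallocated_Sector_Ct raw={realloc[2]}")

    reserve = attrs.get("Available_Reservd_Space")
    if reserve and reserve[0] < RESERVE_WARN:
        problems.append(f"Available_Reservd_Space at {reserve[0]}% (drive fails at {reserve[1]})")

    if problems:
        print("WARNING: " + "; ".join(problems))
        sys.exit(1)
    print("OK")

if __name__ == "__main__":
    main()

A failed Power_Loss_Cap_Test would surface the same way once smartctl knows the Intel-specific
attribute names for the drive, which is presumably what the update-smart-drivedb hint was about.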