Make sure you are reading the right attribute and interpreting it correctly; update-smart-drivedb sometimes works wonders :)

I wonder what the isdct tool would say the drive's life expectancy is with this workload? Are you really writing ~600TB/month??
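In case it helps to double-check the interpretation, here is a minimal sketch (not the monitoring plugin discussed further down) that pulls the attribute table with smartctl and prints the values this thread keeps coming back to. The 32 MiB unit for attribute 241 is an assumption on my part for these Intel DC drives (an up-to-date drivedb labels the attribute Host_Writes_32MiB), and the device path is just an example; isdct should give a life-expectancy estimate directly, without any of this parsing.

#!/usr/bin/env python3
# Rough sketch, not the actual check plugin from this thread.
# Assumes: smartmontools is installed, the device is SATA, and -- for the
# Intel DC S3x00/S36x0 family -- that the raw value of attribute 241 counts
# host writes in 32 MiB units (an up-to-date drivedb calls it Host_Writes_32MiB).
import subprocess
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"

# -f brief produces the compact layout Daniel pasted:
# ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
out = subprocess.run(["smartctl", "-A", "-f", "brief", dev],
                     capture_output=True, text=True).stdout

attrs = {}
for line in out.splitlines():
    f = line.split()
    if f and f[0].isdigit():
        attrs[int(f[0])] = {"name": f[1], "value": int(f[3]),
                            "thresh": int(f[5]), "raw": f[7]}

# 232 (Available Reserved Space): normalized value starts at 100; per the
# S3610 data sheet the attribute is flagged once only 10% is left.
if 232 in attrs:
    a = attrs[232]
    print(f"reserved space left: ~{a['value']}% (failure threshold {a['thresh']}%)")

# 5: raw count of reallocated sectors (grown defects).
if 5 in attrs:
    print(f"reallocated sectors: {attrs[5]['raw']}")

# 241: assuming 32 MiB units, a raw value of 20131 works out to roughly
# 0.6 TiB of host writes in total -- the unit matters a lot here.
if 241 in attrs:
    tib = int(attrs[241]["raw"]) * 32 / (1024 * 1024)
    print(f"host writes: ~{tib:.2f} TiB (assuming 32 MiB units)")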
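And for Daniel's question further down about how soon the reserve runs out: a crude extrapolation from the numbers in his smartctl dump (normalized 232 at 84, 756 reallocated sectors, growing roughly once a minute). The assumption that the normalized value scales linearly with consumed reserve blocks is mine, not something from the data sheet, so treat the result as an order-of-magnitude guess at best.

#!/usr/bin/env python3
# Back-of-the-envelope only: assumes attribute 232 drops linearly as reserve
# blocks are consumed and that reallocations keep arriving at the observed
# rate. Neither assumption is guaranteed by Intel's documentation.

realloc = 756        # Reallocated_Sector_Ct raw value from the dump below
reserve_left = 84    # normalized value of attribute 232 (roughly percent)
threshold = 10       # normalized value at which the drive reports failure
rate_per_hour = 60   # observed: about one new reallocation per minute

used_pct = 100 - reserve_left            # ~16% of the reserve consumed so far
realloc_per_pct = realloc / used_pct     # ~47 reallocations per percent
remaining = (reserve_left - threshold) * realloc_per_pct

print(f"~{remaining:.0f} reallocations left before the {threshold}% threshold,")
print(f"i.e. roughly {remaining / rate_per_hour:.0f} hours at the current rate")

If it is a media defect rather than plain wear, that rate could of course change abruptly in either direction.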
Jan

> On 03 Aug 2016, at 12:06, Maxime Guyot <Maxime.Guyot@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I haven’t had problems with Power_Loss_Cap_Test so far.
> 
> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the “Available Reserved Space” (SMART ID: 232/E8h); the data sheet (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf) reads:
> 
> "This attribute reports the number of reserve blocks remaining. The
> normalized value begins at 100 (64h), which corresponds to 100 percent
> availability of the reserved space. The threshold value for this
> attribute is 10 percent availability."
> 
> According to the SMART data you copied, it should be about 84% of the over-provisioning left? Since the drive is pretty young, it might be some form of defect?
> I have a number of S3610s with ~150 DW; all SMART counters are at their initial values (except for the temperature).
> 
> Cheers,
> Maxime
> 
> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of daniel.swarbrick@xxxxxxxxxxxxxxxx> wrote:
> 
>> Hi Christian,
>> 
>> Intel drives are good, but apparently not infallible. I'm watching a DC
>> S3610 480GB die from reallocated sectors.
>> 
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>> 
>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>> sure how many reserved sectors the drive has, i.e., how soon before it
>> starts throwing write IO errors.
>> 
>> It's a very young drive, with only 1065 hours on the clock, and it has
>> not even done two full drive-writes:
>> 
>> Device Statistics (GP Log 0x04)
>> Page Offset Size        Value  Description
>>   1  =====  =               =  == General Statistics (rev 2) ==
>>   1  0x008  4               7  Lifetime Power-On Resets
>>   1  0x018  6      1319318736  Logical Sectors Written
>>   1  0x020  6       137121729  Number of Write Commands
>>   1  0x028  6      6091245600  Logical Sectors Read
>>   1  0x030  6       115252407  Number of Read Commands
>> 
>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>> RAID5 array :-|
>> 
>> Cheers,
>> Daniel
>> 
>> On 03/08/16 07:45, Christian Balzer wrote:
>>> 
>>> Hello,
>>> 
>>> not a Ceph specific issue, but this is probably the largest sample size of
>>> SSD users I'm familiar with. ^o^
>>> 
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>> 
>>> It turns out that the SMART check plugin I run, mostly to get an early
>>> wearout warning, detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700s used for journals.
>>> 
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment, that particular failure is of little worry to
>>> me, never mind that I'll eventually replace this unit.
>>> 
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>> 
>>> That of course entails people actually monitoring for these things. ^o^
>>> 
>>> Thanks,
>>> 
>>> Christian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com