Re: No I/O errors reported after SATA link hard reset


 



On 26-08-2017 22:58 sonofagun@xxxxxxxxxxxxxxx wrote:
Hello guys, this is a very interesting thread but I will join it tomorrow!

I have read a similar discussion for SSDs some time ago. That took
place here [1]. Corruption of such devices can lead to complete data
loss and not just corruption.

I just read the thread at https://marc.info/?t=149186660400002&r=1&w=2; it was very interesting. However, it seems to me that it ended without a clear solution, right?

Anyway, the opacity of the FTL (flash translation layer) is surely a significant cause of concern/danger. Unexpected power losses can wreak havoc on SSDs.

Please install smartmontools and post its output here for each disk so
that I can see if your disks are healthy. Also I must see their
firmware version as there might be a firmware update available.

Fortunately, the issue is solved now: I traced it back to a faulty SATA power cable. However, the SMART reports of both disks are very interesting:


GOOD DISK (sda):
[root@nas ~]# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       30483624
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       55353954
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8535
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 30/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Note the low (expected) Start_Stop_Count (46)


BAD DISK (sdb):
[root@nas ~]# smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   106   099   006    Pre-fail  Always       -       11030016
  3 Spin_Up_Time            0x0003   095   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       661
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       60912204
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   061   045    Old_age   Always       -       33 (Min/Max 29/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       639
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       672
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Note the *much* higher Start_Stop_Count (661); however, the Power_Cycle_Count is the same as on sda (44).

So yes, while HDDs are surely more resilient than SSDs to unexpected power losses, a micro power loss that corrupts or invalidates the disk's cache contents without giving the host a chance to notice *will* cause data corruption, sometimes even on acknowledged synchronized writes (I had a filesystem journal corruption).

However, as stated in this thread, SATA does not really have a provision to detect commands that failed due to micro power losses, nor to detect an invalid or corrupted disk cache. So it seems the best "line of defense" is to monitor (via SMART) the start/stop and power cycle counts.
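
For anyone who wants to automate that check, here is a rough Python sketch of the idea: parse the `smartctl -A` output and compare Start_Stop_Count against Power_Cycle_Count. The function names and the `tolerance` threshold are my own assumptions for illustration, not anything provided by smartmontools, and the two counters can legitimately drift apart on some drives (for example with standby spin-down), so treat a large gap as a hint to investigate rather than as proof of a fault.

```python
import re
import subprocess

# One attribute row of `smartctl -A`: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, then the raw value (leading integer only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def excess_spin_ups(attrs):
    """Spin-ups that are not explained by a counted power cycle."""
    return attrs["Start_Stop_Count"] - attrs["Power_Cycle_Count"]


def looks_suspicious(attrs, tolerance=10):
    # `tolerance` is an arbitrary illustrative threshold, not a standard value.
    return excess_spin_ups(attrs) > tolerance


def read_smart_attributes(device):
    # smartctl's exit status is a bit mask that can be non-zero even when
    # the attributes were printed, so the return code is not checked here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


# Rows copied from the two reports in this message.
SDA = """
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
"""
SDB = """
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       661
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
"""
```

With the values from my two disks, `excess_spin_ups` gives 2 for sda and 617 for sdb, so only sdb is flagged. On a live system one would call `read_smart_attributes("/dev/sdb")` as root from a cron job and alert when the gap grows between runs.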

Regards.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8


