Re: Failed during rebuild (raid5)

Phil Turmel <philip@xxxxxxxxxx> · Mon, 06 May 2013 08:36:34 -0400

Hi Andreas,

You dropped the list.  Please don't do that.  I added it back, and left
the end of the mail untrimmed so the list can see it.

On 05/06/2013 06:54 AM, Andreas Boman wrote:
> On 05/05/2013 11:21 PM, Phil Turmel wrote:
>> Hi Andreas,
>>
>> On 05/05/2013 01:16 PM, Andreas Boman wrote:
>>
>> [trim /]
>>
>>> Turns out the superblocks are there. I ran --examine on the disk instead
>>> of the partition. Oops.
>>
>> Please share the "--examine" reports for your array, and "smartctl -x"
>> for each disk, and anything from dmesg/syslog that relates to your array
>> or errors on its members.  (Your original post did say you would be able
>> to get log info.)
> 
> The --examine for the array (as it is now) and smartctl -x for the
> failed disk are at the end of this mail.
> 
> I pasted some log snippets here:  http://pastebin.com/iqnYje1W
> This should be the interesting part:
> 
> May  2 15:50:14 yggdrasil kernel: [    7.247383] md: md127 stopped.
> May  2 15:50:14 yggdrasil kernel: [    7.794697] raid5: allocated 5334kB for md127
> May  2 15:50:14 yggdrasil kernel: [    7.794843] md127: detected capacity change from 0 to 6001196793856
> May  2 15:50:14 yggdrasil kernel: [    7.796294]  md127: unknown partition table
> May  2 15:54:36 yggdrasil kernel: [  287.180692] md: recovery of RAID array md127
> May  2 22:40:26 yggdrasil kernel: [24637.888695] raid5:md127: read error not correctable (sector 884472576 on sda1).

Current versions of MD raid in the kernel allow multiple read errors per
hour before kicking out a drive.  What kernel and mdadm versions are
involved here?

> The disk was sda at the time and is sdb now. Don't ask why it reorders at
> times; I don't know. Sometimes the onboard boot disk is sda, sometimes it
> is the last disk, it seems.

You need to document the device names vs. drive S/Ns so you don't mess
up any "--create" operations.  This is one of the reasons "--create
--assume-clean" is so dangerous.

I recommend my own "lsdrv" @ github.com/pturmel.  But an excerpt from
"ls -l /dev/disk/by-id/" will do.

Use of LABEL= and UUID= syntax in fstab and during boot is intended to
mitigate the fact that the kernel cannot guarantee the order it finds
devices during boot.
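
If you'd rather script that bookkeeping, here is a rough Python sketch (not a
replacement for lsdrv) that resolves the symlinks under /dev/disk/by-id, so
each kernel name can be matched to a model and serial number.  The function
name and the by_id_dir parameter are made up for illustration.

```python
import os

def map_by_id(by_id_dir="/dev/disk/by-id"):
    """Return {resolved device path: [by-id names]} for every symlink found."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # by-id entries are symlinks; skip anything else
        target = os.path.realpath(link)  # e.g. /dev/sdb1
        mapping.setdefault(target, []).append(name)
    return mapping

# Example use on a live system:
#   for dev, names in sorted(map_by_id().items()):
#       print(dev, " ".join(names))
```

Save the output with the date somewhere off the array, so you can tell which
physical drive held which role even after the names shuffle again.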

> I tried to jump ddrescue to the end of the drive to ensure I get the md
> superblock and then live with some lost data after file system repair.
> 
> ./ddrescue -f -d -n -i1499889899520 /dev/sdb /dev/sdf /root/rescue.log
> 
> ^That is what I did (I tried to go further and further). These completed
> every time with no error, but no superblock was copied.

You have a v0.90 array.  The superblock is within 128k of the end of the
partition.
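
To make "within 128k of the end" concrete, here is a small Python sketch of
the v0.90 placement rule as I understand it: round the member size down to a
64 KiB boundary, then step back one more 64 KiB.  The offset is relative to
the start of the member device (the partition here, not the whole disk).  The
function names are mine, and you must supply your own partition size.

```python
MD_RESERVED_BYTES = 64 * 1024  # v0.90 keeps its superblock in a 64 KiB block

def v090_superblock_offset(device_bytes):
    """Byte offset of the v0.90 superblock inside a member of this size."""
    aligned = device_bytes & ~(MD_RESERVED_BYTES - 1)  # round down to 64 KiB
    return aligned - MD_RESERVED_BYTES

def distance_from_end(device_bytes):
    """How far before the end of the member the superblock starts."""
    return device_bytes - v090_superblock_offset(device_bytes)
```

The distance always lands between 64 KiB and just under 128 KiB, so a copy
that covers the last 128 KiB of the partition includes the superblock.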

> I'm guessing it's some kind of user error that prevents me from copying
> that superblock.

Yes, something destroyed it.

> I'm still trying to determine if bringing the array up (--assemble
> --force) using this disk with the missing data will be just bad or very
> bad? I've been told that mdadm doesn't care, but what will it do when
> data is missing in a chunk on this disk?

Presuming you mean while using the ddrescued copy, then any bad data
will show up in the array's files.  There's no help for that.
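
If you want to know where a bad member sector lands in the array, the
following Python sketch maps it for a left-symmetric raid5.  It assumes the
geometry from your --examine output (5 devices, 64K chunks, so 128 sectors
per chunk) and that data starts at sector 0 of the member, as with v0.90
metadata.  The function name is hypothetical.

```python
def member_to_array_sector(slot, member_sector, raid_disks=5, chunk_sectors=128):
    """Map a sector on one raid5 member to an array sector (left-symmetric).

    Returns None when the sector lies in that stripe's parity chunk.
    """
    stripe, offset = divmod(member_sector, chunk_sectors)
    # Parity starts on the last slot and moves one slot down per stripe.
    parity_slot = (raid_disks - 1) - (stripe % raid_disks)
    if slot == parity_slot:
        return None
    # Data chunks follow the parity chunk and wrap around the slots.
    data_index = (slot - parity_slot - 1) % raid_disks
    chunk_number = stripe * (raid_disks - 1) + data_index
    return chunk_number * chunk_sectors + offset
```

Under those assumptions, member_to_array_sector(2, 884472576) returns None,
which would put the sector MD complained about in a parity chunk.  That is
not harmless while another member is missing, since that parity is needed to
reconstruct the missing member's chunk in the same stripe.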

>>> After that is done I'll try to get the array up with 4 disks, then add
>>> the spare and have it rebuild. After that I'll add a disk to go to
>>> raid 6.
>>
>> It may be wiser to get it running degraded and take a backup, but that
>> remains to be seen.  You haven't shown that you know why the first
>> rebuild failed.  Until that is understood and addressed, you probably
>> won't succeed in rebuilding onto a spare.

You only shared one "smartctl -x" report.  Please show the others.  If
the others show pending sectors, you will have more difficulty after
rescuing sdb.  (You will need to use ddrescue on the other drives that
show pending sectors.)

/dev/sdb has six pending sectors--unrecoverable read errors that won't
be resolved until those sectors are rewritten.  They might be normal
transient errors that'll be fine after rewrite.  Or they might be
unwritable, and the drive will have to re-allocate them.  You need
regular "check" scrubs in a non-degraded array to catch these early and
fix them.
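
A scrub is started by writing "check" to the array's sync_action file in
sysfs.  Here is a minimal Python sketch of that; the helper names and the
sysfs_root parameter are mine, and it needs root on a real system.

```python
import os

def sync_action_path(md_name, sysfs_root="/sys/block"):
    """Path of the md sync_action control file, e.g. for 'md127'."""
    return os.path.join(sysfs_root, md_name, "md", "sync_action")

def start_check(md_name, sysfs_root="/sys/block"):
    """Ask MD to start a 'check' scrub of the named array."""
    with open(sync_action_path(md_name, sysfs_root), "w") as f:
        f.write("check\n")

def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read mismatch_cnt, which MD updates while a check runs."""
    path = os.path.join(sysfs_root, md_name, "md", "mismatch_cnt")
    with open(path) as f:
        return int(f.read().strip())
```

Many distributions ship a cron job that does the same thing on a schedule, so
check whether yours is installed and enabled before adding your own.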

Since ddrescue is struggling with this disk starting at 884471083, close
to the point where MD kicked it, you might have a large damage area that
can't be rewritten.
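
For reference, the failing LBAs can also be read out of the error log in your
smartctl report.  This Python sketch joins the six register bytes printed
under LBA_48, LH, LM and LL (most significant first) into one number.  The
function name is made up, and the byte strings are copied from the six error
entries at the end of this mail.

```python
def lba48_from_registers(hex_bytes):
    """Combine six register bytes, most significant first, into an LBA."""
    parts = hex_bytes.split()
    if len(parts) != 6:
        raise ValueError("expected six bytes: LBA_48 (3), then LH, LM, LL")
    lba = 0
    for b in parts:
        lba = (lba << 8) | int(b, 16)
    return lba

# Register bytes from errors 6 down to 1 in the report.
ERRORS = ["00 00 34 b7 f5 34", "00 00 34 b7 f5 35", "00 00 34 b7 f5 33",
          "00 00 34 b7 f5 31", "00 00 34 b7 f5 34", "00 00 34 b7 f5 2f"]
lbas = sorted({lba48_from_registers(e) for e in ERRORS})
```

These come out as 884471087 through 884471093, which agrees with the
self-test log.  Note that these are whole-disk LBAs, while the sector MD
reported is probably relative to the partition, so the two differ by the
partition's start offset (not shown in this thread).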

> I have been wondering about that. It would be difficult to do (not to
> mention I'd have to buy a bunch of large disks to back up to), but I am
> considering it.

Be careful selecting drives.  The Samsung drive has ERC--you really want
to pick drives that have it.
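
When comparing candidate drives, you can pull the ERC settings out of saved
"smartctl -x" output.  This is a Python sketch with a hypothetical function
name; it expects the raw smartctl text (without mail quoting) and assumes the
section looks like the one in your report.

```python
import re

def parse_scterc(smartctl_text):
    """Extract SCT ERC read/write timeouts (seconds) from smartctl output.

    Returns None if no 'SCT Error Recovery Control' section with numeric
    values is found (for example when ERC is unsupported or disabled).
    """
    match = re.search(
        r"SCT Error Recovery Control:\s*\n"
        r"\s*Read:\s*(\d+)\s*\([^)]*\)\s*\n"
        r"\s*Write:\s*(\d+)\s*\([^)]*\)",
        smartctl_text,
    )
    if match is None:
        return None
    # smartctl prints the raw values in units of 100 ms.
    return {"read": int(match.group(1)) / 10, "write": int(match.group(2)) / 10}
```

Your Samsung reports 7.0 seconds for both, so it gives up on a bad sector
well inside the kernel's default 30 second command timeout and MD gets a
clean read error to work with.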

If I understand correctly, your current plan is to ddrescue sdb, then
assemble degraded (with --force).  I agree with this plan, and I think
you should not need to use "--create --assume-clean".  You will need to
fsck the filesystem before you mount, and accept that some data will be
lost.  Be sure to remove sdb from the system after you've duplicated it,
as two drives with identical metadata will cause problems for MD.

Phil

> 
> Thanks,
> Andreas
> 
> 
> ---------------------metadata
> /dev/sdb1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : 60b8d5d0:00c342d3:59cb281a:834c72d9
>   Creation Time : Sun Oct  3 06:23:33 2010
>      Raid Level : raid5
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 127
> 
>     Update Time : Thu May  2 22:15:22 2013
>           State : clean
>  Active Devices : 4
> Working Devices : 5
>  Failed Devices : 1
>   Spare Devices : 1
>        Checksum : dd5e9120 - correct
>          Events : 1011948
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     2       8        1        2      active sync   /dev/sda1
> 
>    0     0       8       33        0      active sync   /dev/sdc1
>    1     1       0        0        1      faulty removed
>    2     2       8        1        2      active sync   /dev/sda1
>    3     3       8       49        3      active sync   /dev/sdd1
>    4     4       8       65        4      active sync   /dev/sde1
>    5     5       8       17        5      spare   /dev/sdb1
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : 60b8d5d0:00c342d3:59cb281a:834c72d9
>   Creation Time : Sun Oct  3 06:23:33 2010
>      Raid Level : raid5
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 127
> 
>     Update Time : Fri May  3 05:31:49 2013
>           State : clean
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 2
>   Spare Devices : 1
>        Checksum : dd5ef826 - correct
>          Events : 1012026
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     5       8       17        5      spare   /dev/sdb1
> 
>    0     0       8       33        0      active sync   /dev/sdc1
>    1     1       0        0        1      faulty removed
>    2     2       0        0        2      faulty removed
>    3     3       8       49        3      active sync   /dev/sdd1
>    4     4       8       65        4      active sync   /dev/sde1
>    5     5       8       17        5      spare   /dev/sdb1
> /dev/sdd1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : 60b8d5d0:00c342d3:59cb281a:834c72d9
>   Creation Time : Sun Oct  3 06:23:33 2010
>      Raid Level : raid5
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 127
> 
>     Update Time : Fri May  3 05:31:49 2013
>           State : clean
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 2
>   Spare Devices : 1
>        Checksum : dd5ef832 - correct
>          Events : 1012026
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     0       8       33        0      active sync   /dev/sdc1
> 
>    0     0       8       33        0      active sync   /dev/sdc1
>    1     1       0        0        1      faulty removed
>    2     2       0        0        2      faulty removed
>    3     3       8       49        3      active sync   /dev/sdd1
>    4     4       8       65        4      active sync   /dev/sde1
>    5     5       8       17        5      spare   /dev/sdb1
> /dev/sde1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : 60b8d5d0:00c342d3:59cb281a:834c72d9
>   Creation Time : Sun Oct  3 06:23:33 2010
>      Raid Level : raid5
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 127
> 
>     Update Time : Fri May  3 05:31:49 2013
>           State : clean
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 2
>   Spare Devices : 1
>        Checksum : dd5ef848 - correct
>          Events : 1012026
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     3       8       49        3      active sync   /dev/sdd1
> 
>    0     0       8       33        0      active sync   /dev/sdc1
>    1     1       0        0        1      faulty removed
>    2     2       0        0        2      faulty removed
>    3     3       8       49        3      active sync   /dev/sdd1
>    4     4       8       65        4      active sync   /dev/sde1
>    5     5       8       17        5      spare   /dev/sdb1
> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 0.90.00
>            UUID : 60b8d5d0:00c342d3:59cb281a:834c72d9
>   Creation Time : Sun Oct  3 06:23:33 2010
>      Raid Level : raid5
>   Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 127
> 
>     Update Time : Fri May  3 05:31:49 2013
>           State : clean
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 2
>   Spare Devices : 1
>        Checksum : dd5ef85a - correct
>          Events : 1012026
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>       Number   Major   Minor   RaidDevice State
> this     4       8       65        4      active sync   /dev/sde1
> 
>    0     0       8       33        0      active sync   /dev/sdc1
>    1     1       0        0        1      faulty removed
>    2     2       0        0        2      faulty removed
>    3     3       8       49        3      active sync   /dev/sdd1
>    4     4       8       65        4      active sync   /dev/sde1
>    5     5       8       17        5      spare   /dev/sdb1
> 
> 
> 
> ---------------------smartctl -x
> 
> 
>  smartctl -x /dev/sdb
> smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F2 EG series
> Device Model:     SAMSUNG HD154UI
> Serial Number:    S1Y6J1LZ100168
> Firmware Version: 1AG01118
> User Capacity:    1,500,301,910,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Mon May  6 05:41:26 2013 EDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x00)    Offline data collection activity
>                     was never started.
>                     Auto Offline Data Collection: Disabled.
> Self-test execution status:      ( 114)    The previous self-test completed having
>                     the read element of the test failed.
> Total time to complete Offline
> data collection:          (19591) seconds.
> Offline data collection
> capabilities:              (0x7b) SMART execute Offline immediate.
>                     Auto Offline data collection on/off support.
>                     Suspend Offline collection upon new
>                     command.
>                     Offline surface scan supported.
>                     Self-test supported.
>                     Conveyance Self-test supported.
>                     Selective Self-test supported.
> SMART capabilities:            (0x0003)    Saves SMART data before entering
>                     power-saving mode.
>                     Supports SMART auto save timer.
> Error logging capability:        (0x01)    Error logging supported.
>                     General Purpose Logging supported.
> Short self-test routine
> recommended polling time:      (   2) minutes.
> Extended self-test routine
> recommended polling time:      ( 255) minutes.
> Conveyance self-test routine
> recommended polling time:      (  34) minutes.
> SCT capabilities:            (0x003f)    SCT Status supported.
>                     SCT Error Recovery Control supported.
>                     SCT Feature Control supported.
>                     SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       17
>   3 Spin_Up_Time            0x0007   063   063   011    Pre-fail  Always       -       11770
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       155
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   100   097   015    Pre-fail  Offline      -       14926
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       378
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
>  13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       17
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       1
> 184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       18
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   075   068   000    Old_age   Always       -       25 (Lifetime Min/Max 17/32)
> 194 Temperature_Celsius     0x0022   075   066   000    Old_age   Always       -       25 (Lifetime Min/Max 17/34)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       463200379
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       6
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
> 
> General Purpose Logging (GPL) feature set supported
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> GP/S  Log at address 0x00 has    1 sectors [Log Directory]
> SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
> SMART Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
> GP    Log at address 0x03 has    2 sectors [Ext. Comprehensive SMART error log]
> GP    Log at address 0x04 has    2 sectors [Device Statistics]
> SMART Log at address 0x06 has    1 sectors [SMART self-test log]
> GP    Log at address 0x07 has    2 sectors [Extended self-test log]
> SMART Log at address 0x09 has    1 sectors [Selective self-test log]
> GP    Log at address 0x10 has    1 sectors [NCQ Command Error]
> GP    Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
> GP    Log at address 0x20 has    2 sectors [Streaming performance log]
> GP    Log at address 0x21 has    1 sectors [Write stream error log]
> GP    Log at address 0x22 has    1 sectors [Read stream error log]
> GP/S  Log at address 0x80 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x81 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x82 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x83 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x84 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x85 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x86 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x87 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x88 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x89 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8a has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8b has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8c has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8d has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8e has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x8f has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x90 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x91 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x92 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x93 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x94 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x95 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x96 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x97 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x98 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x99 has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9a has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9b has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9c has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9d has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9e has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0x9f has   16 sectors [Host vendor specific log]
> GP/S  Log at address 0xe0 has    1 sectors [SCT Command/Status]
> GP/S  Log at address 0xe1 has    1 sectors [SCT Data Transfer]
> 
> SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
> Device Error Count: 6
>     CR     = Command Register
>     FEATR  = Features Register
>     COUNT  = Count (was: Sector Count) Register
>     LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
>     LH     = LBA High (was: Cylinder High) Register    ]   LBA
>     LM     = LBA Mid (was: Cylinder Low) Register      ] Register
>     LL     = LBA Low (was: Sector Number) Register     ]
>     DV     = Device (was: Device/Head) Register
>     DC     = Device Control Register
>     ER     = Error register
>     ST     = Status register
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> 
> Error 6 [5] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 34 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 08 01 00 00 00 00 b7 f4 3f 40 00     06:51:19.350  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 f5 3f 40 00     06:51:19.350  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f7 3f 40 00     06:51:19.350  READ FPDMA QUEUED
>   60 00 70 01 00 00 00 00 b7 f6 3f 40 00     06:51:19.350  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f8 3f 40 00     06:51:19.350  READ FPDMA QUEUED
> 
> Error 5 [4] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 35 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 08 01 00 00 00 00 b7 fb 3f 40 00     06:51:13.030  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 fa 3f 40 00     06:51:13.030  READ FPDMA QUEUED
>   60 00 08 00 e8 00 00 00 b7 f9 57 40 00     06:51:13.030  READ FPDMA QUEUED
>   60 00 70 00 18 00 00 00 b7 f9 3f 40 00     06:51:13.030  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f8 3f 40 00     06:51:13.030  READ FPDMA QUEUED
> 
> Error 4 [3] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 33 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 08 01 00 00 00 00 b7 f4 3f 40 00     06:51:08.060  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 f5 3f 40 00     06:51:08.060  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f7 3f 40 00     06:51:08.060  READ FPDMA QUEUED
>   60 00 70 01 00 00 00 00 b7 f6 3f 40 00     06:51:08.060  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f8 3f 40 00     06:51:08.060  READ FPDMA QUEUED
> 
> Error 3 [2] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 31 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 08 01 00 00 00 00 b7 fb 3f 40 00     06:51:03.650  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 fa 3f 40 00     06:51:03.650  READ FPDMA QUEUED
>   60 00 08 00 e8 00 00 00 b7 f9 57 40 00     06:51:03.650  READ FPDMA QUEUED
>   60 00 70 00 18 00 00 00 b7 f9 3f 40 00     06:51:03.650  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f8 3f 40 00     06:51:03.650  READ FPDMA QUEUED
> 
> Error 2 [1] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 34 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 08 01 00 00 00 00 b7 f4 3f 40 00     06:50:58.240  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 f5 3f 40 00     06:50:58.240  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f7 3f 40 00     06:50:58.240  READ FPDMA QUEUED
>   60 00 70 01 00 00 00 00 b7 f6 3f 40 00     06:50:58.240  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f8 3f 40 00     06:50:58.240  READ FPDMA QUEUED
> 
> Error 1 [0] occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   00 -- 42 00 00 00 00 34 b7 f5 2f 40 00
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 00 01 00 00 00 00 b7 f4 3f 40 00     06:50:53.300  READ FPDMA QUEUED
>   60 00 08 01 00 00 00 00 b7 f3 3f 40 00     06:50:53.280  READ FPDMA QUEUED
>   60 00 00 01 00 00 00 00 b7 f2 3f 40 00     06:50:53.280  READ FPDMA QUEUED
>   60 00 08 00 e8 00 00 00 b7 f1 57 40 00     06:50:53.280  READ FPDMA QUEUED
>   60 00 00 00 18 00 00 00 b7 f1 3f 40 00     06:50:53.270  READ FPDMA QUEUED
> 
> SMART Extended Self-test Log Version: 1 (2 sectors)
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed: read failure       20%       375         884471093
> # 2  Short offline       Completed: read failure       20%       351         884471083
> # 3  Extended offline    Completed: read failure       90%       337         884471094
> # 4  Short offline       Completed: read failure       20%       333         884471093
> # 5  Short offline       Completed without error       00%       309         -
> # 6  Short offline       Completed without error       00%       285         -
> # 7  Short offline       Completed without error       00%       261         -
> # 8  Short offline       Completed without error       00%       237         -
> # 9  Short offline       Completed without error       00%       213         -
> #10  Extended offline    Completed: read failure       60%       203         884471093
> #11  Short offline       Completed without error       00%       190         -
> #12  Short offline       Completed without error       00%       166         -
> #13  Short offline       Completed without error       00%       142         -
> #14  Short offline       Completed without error       00%       118         -
> #15  Short offline       Completed without error       00%        94         -
> #16  Short offline       Completed without error       00%        70         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> SCT Status Version:                  2
> SCT Version (vendor specific):       256 (0x0100)
> SCT Support Level:                   1
> Device State:                        Active (0)
> Current Temperature:                 25 Celsius
> Power Cycle Max Temperature:         34 Celsius
> Lifetime    Max Temperature:         40 Celsius
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:     -4/72 Celsius
> Min/Max Temperature Limit:           -9/77 Celsius
> Temperature History Size (Index):    128 (113)
> 
> Index    Estimated Time   Temperature Celsius
>  114    2013-05-06 03:34    26  *******
>  115    2013-05-06 03:35    25  ******
>  116    2013-05-06 03:36    26  *******
>  ...    ..(  3 skipped).    ..  *******
>  120    2013-05-06 03:40    26  *******
>  121    2013-05-06 03:41    25  ******
>  122    2013-05-06 03:42    25  ******
>  123    2013-05-06 03:43    26  *******
>  124    2013-05-06 03:44    25  ******
>  ...    ..(  4 skipped).    ..  ******
>    1    2013-05-06 03:49    25  ******
>    2    2013-05-06 03:50    26  *******
>    3    2013-05-06 03:51    26  *******
>    4    2013-05-06 03:52    25  ******
>    5    2013-05-06 03:53    25  ******
>    6    2013-05-06 03:54    25  ******
>    7    2013-05-06 03:55    26  *******
>  ...    ..(  2 skipped).    ..  *******
>   10    2013-05-06 03:58    26  *******
>   11    2013-05-06 03:59    25  ******
>   12    2013-05-06 04:00    26  *******
>   13    2013-05-06 04:01    25  ******
>   14    2013-05-06 04:02    25  ******
>   15    2013-05-06 04:03    26  *******
>   16    2013-05-06 04:04    25  ******
>  ...    ..(  4 skipped).    ..  ******
>   21    2013-05-06 04:09    25  ******
>   22    2013-05-06 04:10    26  *******
>   23    2013-05-06 04:11    25  ******
>  ...    ..( 89 skipped).    ..  ******
>  113    2013-05-06 05:41    25  ******
> 
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2            7  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x0008  2            0  Device-to-host non-data FIS retries
> 0x0009  2            7  Transition from drive PhyRdy to drive PhyNRdy
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000d  2            0  Non-CRC errors within host-to-device FIS
> 0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
> 0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
> 0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
> 0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
> 
> 
> 
> 
