Re: fixing a box where the hard disc may have failed

On Thu, 2006-09-21 at 13:24 +0100, James Wilkinson wrote: 
> Since you say that this is a "scratch" test PC, I'd do a 
> smartctl -H /dev/hda 
> (which is probably what I should have told you in the first 
> place). If that says "PASSED", I'd do a combination of 
> dd if=/dev/zero of=/dev/hda 
> to blank the drive (that should remap all the bad sectors), and 
> dd if=/dev/hda of=/dev/null 
> to read them all back. Then check for any more errors. If you 
> don't get any, I'd trust the drive for testing purposes.
> Those dd commands will probably take several hours.

Um, no actually.  Under an hour, 'twas only a 15 gig drive.  I did a
quick test of seeing what would happen if I ran dd against the drive
that the computer had booted from.  Watched it working, went away, came
back to a black screen (about what I expected).  Then I took the drive
out and put it into another box; results below.

[root@box ~]# dd if=/dev/zero of=/dev/hdc
dd: writing to `/dev/hdc': Input/output error
23953097+0 records in
23953096+0 records out

Above is as I'd expect.  Below seems about right (same output count as
input, nearly the same number as worked above, and an error).  I'm not
sure at what stage a bad block gets mapped out of use.  In the past,
I'd have done that while prepping/formatting a drive.

[root@box ~]# dd if=/dev/hdc of=/dev/null
dd: reading `/dev/hdc': Input/output error
23952864+0 records in
23952864+0 records out
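
For anyone wanting to place those record counts on the disc, here is a
rough sketch of the arithmetic.  It assumes dd was left at its default
512-byte block size (no bs= was given above) and borrows the "User
Capacity" figure that smartctl reports further down; the variable names
are only for illustration.

```shell
#!/bin/bash
# Assumption: dd ran with its default block size of 512 bytes (no bs= given).
block_size=512
capacity_bytes=15393079296      # "User Capacity" from smartctl -a
written=23953096                # "records out" from the write pass
read_ok=23952864                # "records out" from the read pass

total_sectors=$(( capacity_bytes / block_size ))
write_stop_bytes=$(( written * block_size ))

# Integer percentage of the disc covered before each pass gave up.
write_pct=$(( written * 100 / total_sectors ))
read_pct=$(( read_ok * 100 / total_sectors ))

echo "disc holds $total_sectors sectors"
echo "write pass stopped after $written sectors ($write_stop_bytes bytes, about $write_pct%)"
echo "read pass stopped after $read_ok sectors (about $read_pct%)"
```

On those assumptions neither pass reached the end of the disc: both
gave up roughly four fifths of the way through, so the sectors beyond
that point were never written or read back.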

Then did a "smartctl -t short /dev/hdc", looked at the results, then a
"smartctl -t long /dev/hdc"; results after both are further below.  The
basic health check showed fine:

[root@box ~]# smartctl -H /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is

SMART overall-health self-assessment test result: PASSED

So that looks okay.  But the "smartctl -a /dev/hdc" is less inspiring:

[root@box ~]# smartctl -a /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is

Device Model:     WDC WD153AA-00BAA0
Serial Number:    WD-WMA2L2483801
Firmware Version: 10.09K11
User Capacity:    15,393,079,296 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 23 19:13:27 2006 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (1040) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  14) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x000b   197   098   051    Pre-fail  Always       -       45
  3 Spin_Up_Time            0x0006   109   104   000    Old_age   Always       -       1150
  4 Start_Stop_Count        0x0012   098   098   040    Old_age   Always       -       2524
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
  9 Power_On_Hours          0x0012   065   065   000    Old_age   Always       -       26136
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   098   098   000    Old_age   Always       -       2297
196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 572 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 572 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 51 18 cd 7e 6d e1  Error: UNC 24 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 18 c8 7e 6d e1 00      00:57:28.650  READ DMA
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA

Error 571 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 51 20 cd 7e 6d e1  Error: UNC 32 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA

Error 570 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 51 28 cd 7e 6d e1  Error: UNC 40 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA

Error 569 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 51 30 cd 7e 6d e1  Error: UNC 48 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA

Error 568 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  40 51 38 cd 7e 6d e1  Error: UNC 56 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA
  c8 00 58 88 7e 6d e1 00      00:56:41.550  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      1014         23953101
# 2  Short offline       Completed: read failure       90%      1013         23953101
# 3  Extended offline    Completed: read failure       30%       990         23953101
# 4  Short offline       Completed without error       00%       990         -
# 5  Short offline       Completed without error       00%       327         -
# 6  Short offline       Completed without error       00%        93         -
# 7  Short captive       Completed without error       00%         0         -

Device does not support Selective Self Tests/Logging
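
As a cross-check on the error log, here is a small sketch of how the LBA
in each "Error: UNC" line can be rebuilt from the registers printed
beside it.  It assumes 28-bit LBA addressing, where the low four bits of
DH carry the top of the address and CH, CL and SN the rest; the variable
names are only for illustration.

```shell
#!/bin/bash
# Registers after error 572, printed in the order ER ST SC SN CL CH DH.
# Assumption: 28-bit LBA addressing (READ DMA, command 0xc8).
sn=0xcd; cl=0x7e; ch=0x6d; dh=0xe1

# LBA bits 27-24 sit in the low nibble of DH, followed by CH, CL and SN.
lba=$(( (($dh & 0x0f) << 24) | ($ch << 16) | ($cl << 8) | $sn ))

# The READ DMA that failed: c8 00 18 c8 7e 6d e1 (SC=0x18, SN=0xc8).
count=$(( 0x18 ))
start=$(( ((0xe1 & 0x0f) << 24) | (0x6d << 16) | (0x7e << 8) | 0xc8 ))

printf 'bad LBA      = 0x%08x = %d\n' "$lba" "$lba"
printf 'read started = %d, %d sectors requested\n' "$start" "$count"
```

That gives 23953101 for the bad sector, the same figure smartctl prints
and the same LBA_of_first_error the self-tests report, with the failing
read having started five sectors earlier at 23953096.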

Tests #1 & #2 are from after the dd experiment; the rest are from
before.  A quick perusal of the information doesn't give me any clues as
to what the Remaining and LifeTime columns mean.  Predicted failure
time, uptime?
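
On those two columns: as far as I can tell from the smartctl man page,
Remaining is the percentage of the test that was still to run when it
stopped, and LifeTime(hours) is the drive's power-on hour count when the
test was run.  On that reading, a rough sketch for pulling the failed
tests out of the log looks like this (the sample rows are pasted from
the log above; on a live system you would presumably pipe in
"smartctl -l selftest /dev/hdc" instead):

```shell
#!/bin/bash
# Sample rows copied from the self-test log above.
log='# 1  Extended offline    Completed: read failure       90%      1014         23953101
# 2  Short offline       Completed: read failure       90%      1013         23953101
# 3  Extended offline    Completed: read failure       30%       990         23953101
# 4  Short offline       Completed without error       00%       990         -'

# Strip the leading "#", then read the last three columns:
# Remaining, LifeTime(hours) and LBA_of_first_error.
# Rows whose last column is "-" completed without error and are skipped.
failed=$(printf '%s\n' "$log" | awk '
    /^#/ {
        sub(/^# */, "")
        if ($NF != "-")
            printf "test %s: %s left to run, at %s power-on hours, first bad LBA %s\n", $1, $(NF-2), $(NF-1), $NF
    }')

echo "$failed"
```

Working from the end of each row keeps the sketch independent of how
many words the Status column happens to contain.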

(Currently running FC4, occasionally trying FC5.)

Don't send private replies to my address, the mailbox is ignored.
I read messages from the public lists.

