Re: Brocken Raid & LUKS

Phil Turmel <philip@xxxxxxxxxx> · Tue, 19 Feb 2013 19:31:43 -0500

You forgot to include linux-raid again.  I'm adding them back to the
CC:.  Please always use "reply to all" in your email client.

I will look for your detailed reply tomorrow.

Phil

On 02/19/2013 05:23 PM, Stone wrote:
> On 19.02.2013 23:08, Phil Turmel wrote:
>> On 02/19/2013 04:31 PM, Stone wrote:
>>
>> [trim /]
>>
>>>> [trim /]
>>> OK. My system is Ubuntu 12.04.
>>> I can install an older mdadm, or install an older Ubuntu such as
>>> 11.04, which ships an older mdadm.
>> Using the older Ubuntu as a LiveCD should be fine--you don't have to
>> uninstall your current system.
>>
>> [trim /]
>>
>>> OK, here are my next steps.
>>> I find an older mdadm, or I install an older Ubuntu that ships an
>>> older mdadm.
>>> Then I stop my md2 device and re-create it with:
>>> mdadm --create /dev/md2 --assume-clean --verbose --level=5 \
>>>   --raid-devices=4 /dev/sdc1 /dev/sdd1 missing /dev/sdf1
>> Yes.  But read all the way through first....
>>
>>> With a bit of luck I can then open the device.
>> But *don't* mount it!  Use "fsck -n" after you open it to verify it is
>> Ok.  If you mount it, and the chunk size is wrong, it will damage your
>> encrypted filesystem.
>>
>>> If not, I stop md2 and re-create it with what? With the chunk
>>> parameter? And with what value? Do you have a range for me?
>> The current default chunk size is 512K.  The old default was 64K.
>> I'd try that if 512K doesn't work.  After that you'll have to guess.
> OK, I will test this tomorrow.
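
The try-512K-then-64K order discussed above can be sketched as a dry run.
This is only an illustration: it prints the commands for each candidate
chunk size and executes nothing. The mapping name `md2_crypt` is
hypothetical, and opening the LUKS device with `--readonly` is an extra
precaution assumed here, not something stated in the thread.

```shell
#!/bin/sh
# Dry run: prints one candidate command sequence per chunk size; runs nothing.
MD=/dev/md2
MAPPING=md2_crypt                                 # hypothetical dm-crypt mapping name
MEMBERS="/dev/sdc1 /dev/sdd1 missing /dev/sdf1"   # device order as given in the thread

plan_for_chunk() {
    # $1 = chunk size in KiB
    echo "mdadm --stop $MD"
    echo "mdadm --create $MD --assume-clean --verbose --level=5 --chunk=$1 --raid-devices=4 $MEMBERS"
    echo "cryptsetup --readonly luksOpen $MD $MAPPING"
    # Check only; never mount while the chunk size is still a guess.
    echo "fsck -n /dev/mapper/$MAPPING"
    # Only needed if luksOpen succeeded; the array cannot be stopped while mapped.
    echo "cryptsetup luksClose $MAPPING"
}

for chunk in 512 64; do
    echo "# --- chunk=${chunk}K ---"
    plan_for_chunk "$chunk"
done
```

If `fsck -n` reports a clean filesystem for one of the candidates, that
chunk size is very likely the right one and the loop can stop there.
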
>>> Here is the timeout info:
>>> for x in /sys/block/sd*/device/timeout ; do echo $x ; cat $x ; done
>>> /sys/block/sda/device/timeout
>>> 30
>>> /sys/block/sdb/device/timeout
>>> 30
>>> /sys/block/sdc/device/timeout
>>> 30
>>> /sys/block/sdd/device/timeout
>>> 30
>>> /sys/block/sde/device/timeout
>>> 30
>>> /sys/block/sdf/device/timeout
>>> 30
>> Ok, these are all at the Linux default of 30 seconds.
>>
>>> Here is the SMART info:
>> Uh oh.  Two serious issues:
>>
>>> smartctl -x /dev/sdc1
>>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-23-generic] (local build)
>>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>> [trim /]
>>
>>>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>>>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>>>   9 Power_On_Hours          -O--CK   078   078   000    -    16219
>>>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>>>  11 Calibration_Retry_Count -O--CK   100   253   000    -    0
>>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    84
>>> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    82
>>> 193 Load_Cycle_Count        -O--CK   169   169   000    -    94419
>>> 194 Temperature_Celsius     -O---K   114   106   000    -    36
>>> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
>>> 197 Current_Pending_Sector  -O--CK   200   200   000    -    2
>> Serious issue #1:
>>
>> You have unreadable sectors on sdc.  When you hit them during rebuild,
>> sdc will be kicked out (again).  They might not be permanent errors, but
>> you can't tell until the drive is given fresh data to write over them.
>>
>> You have two choices:
>>
>> 1) use ddrescue to copy sdc onto a new drive, then use it in place of
>> sdc when you re-create the array, or
>>
>> 2) use badblocks to find the exact locations of the bad sectors, then
>> write zeros to those sectors using dd.
>>
>> Either way, you have lost whatever those sectors used to hold.
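
The two choices above can be sketched as a dry run that prints the
commands instead of running them. Several names here are assumptions
for illustration: `/dev/sdX` stands for the replacement drive, `sdc.map`
is a made-up ddrescue mapfile name, and the block number passed at the
end is invented. The 4096-byte block size is also an assumption, chosen
so that one block covers a whole physical sector on a 4K-sector drive.

```shell
#!/bin/sh
# Dry run: prints the commands for both recovery choices; runs nothing.
SRC=/dev/sdc     # drive with pending sectors, from the thread
DST=/dev/sdX     # hypothetical replacement drive
MAP=sdc.map      # hypothetical ddrescue mapfile
BS=4096          # assumed block size in bytes

# Choice 1: copy with ddrescue. Fast pass first, then retry the bad areas.
ddrescue_plan() {
    echo "ddrescue -f -n $SRC $DST $MAP"
    echo "ddrescue -f -d -r3 $SRC $DST $MAP"
}

# Choice 2a: locate unreadable blocks (badblocks is read-only by default).
badblocks_cmd() {
    echo "badblocks -sv -b $BS $SRC"
}

# Choice 2b: overwrite one reported block with zeros.
zero_block_cmd() {
    # $1 = block number as reported by badblocks, in units of $BS bytes
    echo "dd if=/dev/zero of=$SRC bs=$BS seek=$1 count=1 oflag=direct"
}

ddrescue_plan
badblocks_cmd
zero_block_cmd 123456    # 123456 is a made-up block number
```

The block size given to badblocks and to dd has to match, otherwise the
`seek=` offset points at the wrong place on the disk.
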
>>
>> [trim /]
> Yes, these are cheap WD Green drives. I have 4 new, better drives here
> that I will use instead. That means I will get the RAID running and
> then copy all the data onto the new drives.
>>> SCT Status Version:                  3
>>> SCT Version (vendor specific):       258 (0x0102)
>>> SCT Support Level:                   1
>>> Device State:                        Active (0)
>>> Current Temperature:                    36 Celsius
>>> Power Cycle Min/Max Temperature:     33/37 Celsius
>>> Lifetime    Min/Max Temperature:     33/44 Celsius
>>> Under/Over Temperature Limit Count:   0/0
>>> SCT Temperature History Version:     2
>>> Temperature Sampling Period:         1 minute
>>> Temperature Logging Interval:        1 minute
>>> Min/Max recommended Temperature:      0/60 Celsius
>>> Min/Max Temperature Limit:           -41/85 Celsius
>>> Temperature History Size (Index):    478 (314)
>>>
>>> Index    Estimated Time   Temperature Celsius
>>>   315    2013-02-19 14:26    36  *****************
>>>   ...    ..(476 skipped).    ..  *****************
>>>   314    2013-02-19 22:23    36  *****************
>>>
>>> Warning: device does not support SCT Error Recovery Control command
>> Serious issue #2:
>>
>> Error timeout mismatch.  Your cheap drives do not support Error Recovery
>> Control.  That means when they run into unreadable sectors, they will
>> spend a couple minutes trying "extra hard" to get the data.
>>
>> But linux is only going to wait 30 seconds.  Then it will reset the SATA
>> link and try again.  But the drive will *not* give up its error recovery
>> effort, and will not even *talk* to the linux driver in the meantime, so
>> the linux driver will disconnect the drive and report errors for all
>> remaining requests.  This will cause MD to kick the drive out.
>>
>> You only have one choice:
>>
>> 1) Set a long timeout in the linux drivers for the drives in your array,
>> on every boot.  Something like:
>>
>> for x in /sys/block/sd[cdef]/device/timeout ; do echo 180 >$x ; done
>>
>> If you had slightly better drives, SCTERC would be supported.  On
>> desktop drives at power up, it is disabled.  But you would be able to
>> enable a normal 7.0 second timeout in the drives using smartctl.  (In a
>> script, on every boot up.)  Enterprise "raid" drives do this by default.
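
The two cases described above (drives with and without SCTERC) can be
combined in one boot-time sketch: try to set a 7.0 second limit in the
drive, and fall back to the 180 second kernel timeout if that fails.
This assumes smartctl exits non-zero when the drive rejects the SCT ERC
command. The `SYSFS_ROOT` and `SMARTCTL` variables are hypothetical
overrides added only so the function can be exercised without real
drives.

```shell
#!/bin/sh
# Sketch: prefer the in-drive 7.0 s limit; otherwise lengthen the kernel timeout.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}    # hypothetical override, for testing only
SMARTCTL=${SMARTCTL:-smartctl}    # hypothetical override, for testing only

fix_timeouts() {
    # Arguments: kernel device names such as sdc sdd
    for name in "$@"; do
        # scterc values are in tenths of a second: 70 = 7.0 s (read, write).
        if "$SMARTCTL" -l scterc,70,70 "/dev/$name" >/dev/null 2>&1; then
            echo "$name: SCT ERC set to 7.0 s"
        else
            echo 180 > "$SYSFS_ROOT/block/$name/device/timeout" &&
                echo "$name: kernel timeout set to 180 s"
        fi
    done
}

# On the system from the thread this would be run as root on every boot:
#   fix_timeouts sdc sdd sde sdf
```

Neither setting survives a power cycle, so the call has to be repeated
from a startup script on every boot.
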
>>
>> [trim /]
>>
>>> smartctl -x /dev/sdd1
>>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-23-generic] (local build)
>>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>> [trim /]
>>
>>> SMART Attributes Data Structure revision number: 16
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    534
>>>   3 Spin_Up_Time            POS--K   172   171   021    -    6383
>>>   4 Start_Stop_Count        -O--CK   100   100   000    -    586
>>>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    2
>> You already have two reallocated sectors on this drive.
>>
>>>   7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
>>>   9 Power_On_Hours          -O--CK   085   085   000    -    11487
>> In less than two years.  You should pay close attention to this.
>>
>> Phil
> I think I must learn to interpret the SMART values better.
> Thank you.
> I will send you my new info with the older mdadm version tomorrow.
