Re: Help raid10 recovery from 2 disks removed

Phil Turmel <philip@xxxxxxxxxx> · Thu, 24 Oct 2013 08:16:50 -0400

Good morning,

On 10/24/2013 06:14 AM, yuji_touya@xxxxxxxxxxxxxxxxxxxx wrote:
> Mikael,

[trim /]

>> You need to figure out what happened to get sdb kicked out of the array,
>> check logs and "dmesg". Also use smartctl to check sdb and see if it's
>> failing.

[trim /]

> Device Model:     ST2000DM001-9YN164

If I recall correctly, this model doesn't support SCT error recovery
control (scterc).  If you haven't raised your kernel driver timeouts to
compensate, that explains your situation.
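
A rough sketch of the check being described, in Python.  It assumes the
usual Linux sysfs layout, where the driver's command timeout for a disk
lives in /sys/block/<disk>/device/timeout.  The function names and the
180-second threshold are illustrative assumptions, not something taken
from this thread.

```python
from pathlib import Path

# Assumption for this sketch: a drive without ERC may retry a bad sector for
# a couple of minutes, so 180 s is used as an illustrative safe driver timeout.
ASSUMED_SAFE_TIMEOUT_S = 180


def driver_timeout_seconds(disk, sys_block="/sys/block"):
    """Read the kernel's command timeout (seconds) for a disk such as 'sdb'."""
    return int(Path(sys_block, disk, "device", "timeout").read_text().strip())


def timeouts_mismatched(drive_erc_limit_s, driver_timeout_s):
    """Return True if the kernel may give up before the drive stops retrying.

    drive_erc_limit_s is the drive's ERC limit in seconds, or None when the
    drive has no ERC support and its retry time is therefore unbounded.
    """
    if drive_erc_limit_s is None:
        return driver_timeout_s < ASSUMED_SAFE_TIMEOUT_S
    return drive_erc_limit_s >= driver_timeout_s


if __name__ == "__main__":
    # A drive without ERC behind a 30 s driver timeout (the usual Linux default).
    print(timeouts_mismatched(None, 30))
```

On a real system the drive side of the comparison would come from
"smartctl -l scterc /dev/sdX", and the driver side from
driver_timeout_seconds("sdX").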

> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   115   097   006    Pre-fail  Always       -       88125160
>   3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       14
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

No reallocations...

> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       112
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       112

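But many sectors waiting for rewrite (which will either fix them or
reallocate them).  Rewrites can't succeed in normal MD operation with
mismatched timeouts: the kernel driver gives up and resets the drive
while it is still retrying the read, so MD's corrective write fails and
the drive gets kicked.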
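
To make the reading of those attributes concrete, here is a small
Python sketch that pulls the raw values out of a "smartctl -A" style
table (including mail-quoted lines like the ones above) and reports
the pending and reallocated counts.  The function names are
hypothetical, and the parsing assumes the ten-column layout shown in
the quoted output.

```python
SAMPLE = """\
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   115   097   006    Pre-fail  Always       -       88125160
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       112
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       112
"""


def parse_smart_raw_values(text):
    """Map attribute name -> integer raw value for each attribute row."""
    raw = {}
    for line in text.splitlines():
        # Drop mail quoting and indentation, then split into the ten columns.
        fields = line.lstrip("> \t").split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header line or anything that is not an attribute row
        try:
            raw[fields[1]] = int(fields[9].split()[0])
        except ValueError:
            continue  # raw value is not a plain integer; skip it
    return raw


def summarize(raw):
    """Describe the sector counts that matter for the discussion above."""
    pending = raw.get("Current_Pending_Sector", 0)
    realloc = raw.get("Reallocated_Sector_Ct", 0)
    return f"{realloc} reallocated, {pending} pending rewrite"


if __name__ == "__main__":
    print(summarize(parse_smart_raw_values(SAMPLE)))
```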

If you search the archives for various combinations of "scterc",
"timeout mismatch", "URE" and "error recovery", you'll find numerous
discussions of this problem and ways to mitigate it.  (More like horror
stories, to be honest.)  Most importantly, plan to buy RAID-capable
drives in the future.

HTH,

Phil

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html