Re: RAID1 seems not to be able to scrub pending sectors shown by smart

Phil Turmel <philip@xxxxxxxxxx> · Sat, 24 Dec 2011 09:27:45 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Philip,

On 12/24/2011 05:07 AM, Philip Hands wrote:
[...]
> Last night I started a check of the RAID that contained most of the errors on
> that disk, and it's pretty much finished (81%), in which time the Pending
> sector count is back up to 53. [Erm, 83% and 54 now -- while writing
> this mail]
> 
> Clearly it's not a particularly happy drive, so I guess that smart will
> eventually diagnose it as faulty, but in the mean time it may be a
> useful test case for mdadm.
> 
> One of those newly pending sectors was found almost immediately, as I
> was able to see from the logs, and while that was being dealt with, it
> drove the system load up to about 18, and rendered the system
> unresponsive for at least 10 seconds, probably more like 20 or 30 (the
> normal load once it had a chance to settle down again was about 2, on a 6
> core CPU, so it wasn't really that busy).
> 
> [84% and 55 pending now -- with the first indication being a spike in
> load, followed a minute or two later by mention of the read problems in
> the logs, but apparently nothing logged by md, so presumably the read
> eventually succeeded]
> 
>> I wonder if a patch might be possible that allows one to put an array 
>> into a mode (or go into said mode once a badblock condition has 
>> happened) that causes it to read from at least 2 possible data sources 
>> and return whichever gets there first...
> 
> Well, given that something appears to be blocking in a fairly
> disastrous way on the read that's not coming back, I was wondering if
> there might be some way of having a timeout on those reads that if one
> gets no response for long enough (say 10 seconds) reacts by getting the
> data from elsewhere, and overwriting the slow sector.

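Have you enabled SCT ERC (Error Recovery Control, marketed as TLER on some
drives) on these drives?  I suspect you haven't, as long stalls like these on
read errors are typical of the default error handling on consumer drives,
which can keep retrying a bad sector internally well before reporting failure.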

Can you show the complete "smartctl -x" output for this failing drive?
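
For reference, here is a minimal sketch of the commands involved in checking
and setting SCT ERC.  It only builds and prints the command lines and runs
nothing.  The helper names and the /dev/sdX placeholder are hypothetical.  The
7.0 second value and the option of raising the kernel timeout are commonly
used choices that I am assuming here, not something measured on this system,
and a drive may not support SCT ERC at all.

```python
import sys


def scterc_query_cmd(device):
    """Command that reports the drive's current SCT ERC read/write timeouts."""
    return ["smartctl", "-l", "scterc", device]


def scterc_set_cmd(device, read_ds=70, write_ds=70):
    """Command that sets the SCT ERC timeouts.

    smartctl takes the values in units of 100 ms, so 70 means 7.0 seconds.
    The default of 70 is an assumption, chosen to stay well below the
    kernel's per-command timeout.
    """
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def kernel_timeout_path(device):
    """sysfs file holding the kernel's command timeout (seconds) for a disk."""
    name = device.rsplit("/", 1)[-1]
    return f"/sys/block/{name}/device/timeout"


if __name__ == "__main__":
    # /dev/sdX is a placeholder; pass the real device as the first argument.
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdX"
    print(" ".join(scterc_query_cmd(dev)))
    print(" ".join(scterc_set_cmd(dev)))
    # If the drive rejects SCT ERC, the usual alternative is to raise the
    # value in this file instead, so the kernel outwaits the drive's retries.
    print(kernel_timeout_path(dev))
```

Run the printed commands by hand as root.  On many drives the setting does not
survive a power cycle, so it usually has to be reapplied at boot.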

Phil
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk714VwACgkQBP+iHzflm3BXmACffzNuNvh98KueHKUL06e9Ultj
ETcAn20P84PxbN3n6K0BlDoNsMpg1+2n
=2gBn
-----END PGP SIGNATURE-----