On Wed, 28 Apr 2010, Neil Brown wrote:
> > I think I can see a problem here:
> >
> > You had 11 active devices out of 12 when you received the read error.
> > At 11 devices out of 12 your array is singly degraded, and that should
> > be enough for raid6 to recompute the block from parity and perform the
> > rewrite, correcting the read error. Instead, MD declared the error
> > impossible to correct and dropped one more device, leaving the array
> > doubly degraded.
> >
> > I think this is an MD bug, and I think I know where it is:
> >
> > --- linux-2.6.33-vanilla/drivers/md/raid5.c	2010-02-24 19:52:17.000000000 +0100
> > +++ linux-2.6.33/drivers/md/raid5.c	2010-04-27 23:58:31.000000000 +0200
> > @@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc
> >  		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
> >  		atomic_inc(&rdev->read_errors);
> > -		if (conf->mddev->degraded)
> > +		if (conf->mddev->degraded == conf->max_degraded)
> >  			printk_rl(KERN_WARNING
> >  				  "raid5:%s: read error not correctable "
> >  				  "(sector %llu on %s).\n",
> > ------------------------------------------------------
> >
> > (This is just compile-tested, so try at your own risk.)
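> >
> > To see what this changes for the reported case, here is a minimal
> > userspace sketch of the test (not kernel code; degraded and
> > max_degraded simply mirror the raid5.c fields, with max_degraded == 2
> > for raid6):
> >
> > #include <stdio.h>
> >
> > /* Mirror of the patched condition: a read error is declared
> >  * uncorrectable only when the array is already maximally degraded. */
> > static int uncorrectable(int degraded, int max_degraded)
> > {
> > 	return degraded == max_degraded;
> > }
> >
> > int main(void)
> > {
> > 	/* Reported case: raid6, 11 of 12 devices, so degraded == 1.
> > 	 * The old test (degraded != 0) gave up here; the patched test
> > 	 * lets the parity rewrite go ahead. */
> > 	printf("singly degraded raid6: %d\n", uncorrectable(1, 2)); /* 0 */
> > 	printf("doubly degraded raid6: %d\n", uncorrectable(2, 2)); /* 1 */
> > 	return 0;
> > }
> >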
> > I'd like to hear what Neil thinks of this...
>
> I think you've found a real bug - thanks.
>
> I would make the test '>=' rather than '==', as that is safer; otherwise I
> agree:
> -	if (conf->mddev->degraded)
> +	if (conf->mddev->degraded >= conf->max_degraded)
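>
> As a quick illustration of the difference (the same style of userspace
> sketch as above, not kernel code): if degraded ever exceeded
> max_degraded, '==' would fall through and treat the error as
> correctable, while '>=' still gives up as it should:
>
> #include <assert.h>
>
> static int uncorrectable(int degraded, int max_degraded)
> {
> 	return degraded >= max_degraded;
> }
>
> int main(void)
> {
> 	assert(uncorrectable(2, 2));  /* raid6 at its failure limit */
> 	assert(uncorrectable(3, 2));  /* beyond the limit: '==' would miss this */
> 	assert(!uncorrectable(1, 2)); /* singly-degraded raid6 can still rewrite */
> 	return 0;
> }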
If raid6 handling can reach this code path, could I also point out that
the message says "raid5", which is confusing if it's actually referring
to a degraded raid6?
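
Something like this would disambiguate it (an untested sketch of the
same printk; it assumes conf->level carries the real level at this
point and that the other arguments stay as they are in 2.6.33):

	printk_rl(KERN_WARNING
		  "md/raid%d:%s: read error not correctable "
		  "(sector %llu on %s).\n",
		  conf->level, mdname(conf->mddev),
		  (unsigned long long)(sh->sector + rdev->data_offset),
		  bdn);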
--
Mikael Abrahamsson email: swmike@xxxxxxxxx