Re: Suggestion needed for fixing RAID6

On 04/27/2010 05:50 PM, Janos Haar wrote:

----- Original Message ----- From: "MRK" <mrk@xxxxxxxxxxxxx>
To: "Janos Haar" <janos.haar@xxxxxxxxxxxx>
Cc: <linux-raid@xxxxxxxxxxxxxxx>
Sent: Monday, April 26, 2010 6:53 PM
Subject: Re: Suggestion needed for fixing RAID6


On 04/26/2010 02:52 PM, Janos Haar wrote:

Oops, you are right!
It was my mistake.
Sorry, I will try it again, with dm-cow covering both drives.

Great! Post the results here, the dmesg output in particular.
The dmesg should contain multiple lines like "raid5:md3: read error corrected ....." -
then you know it worked.

I am afraid I am still right about that...

...
end_request: I/O error, dev sdh, sector 1667152256
raid5:md3: read error not correctable (sector 1662188168 on dm-1).
raid5: Disk failure on dm-1, disabling device.
raid5: Operation continuing on 10 devices.

I think I can see a problem here:
You had 11 active devices out of 12 when you received the read error.
With 11 of 12 devices the array is singly-degraded, which should still be enough for raid6 to recompute the block from parity and rewrite it, correcting the read error. Instead, MD declared the error impossible to correct and dropped one more device, going doubly-degraded.

I think this is an MD bug, and I think I know where it is:


--- linux-2.6.33-vanilla/drivers/md/raid5.c 2010-02-24 19:52:17.000000000 +0100
+++ linux-2.6.33/drivers/md/raid5.c     2010-04-27 23:58:31.000000000 +0200
@@ -1526,7 +1526,7 @@ static void raid5_end_read_request(struc

                clear_bit(R5_UPTODATE, &sh->dev[i].flags);
                atomic_inc(&rdev->read_errors);
-               if (conf->mddev->degraded)
+               if (conf->mddev->degraded == conf->max_degraded)
                        printk_rl(KERN_WARNING
                                  "raid5:%s: read error not correctable "
                                  "(sector %llu on %s).\n",

------------------------------------------------------
(This is only compile-tested, so try it at your own risk.)

I'd like to hear what Neil thinks of this...

The problem here (apart from the erroneous error message) is that if execution enters that "if" clause, it eventually reaches the md_error() call some 30 lines further down, which drops one further device, worsening the situation instead of recovering from it. As far as I understand, that is not the correct behaviour in this case. In its current state, raid6 behaves as if it were raid5, effectively tolerating only one failed disk.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
