Re: md_raid5 using 100% CPU and hang with status resync=PENDING, if a drive is removed during initialization

On Tue, 17 Feb 2015 20:07:24 -0500 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
wrote:

> Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> writes:
> > NeilBrown <neilb@xxxxxxx> writes:
> >> On Tue, 17 Feb 2015 19:03:30 -0500 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> >> wrote:
> >>
> >>> Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> writes:
> >>> > Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> writes:
> >>> >> NeilBrown <neilb@xxxxxxx> writes:
> >>> >>> On Mon, 2 Feb 2015 07:10:14 +0000 Manibalan P <pmanibalan@xxxxxxxxxxxxxx>
> >>> >>> wrote:
> >>> >>>
> >>> >>>> Dear All,
> >>> >>>> 	Any updates on this issue.
> >>> >>>
> >>> >>> Probably the same as:
> >>> >>>
> >>> >>>   http://marc.info/?l=linux-raid&m=142283560704091&w=2
> >>> >>
> >>> >> Hi Neil,
> >>> >>
> >>> >> I ran some tests on this one against Linus' latest tree as of today
> >>> >> (1fa185ebcbcefdc5229c783450c9f0439a69f0c1) which I believe includes all
> >>> >> your pending 3.20 patches.
> >>> >>
> >>> >> I am able to reproduce Manibalan's hangs on a system with 4 SSDs if I
> >>> >> run fio on top of a device while it is resyncing and I fail one of the
> >>> >> devices.
> >>> >
> >>> > Since Manibalan mentioned this issue wasn't present in earlier kernels,
> >>> > I started trying to track down what change caused it.
> >>> >
> >>> > So far I have been able to reproduce the hang as far back as 3.10.
> >>> 
> >>> After a lot of bisecting I finally traced the issue back to this commit:
> >>> 
> >>> a7854487cd7128a30a7f4f5259de9f67d5efb95f is the first bad commit
> >>> commit a7854487cd7128a30a7f4f5259de9f67d5efb95f
> >>> Author: Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> >>> Date:   Thu Oct 11 13:50:12 2012 +1100
> >>> 
> >>>     md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
> >>>     
> >>>     Signed-off-by: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
> >>>     Suggested-by: Yair Hershko <yair@xxxxxxxxxxxxxxxxx>
> >>>     Signed-off-by: NeilBrown <neilb@xxxxxxx>
> >>> 
> >>> If I revert that one I cannot reproduce the hang; applying it reproduces
> >>> the hang consistently.
> >>
> >> Thanks for all the research!
> >>
> >> That is consistent with what you already reported.
> >> You noted that it doesn't affect RAID6, and RAID6 doesn't have an RMW cycle.
> >>
> >> Also, one of the early emails from Manibalan contained:
> >>
> >> handling stripe 273480328, state=0x2041 cnt=1, pd_idx=5, qd_idx=-1
> >> , check:0, reconstruct:0
> >> check 5: state 0x10 read           (null) write           (null) written           (null)
> >> check 4: state 0x11 read           (null) write           (null) written           (null)
> >> check 3: state 0x0 read           (null) write           (null) written           (null)
> >> check 2: state 0x11 read           (null) write           (null) written           (null)
> >> check 1: state 0x11 read           (null) write           (null) written           (null)
> >> check 0: state 0x18 read           (null) write ffff8808029b6b00 written           (null)
> >> locked=0 uptodate=3 to_read=0 to_write=1 failed=1 failed_num=3,-1
> >> force RCW max_degraded=1, recovery_cp=7036944 sh->sector=273480328
> >> for sector 273480328, rmw=2 rcw=1
> >>
> >> So it is forcing RCW, even though a single block update is usually handled
> >> with RMW.
> >>
> >> In this stripe, the parity disk is '5' and disk 3 has failed.
> >> That means to perform an RCW, we need to read the parity block in order
> >> to reconstruct the content of the failed disk.  And if we were to do that,
> >> we may as well just do an RMW.
> >>
> >> So I think the correct fix would be to only force RCW when the array
> >> is not degraded.
> >>
> >> So something like this:
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index aa76865b804b..fa8f8b94bfa8 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -3170,7 +3170,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> >>  	 * generate correct data from the parity.
> >>  	 */
> >>  	if (conf->max_degraded == 2 ||
> >> -	    (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
> >> +	    (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> >> +	     s->failed == 0)) {
> >>  		/* Calculate the real rcw later - for now make it
> >>  		 * look like rcw is cheaper
> >>  		 */
> >>
> >>
> >> I think reverting the whole patch is unnecessary, as that would also discard
> >> useful functionality while the array is not degraded.
> >>
> >> Can you test this patch please?
> >
> > Actually I just tried this one - I was on my way home, grabbed food
> > on the way, and thought there was a better solution than to revert.
> >
> > I'll give your solution a spin too.
> 
> I tried your patch; as expected, it also resolves the problem. Not sure
> which solution is better, so I'll let you pick.

Thanks!

> 
> Note that whichever patch you choose, it is applicable to stable-3.6+

3.6?? 

$ git describe --contains a7854487cd7128a30a7f4f5259
v3.7-rc1~10^2~7

so I assume you meant 3.7.
It doesn't apply to 3.6, so I'll treat that as a typo.

NeilBrown
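
To make the read-count argument above concrete, here is a stand-alone sketch.
It is deliberately simplified and is not the real handle_stripe_dirtying()
accounting in raid5.c; every struct, field and function name in it is invented
for illustration. Only the layout comes from the stripe dump quoted earlier:
six drives, parity on disk 5, disk 3 failed, a single-block write to disk 0,
and disks 1/2/4 already up to date in the stripe cache.

#include <stdbool.h>
#include <stdio.h>

#define NDISKS 6                 /* disks 0-4 hold data, disk 5 holds parity here */

struct stripe_example {
        int pd_idx;              /* which disk holds parity for this stripe   */
        bool failed[NDISKS];     /* disk has been kicked out of the array     */
        bool want_write[NDISKS]; /* block is being completely overwritten     */
        bool cached[NDISKS];     /* old block is already up to date in cache  */
};

/* read-modify-write: read the old data of every written block plus the
 * old parity, for whatever is not already cached */
static int rmw_reads(const struct stripe_example *sh)
{
        int reads = 0;

        for (int i = 0; i < NDISKS; i++)
                if ((sh->want_write[i] || i == sh->pd_idx) && !sh->cached[i])
                        reads++;
        return reads;
}

/* reconstruct-write: read the old data of every block that is NOT being
 * overwritten.  Returns -1 if one of those blocks sits on a failed disk:
 * rebuilding it needs the old parity plus the other old data blocks,
 * i.e. at least the reads an RMW would have done anyway. */
static int rcw_reads(const struct stripe_example *sh)
{
        int reads = 0;

        for (int i = 0; i < NDISKS; i++) {
                if (sh->want_write[i] || i == sh->pd_idx || sh->cached[i])
                        continue;
                if (sh->failed[i])
                        return -1;
                reads++;
        }
        return reads;
}

int main(void)
{
        /* the degraded stripe from the dump: parity on disk 5, disk 3 failed,
         * a single-block write to disk 0, disks 1/2/4 already up to date */
        struct stripe_example sh = {
                .pd_idx     = 5,
                .failed     = { [3] = true },
                .want_write = { [0] = true },
                .cached     = { [1] = true, [2] = true, [4] = true },
        };
        int rcw = rcw_reads(&sh);

        printf("RMW needs %d reads\n", rmw_reads(&sh));
        if (rcw < 0)
                printf("RCW needs a block that only existed on the failed disk;\n"
                       "rebuilding it requires the old parity, so RMW is no worse\n");
        else
                printf("RCW needs %d reads\n", rcw);
        return 0;
}

Against that layout the sketch reports two reads for RMW (the old block on
disk 0 and the old parity on disk 5), while RCW would need the old block on
the failed disk 3; recovering that block takes the old parity plus the
surviving old data, so forcing RCW gains nothing once the stripe is degraded -
the case the s->failed == 0 check in the patch above excludes.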


