Re: Re: [PATCH] raid5: Avoid doing more read on dev of a stripe at the same time

"Jianpeng Ma" <majianpeng@xxxxxxxxx> · Thu, 20 Sep 2012 11:04:46 +0800



On 2012-09-20 10:51 NeilBrown <neilb@xxxxxxx> Wrote:
>On Sat, 15 Sep 2012 10:20:35 +0800 "Jianpeng Ma" <majianpeng@xxxxxxxxx> wrote:
>
>> In func 'ops_run_bio' if you read the dev which the last reading
>> of this dev didn't return,it will destrory the  req/rreq'source of rdev.
>> It may call hung-task.
>> For example, for badsector or other reasons, read-operation only used
>> stripe instead of chunk_aligned_read.
>> First:stripe 0;second: stripe 8;third:stripe 16.At the block-layer,three
>> bios merged.
>> Because media error of sector from 0 to 7, the request retried.
>> At this time, raid5d readed stripe0 again.But it will set 'bio->next =
>> NULL'.So the stripe 8 and 16 didn't return.
>> 
>> Signed-off-by: Jianpeng Ma <majianpeng@xxxxxxxxx>
>
>Hi,
> I'm really trying, but I cannot understand what you are saying.
>
Sorry for my bad english.
>I think the situation that you are describing involves a 24 sector request.
>This is attached to 3 stripe_heads - 8 sectors each - at address 0, 8, 16.
>
>So 'toread' on the first device of each stripe points to this bio, and
>bi_next is NULL.
>
>The "req" bio for each device is filled out to read one page and these three
>'req' bios are submitted.  The block layer merges these into a single request.
>
>This request reports an error because there is a read error somewhere in the
>first  8 sectors.
>
Yes,
>So one, or maybe all, of the 'req' bios return with an error?
>From my test, when req did not return and at the same time, the bio(stripe 0) send.
So this operation will set bi_next is NULL.
Why am i mention the badsector? Because my disks which had badsector can reproduced easily.
For disks which didn't have badsector,i can't reproduced this bug.

>Then RAID5 marks that device as being bad and reads from the other devices to
>recover the data.
>
>Where exactly does something go wrong?  which bio gets bi_next set to NULL
>when it shouldn't?
>
>If you could explain with lots of detail it would help a lot.
>
The question is to ensure the first-read returned and the next to send on same dev of stripe.
>Thanks,
>NeilBrown
>
>
>> ---
>>  drivers/md/raid5.c |   13 ++++++++++---
>>  drivers/md/raid5.h |    1 +
>>  2 files changed, 11 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index adda94d..e52ba1b 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -547,9 +547,14 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>  				rw = WRITE_FUA;
>>  			else
>>  				rw = WRITE;
>> -		} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
>> -			rw = READ;
>> -		else if (test_and_clear_bit(R5_WantReplace,
>> +		} else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags)) {
>> +			/*The last read did not return,so skip this read*/
>> +			if (test_and_set_bit(R5_Reading, &sh->dev[i].flags)) {
>> +				clear_bit(R5_LOCKED, &sh->dev[i].flags);
>> +				continue;
>> +			} else
>> +				rw = READ;
>> +		} else if (test_and_clear_bit(R5_WantReplace,
>>  					    &sh->dev[i].flags)) {
>>  			rw = WRITE;
>>  			replace_only = 1;
>> @@ -700,6 +705,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>  			pr_debug("skip op %ld on disc %d for sector %llu\n",
>>  				bi->bi_rw, i, (unsigned long long)sh->sector);
>>  			clear_bit(R5_LOCKED, &sh->dev[i].flags);
>> +			clear_bit(R5_Reading, &sh->dev[i].flags);
>>  			set_bit(STRIPE_HANDLE, &sh->state);
>>  		}
>>  	}
>> @@ -1818,6 +1824,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
>>  	}
>>  	rdev_dec_pending(rdev, conf->mddev);
>>  	clear_bit(R5_LOCKED, &sh->dev[i].flags);
>> +	clear_bit(R5_Reading, &sh->dev[i].flags);
>>  	set_bit(STRIPE_HANDLE, &sh->state);
>>  	release_stripe(sh);
>>  }
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index a9fc249..a596df8 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -298,6 +298,7 @@ enum r5dev_flags {
>>  	R5_WantReplace, /* We need to update the replacement, we have read
>>  			 * data in, and now is a good time to write it out.
>>  			 */
>> +	R5_Reading,	/* this dev on reading on lld*/
>>  };
>>  
>>  /*
>
>?韬{.n?????%??檩??w?{.n???{炳盯w???塄}?财??j:+v??????2??璀??摺?囤??z夸z罐?+?????w棹f