Re: raid5d hangs when stopping an array during reshape

NeilBrown <neilb@xxxxxxxx> · Thu, 25 Feb 2016 11:31:04 +1100

On Thu, Feb 25 2016, Shaohua Li wrote:

>
> As for the bug: write requests run in raid5d, while mddev_suspend() waits for
> all IO, which in turn waits for those write requests. So this is a clear
> deadlock. I think we should delete the check_reshape() call in
> md_check_recovery(). If we change layout/disks/chunk_size, check_reshape() is
> already called. If we start an array, .run() already handles the new layout.
> There is no point in md_check_recovery() calling check_reshape() again.

Are you sure?
Did you look at the commit which added that code?
commit b4c4c7b8095298ff4ce20b40bf180ada070812d0

When there is an IO error, reshape (or resync or recovery) will abort
and then possibly be automatically restarted.

Without the check here a reshape might be attempted on an array which
has failed.  Not sure if that would be harmful, but it would certainly
be pointless.

But you are right that this is causing the problem.
Maybe we should keep track of the size of the 'scribble' arrays and only
call resize_chunks if the size needs to change?  Similar to what
resize_stripes does.
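
To make the idea concrete, here is a rough userspace sketch, not kernel
code. struct conf_sketch, scribble_bytes() and resize_scribble_if_needed()
are hypothetical names, the size formula is purely illustrative, and I am
assuming the buffer size depends only on the disk count and the chunk size.
The point is just that the suspend (modelled here by a counter) is skipped
when the recorded geometry already matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-array state; not the real struct r5conf. */
struct conf_sketch {
	void  *scribble;          /* scratch buffer */
	size_t scribble_disks;    /* disk count the buffer was sized for */
	size_t scribble_sectors;  /* chunk size (sectors) it was sized for */
	int    suspend_count;     /* times the array had to be quiesced */
};

/* Illustrative size formula only; the real calculation differs. */
static size_t scribble_bytes(size_t disks, size_t sectors)
{
	return disks * sectors * sizeof(void *);
}

/*
 * Resize the scratch buffer only when the geometry actually changed.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int resize_scribble_if_needed(struct conf_sketch *conf,
				     size_t disks, size_t sectors)
{
	void *buf;

	assert(disks > 0 && sectors > 0);

	/* Fast path: already the right size, so no suspend and no IO wait. */
	if (conf->scribble &&
	    conf->scribble_disks == disks &&
	    conf->scribble_sectors == sectors)
		return 0;

	/* Slow path: stands in for the mddev_suspend()/mddev_resume() pair. */
	conf->suspend_count++;
	buf = realloc(conf->scribble, scribble_bytes(disks, sectors));
	if (!buf)
		return -1;

	conf->scribble = buf;
	conf->scribble_disks = disks;
	conf->scribble_sectors = sectors;
	return 0;
}
```

With something like this, the repeated check from md_check_recovery() would
hit the fast path whenever the geometry has not changed, and so would never
reach the suspend while running in raid5d.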

It might also be good to put something like
  WARN_ON(current == mddev->thread->task);
in mddev_suspend() ... or whatever code would cause this sort of error
to trigger a warning early.
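
As a userspace illustration of what that check would catch (a sketch only:
struct task_sketch, struct mddev_sketch and mddev_suspend_sketch() are
made-up names, and the NULL test on the thread pointer is my assumption
about what a real version might need):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the kernel's task_struct or struct mddev. */
struct task_sketch {
	const char *name;
};

struct mddev_sketch {
	const struct task_sketch *thread_task; /* the array's own daemon */
	int suspended;                         /* successful suspends */
	int warnings;                          /* self-suspend attempts seen */
};

/* True when the caller is the thread that must complete the array's IO. */
static bool suspend_would_self_deadlock(const struct mddev_sketch *mddev,
					const struct task_sketch *current_task)
{
	return mddev->thread_task != NULL &&
	       current_task == mddev->thread_task;
}

/*
 * Models the suspend entry point. The kernel WARN_ON would only report;
 * this sketch also refuses, so that the effect is easy to observe.
 * Returns 0 if the suspend went ahead, -1 if it was refused.
 */
static int mddev_suspend_sketch(struct mddev_sketch *mddev,
				const struct task_sketch *current_task)
{
	if (suspend_would_self_deadlock(mddev, current_task)) {
		mddev->warnings++;
		fprintf(stderr,
			"WARN: suspend called from %s, which services this array's IO\n",
			current_task->name);
		return -1;
	}
	mddev->suspended++;
	return 0;
}
```

A suspend requested from any other task goes through; one requested from
the array's own thread is flagged immediately instead of hanging later.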

Thanks,
NeilBrown

>
> Artur, can you check if below works for you?
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 464627b..7fb1103 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8408,8 +8408,7 @@ void md_check_recovery(struct mddev *mddev)
>  		 */
>  
>  		if (mddev->reshape_position != MaxSector) {
> -			if (mddev->pers->check_reshape == NULL ||
> -			    mddev->pers->check_reshape(mddev) != 0)
> +			if (mddev->pers->check_reshape == NULL)
>  				/* Cannot proceed */
>  				goto not_running;
>  			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
>
> Thanks,
> Shaohua